AI Toolbox
A curated collection of 610 free, cutting-edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.
MoMo is a new video frame interpolation method that is able to generate intermediate frames with high visual quality and reduced computational demands.
FreeTraj is a tuning-free approach that enables trajectory control in video diffusion models by modifying noise sampling and attention mechanisms.
Portrait3D can generate high-quality 3D heads with accurate geometry and texture from a single in-the-wild portrait image.
MIRReS can reconstruct and optimize the explicit geometry, material, and lighting of objects from multi-view images. The resulting 3D models can be edited and relit in modern graphics engines or CAD software.
LiveScene can identify and control multiple objects in complex scenes. It is able to locate individual objects in different states and enables control of them using natural language.
MVOC is a training-free multiple video object composition method with diffusion models. The method can be used to composite multiple video objects into a single video while maintaining motion and identity consistency.
Conditional Image Leakage can be used to generate videos with more dynamic and natural motion from image prompts.
Image Conductor can generate video assets from a single image with precise control over camera transitions and object movements.
Mora enables generalist video generation through a multi-agent framework. It supports text-to-video generation, video editing, and digital world simulation, achieving performance similar to the Sora model.
iCD can be used for zero-shot text-guided image editing with diffusion models. The method is able to encode real images into their latent space in only 3-4 inference steps and can then be used to edit the image with a text prompt.
EvTexture is a video super-resolution method that uses event signals to enhance texture, recovering more accurate textures and high-resolution detail.
Make It Count can generate images with the exact number of objects specified in the prompt while keeping a natural layout. It uses the diffusion model to accurately count and separate objects during the image creation process.
Glyph-ByT5-v2 is a new SDXL model that can generate high-quality visual layouts with text in 10 different languages.
MeshAnything can convert 3D assets in any 3D representation into meshes. This can be used to enhance various 3D asset production methods and significantly improve storage, rendering, and simulation efficiencies.
GradeADreamer is yet another text-to-3D method. It can produce high-quality assets with a total generation time of under 30 minutes using only a single RTX 3090 GPU.
HairFastGAN can transfer hairstyles from one image to another in near real-time. It handles different poses and colors well, achieving high quality in under a second on an Nvidia V100.
MM-Diffusion can generate high-quality audio-video pairs using a multi-modal diffusion model with two coupled denoising autoencoders.
DMD2 is a new, improved distillation method that can turn diffusion models into efficient one-step image generators.
EditWorld can simulate world dynamics and edit images based on instructions that are grounded in various world scenarios. The method is able to add, replace, delete, and move objects in images, as well as change their attributes and perform other operations.
RectifID is yet another personalization method for diffusion models. It works from user-provided reference images of human faces, live subjects, and certain objects.