AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
Vevo can imitate voices without needing training data for the specific target speaker. It can change accent and emotion while keeping output quality high, using a self-supervised method that disentangles different speech features.
Argus3D can generate 3D meshes from images and text prompts, as well as unique textures for its generated shapes. Just imagine composing a 3D scene and filling it with objects by pointing at a space and describing in natural language what you want to place there.
AudioEditing comprises two techniques for editing audio. The first enables text-based editing, while the second discovers semantically meaningful editing directions without supervision.
Magic-Me can generate identity-specific videos from a few reference images while keeping the person’s features clear.
Continuous 3D Words is a control method that can modify attributes in images with a slider-based approach. This allows for finer control over attributes such as illumination, non-rigid shape changes (like wings), and camera orientation.
GALA3D is a text-to-3D method that can generate complex scenes with multiple objects and control their placement and interaction. The method uses large language models to generate initial layout descriptions and then optimizes the 3D scene with conditioned diffusion to make it more realistic.
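To make that layout stage concrete, here is a minimal Python sketch of the kind of per-object layout an LLM might propose before the diffusion-based refinement. The schema, prompt, and values are illustrative assumptions, not GALA3D's actual format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ObjectLayout:
    """One object in a coarse scene layout an LLM might propose
    (hypothetical schema; GALA3D's actual layout format may differ)."""
    name: str
    center: tuple[float, float, float]   # position in scene coordinates (meters)
    size: tuple[float, float, float]     # bounding-box extents
    rotation_deg: float                  # yaw around the up axis

LAYOUT_PROMPT = (
    "Propose a 3D layout for: 'a cozy living room with a sofa, a coffee table "
    "and a floor lamp'. Return JSON objects with name, center, size, rotation_deg."
)

# A plausible LLM response, hand-written here for illustration. Each box would
# then seed a per-object 3D generation that is refined with conditioned diffusion.
layout = [
    ObjectLayout("sofa",         (0.0, 0.0, -1.5), (2.0, 0.9, 0.9), 0.0),
    ObjectLayout("coffee table", (0.0, 0.0,  0.0), (1.0, 0.4, 0.6), 0.0),
    ObjectLayout("floor lamp",   (1.2, 0.0, -1.2), (0.4, 1.6, 0.4), 0.0),
]
print(json.dumps([asdict(o) for o in layout], indent=2))
```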
LGM can generate high-resolution 3D models from text prompts or single-view images. It uses a fast multi-view Gaussian representation, producing models in under 5 seconds while maintaining high quality.
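As a rough illustration of what a Gaussian-splat representation carries, here is a minimal container sketch. The field names and shapes are assumptions for illustration, not LGM's actual output format.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianSplats:
    """Minimal container for a set of 3D Gaussians, the kind of
    representation a splat-based generator predicts (illustrative only)."""
    means: np.ndarray      # (N, 3) Gaussian centers
    scales: np.ndarray     # (N, 3) per-axis extents
    rotations: np.ndarray  # (N, 4) unit quaternions
    opacities: np.ndarray  # (N, 1) values in [0, 1]
    colors: np.ndarray     # (N, 3) RGB (or spherical-harmonic coefficients)

# A random set of 1024 Gaussians, just to show the shapes involved.
n = 1024
splats = GaussianSplats(
    means=np.random.randn(n, 3).astype(np.float32),
    scales=np.full((n, 3), 0.01, dtype=np.float32),
    rotations=np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)).astype(np.float32),
    opacities=np.ones((n, 1), dtype=np.float32),
    colors=np.random.rand(n, 3).astype(np.float32),
)
```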
ConsistI2V is an image-to-video method with enhanced visual consistency. Compared to other methods, it better maintains the subject, background, and style from the first frame and ensures a fluid, logical progression, while also supporting long video generation and camera motion control.
Direct-a-Video can individually or jointly control camera movement and object motion in text-to-video generations. This means you can generate a video and tell the model to move the camera from left to right, zoom in or out and move objects around in the scene.
Video-LaVIT is a multi-modal video-language method that can comprehend and generate image and video content and supports long video generation.
InterScene is a novel framework that enables physically simulated characters to perform long-term interaction tasks in diverse, cluttered, and unseen scenes. Another step closer to completely dynamic game worlds and simulations. Check out the impressive demo below.
AToM is a text-to-mesh framework that can generate high-quality textured 3D meshes from text prompts in less than a second. The method is optimized across multiple prompts and can create diverse objects it wasn't trained on.
Last year we got real-time diffusion for images; this year we'll get it for video! AnimateLCM can generate high-fidelity videos with minimal steps. The model also supports image-to-video generation and adapters like ControlNet. It's not available yet, but once it hits, expect way more AI-generated video content.
SEELE can move objects around within an image. It does so by removing the object, inpainting the occluded portions of the background, and harmonizing the repositioned object's appearance with the surrounding area.
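As a rough sketch of that remove-inpaint-recomposite idea, the snippet below uses classical OpenCV inpainting and a simple pixel shift as stand-ins for SEELE's learned inpainting, completion, and harmonization models; the function and its signature are hypothetical.

```python
import cv2
import numpy as np

def reposition_object(image: np.ndarray, mask: np.ndarray, dy: int, dx: int) -> np.ndarray:
    """Toy remove -> inpaint -> recomposite pipeline.

    image: HxWx3 uint8 photo, mask: HxW uint8 (255 where the object is).
    Classical inpainting stands in for the learned models used in practice.
    """
    # 1. Remove the object and fill the revealed background.
    background = cv2.inpaint(image, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)

    # 2. Shift the object (and its mask) to the target location.
    moved_obj = np.roll(image, shift=(dy, dx), axis=(0, 1))
    moved_mask = np.roll(mask, shift=(dy, dx), axis=(0, 1)).astype(bool)

    # 3. Composite at the new position; a learned model would additionally
    #    complete occluded parts and harmonize lighting and shadows here.
    out = background.copy()
    out[moved_mask] = moved_obj[moved_mask]
    return out
```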
Motion-I2V can generate videos from images with clear and controlled motion. It uses a two-stage process with a motion field predictor and temporal attention, allowing for precise control over how things move and enabling video-to-video translation without needing extra training.
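To make that two-stage idea concrete, here is a toy PyTorch skeleton: a first stage predicts per-frame motion fields from the input image, and a second stage decodes frames conditioned on them. The tiny conv layers and shapes are placeholders, not Motion-I2V's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStageImageToVideo(nn.Module):
    """Toy skeleton of a two-stage image-to-video design:
    stage 1 predicts dense (dx, dy) motion fields for each frame,
    stage 2 renders frames conditioned on the image and its motion."""

    def __init__(self, num_frames: int = 16):
        super().__init__()
        self.num_frames = num_frames
        # Stage 1: predict a 2-channel motion field per frame (placeholder layer).
        self.motion_predictor = nn.Conv2d(3, 2 * num_frames, kernel_size=3, padding=1)
        # Stage 2: decode each frame from the image plus its predicted motion.
        self.frame_decoder = nn.Conv2d(3 + 2, 3, kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        b, _, h, w = image.shape
        flows = self.motion_predictor(image).view(b, self.num_frames, 2, h, w)
        frames = [self.frame_decoder(torch.cat([image, flows[:, t]], dim=1))
                  for t in range(self.num_frames)]
        return torch.stack(frames, dim=1)  # (B, T, 3, H, W)

video = TwoStageImageToVideo()(torch.rand(1, 3, 64, 64))  # -> (1, 16, 3, 64, 64)
```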
StableIdentity is a method that can generate diverse customized images in various contexts from a single input image. The cool thing about this method is that it can combine the learned identity with ControlNet and even inject it into video (ModelScope) and 3D (LucidDreamer) generation.
pix2gestalt is able to estimate the shape and appearance of whole objects that are only partially visible behind occlusions.
GALA can take a single-layer clothed 3D human mesh and decompose it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create new clothed human avatars in any pose.
Depth Anything is a new monocular depth estimation method. The model is trained on 1.5M labeled images and 62M+ unlabeled images, which results in impressive generalization ability.
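If you just want depth maps from your own photos, a minimal sketch using the Hugging Face transformers depth-estimation pipeline is shown below; the checkpoint id is an assumption and may differ from the one you want to use.

```python
from transformers import pipeline
from PIL import Image

# Load a Depth Anything checkpoint through the generic depth-estimation pipeline.
# The model id below is one of the community checkpoints and may differ.
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

image = Image.open("scene.jpg")      # any RGB photo
result = depth_estimator(image)

depth_map = result["depth"]          # PIL image with relative per-pixel depth
depth_map.save("scene_depth.png")
```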
Language-Driven Video Inpainting can guide the video inpainting process using natural language instructions, which removes the need for manual mask labeling.