AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
ReferDINO can segment objects in videos using text descriptions. It improves segmentation accuracy with a dedicated mask decoder and better captures object motion over time.
XVerse can create high-quality images with multiple subjects that can be edited. It allows precise control over each subject’s pose, style, and lighting, while also reducing issues like attribute entanglement and artifacts.
ThinkSound can generate sound from video, guided either by a caption or by Chain-of-Thought reasoning.
Matrix-Game can generate high-quality interactive game worlds in Minecraft.
OmniAvatar can generate lifelike full-body avatar videos from audio. It offers accurate lip-syncing and natural movements, and allows for precise control over emotions and backgrounds.
GaVS can stabilize videos by reconstructing and rendering them in 3D.
Text-Aware Image Restoration can restore images and retain the accuracy of text in them.
ControlMM can generate high-quality motion in real-time by using spatial control signals in a motion model. It is 20 times faster than other methods and can control body parts, timelines, and avoid obstacles.
OmniSep can isolate clean soundtracks from mixed audio using text, images, and audio queries.
SwiftEdit can edit images from text prompts in just 0.23 seconds.
LayoutVLM can generate 3D layouts from text instructions. It improves how well layouts match the intended design and works effectively in crowded spaces.
AnchorCrafter can generate high-quality 2D videos of people interacting with a reference product.
Hunyuan3D 2.1 can generate high-quality 3D assets from images through shape generation and texture synthesis.
PosterCraft can generate high-quality aesthetic posters by improving how text and art work together.
PartPacker can generate high-quality 3D objects with many meaningful parts from a single image.
MIMO can create controllable character videos from a single image. It allows users to animate characters with complex motions in real-world scenes by encoding 2D videos into 3D spatial codes for flexible control.
GEN3C can generate photorealistic videos from single or sparse-view images while keeping camera control and 3D consistency.
LinGen can generate high-resolution minute-length videos on a single GPU.
MultiTalk can generate videos of multiple people talking by using audio from different sources, a reference image, and a prompt.
D3-Human can reconstruct detailed 3D human figures from single videos. It separates clothing and body shapes, handles occlusions well, and is useful for clothing transfer and animation.