AI Toolbox
A curated collection of 734 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

ARTalk can generate realistic 3D head motions, including lip synchronization, blinking, and facial expressions, from audio in real time.
InfiniteYou can generate high-quality images with FLUX and retain a person’s identity.
StarVector can generate scalable vector graphics (SVG) code from pixel images.
Diptych Prompting can generate images of new subjects in specific contexts by treating text-to-image generation as an inpainting task.
MotionStreamer can generate human motions based on text prompts and supports motion composition and longer motion generation. Also has a Blender plugin.
Thera can upscale images to super-resolution using neural heat fields that model a precise point spread function, which allows for correct anti-aliasing at any output size.
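The core idea of scale-aware anti-aliasing can be illustrated generically: low-pass filter with a point spread function whose width is tied to the target sampling step before resampling. The sketch below is a minimal 1-D NumPy illustration of that principle, not Thera's actual model; the function names and the Gaussian PSF choice are assumptions for illustration only.

```python
import numpy as np

def gaussian_psf(sigma, radius):
    """1-D Gaussian point spread function, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def resample_1d(signal, scale, sigma_per_step=0.5):
    """Downsample a 1-D signal with PSF prefiltering to avoid aliasing.

    The PSF width scales with the output sampling step (1 / scale), so the
    filter adapts to any target size -- a generic sketch of the principle,
    not Thera's implementation.
    """
    step = 1.0 / scale                       # input samples per output sample
    sigma = max(sigma_per_step * step, 1e-6)
    radius = max(int(3 * sigma), 1)
    blurred = np.convolve(signal, gaussian_psf(sigma, radius), mode="same")
    idx = np.arange(0, len(signal), step)
    idx = idx[idx < len(signal)].astype(int)
    return blurred[idx]

# A high-frequency sine downsampled 4x: naive subsampling would alias,
# while the PSF prefilter suppresses the unrepresentable frequency.
t = np.linspace(0, 1, 400)
sig = np.sin(2 * np.pi * 90 * t)
out = resample_1d(sig, scale=0.25)
```

Because the filter width is derived from the output step rather than hard-coded, the same routine stays correct for any scale factor, which is the property the entry above highlights.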
DreamRenderer extends FLUX with image content control using bounding boxes or masks.
InterMask can generate high-quality 3D human interactions from text descriptions. It captures complex movements between two people while also allowing for reaction generation without changing the model.
Photometric Inverse Rendering can estimate light positions and reflections in images, including challenging shadows. The method decomposes surface reflections better than other tools and works well on both synthetic and real images.
KDTalker can generate high-quality talking portraits from a single image and audio input. It captures fine facial details and achieves excellent lip synchronization using a 3D keypoint-based approach and a spatiotemporal diffusion model.
MagicColor can automatically colorize multi-instance sketches while keeping colors consistent across objects using reference images.
TreeMeshGPT can generate detailed 3D meshes from point clouds using Autoregressive Tree Sequencing. This technique allows for better mesh detail and achieves a 22% reduction in data size during processing.
Mobius can generate seamlessly looping videos from text descriptions.
DART can generate high-quality human motions in real time, achieving over 300 frames per second on a single RTX 4090 GPU. It combines text inputs with spatial constraints, allowing for tasks like reaching waypoints and interacting with scenes.
MelQCD can create realistic audio tracks that match silent videos. It achieves high quality and synchronization by breaking down mel-spectrograms into different signal types and using a video-to-all (V2X) predictor.
MovieAgent can generate long-form videos with multiple scenes and shots from a script and character bank. It ensures character consistency and synchronized subtitles while reducing the need for human input in movie production.
Make-It-Animatable can auto-rig any 3D humanoid model for animation in under one second. It generates high-quality blend weights and bones, and works with various 3D formats, ensuring accuracy even for non-standard skeletons.
Chrono can track points in videos with built-in temporal awareness.
VIRES can repaint, replace, generate, and remove objects in videos using sketches and text.
Diffusion VAS can generate masks for hidden parts of objects in videos.