AI Toolbox
A curated collection of 813 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

InterMask can generate high-quality 3D human interactions from text descriptions. It captures complex movements between two people and also supports reaction generation (synthesizing one person's motion in response to the other's) without any changes to the model.
Photometric Inverse Rendering can estimate light positions and surface reflections in images, including tricky cast shadows. It decomposes surface reflectance better than competing methods and performs well on both synthetic and real images.
KDTalker can generate high-quality talking portraits from a single image and audio input. It captures fine facial details and achieves excellent lip synchronization using a 3D keypoint-based approach and a spatiotemporal diffusion model.
MagicColor can automatically colorize multi-instance sketches, using reference images to keep colors consistent across objects.
TreeMeshGPT can generate detailed 3D meshes from point clouds using Autoregressive Tree Sequencing. This technique allows for better mesh detail and achieves about a 22% reduction in the length of the generated token sequence.
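The "tree sequencing" idea can be pictured as serializing mesh faces along a traversal of the face-adjacency graph, so each step builds on faces already emitted. A toy sketch of that ordering (illustrative only, not TreeMeshGPT's actual tokenizer):

```python
# Toy face sequencing via DFS over face adjacency (NOT TreeMeshGPT's
# actual tokenizer): faces sharing an edge are emitted parent-then-child.
from collections import defaultdict

def tree_sequence(faces):
    """faces: list of (v0, v1, v2) vertex-index triples."""
    # Map each undirected edge to the faces that contain it.
    edge_to_faces = defaultdict(list)
    for fi, (a, b, c) in enumerate(faces):
        for e in ((a, b), (b, c), (c, a)):
            edge_to_faces[tuple(sorted(e))].append(fi)

    visited, tokens = set(), []
    stack = [0]  # start DFS at an arbitrary root face
    while stack:
        fi = stack.pop()
        if fi in visited:
            continue
        visited.add(fi)
        tokens.append(faces[fi])  # a real tokenizer would emit a compressed local form
        a, b, c = faces[fi]
        for e in ((a, b), (b, c), (c, a)):
            for nb in edge_to_faces[tuple(sorted(e))]:
                if nb not in visited:
                    stack.append(nb)
    return tokens

# Two triangles sharing an edge serialize as a parent/child pair.
print(tree_sequence([(0, 1, 2), (1, 2, 3)]))
```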
Mobius can generate seamlessly looping videos from text descriptions.
DART can generate high-quality human motions in real-time, achieving over 300 frames per second on a single RTX 4090 GPU. It combines text inputs with spatial constraints, allowing for tasks like reaching waypoints and interacting with scenes.
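One way to picture combining a motion prior with spatial constraints is a rollout loop that nudges each predicted pose toward a waypoint. A toy sketch of that guidance idea (the prior here is a dummy; DART's actual model and guidance are more sophisticated):

```python
# Toy illustration of steering a motion prior toward a spatial goal
# (invented for illustration; this is not DART's model or guidance).
import numpy as np

def rollout(prior_step, waypoint, n_frames=120, guidance=0.05):
    pose = np.zeros(3)  # root position only, for simplicity
    traj = []
    for _ in range(n_frames):
        pose = prior_step(pose)                     # proposal from the motion prior
        pose = pose + guidance * (waypoint - pose)  # nudge toward the waypoint
        traj.append(pose)
    return np.stack(traj)

# Dummy "prior": a small random drift.
rng = np.random.default_rng(0)
traj = rollout(lambda p: p + 0.02 * rng.standard_normal(3),
               waypoint=np.array([1.0, 0.0, 1.0]))
print(traj[-1])  # ends close to the waypoint
```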
MelQCD can create realistic audio tracks that match silent videos. It achieves high quality and synchronization by breaking down mel-spectrograms into different signal types and using a video-to-all (V2X) predictor.
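For context, the mel-spectrogram is the time-frequency representation that MelQCD decomposes; computing one with librosa is a few lines (the decomposition into signal types is the paper's contribution and is not shown here):

```python
# Minimal mel-spectrogram sketch with librosa; MelQCD's factorization of
# this representation into separate signal components is not reproduced.
import librosa
import numpy as np

y, sr = librosa.load(librosa.example("trumpet"))  # demo clip shipped with librosa
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
mel_db = librosa.power_to_db(mel, ref=np.max)     # log scale, as models usually consume it
print(mel_db.shape)  # (n_mels, n_frames)
```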
MovieAgent can generate long-form videos with multiple scenes and shots from a script and character bank. It ensures character consistency and synchronized subtitles while reducing the need for human input in movie production.
Make-It-Animatable can auto-rig any 3D humanoid model for animation in under one second. It generates high-quality blend weights and bones, and works with various 3D formats, ensuring accuracy even for non-standard skeletons.
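The blend weights it predicts are the per-vertex, per-bone weights of standard linear blend skinning, where each vertex is deformed by a weighted sum of bone transforms. A minimal NumPy sketch of that deformation model (the textbook formula, not the paper's code):

```python
# Linear blend skinning: v' = sum_b w[v, b] * T_b @ v. The "blend weights"
# an auto-rigger predicts are the w below (toy example, not the paper's code).
import numpy as np

def lbs(vertices, weights, bone_mats):
    """vertices: (V, 3); weights: (V, B), rows sum to 1; bone_mats: (B, 4, 4)."""
    hom = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # (V, 4)
    per_bone = np.einsum("bij,vj->vbi", bone_mats, hom)   # each bone's transform of v
    blended = np.einsum("vb,vbi->vi", weights, per_bone)  # weighted sum over bones
    return blended[:, :3]

V, B = 4, 2
verts = np.random.rand(V, 3)
w = np.random.rand(V, B); w /= w.sum(axis=1, keepdims=True)  # normalize weights
mats = np.stack([np.eye(4)] * B)  # identity bones leave the mesh unchanged
assert np.allclose(lbs(verts, w, mats), verts)
```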
AnCoGen can analyze and generate speech by estimating key attributes like speaker identity, pitch, and loudness. It can also perform tasks such as speech denoising, pitch shifting, and voice conversion using a unified masked autoencoder model.
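The masked-autoencoder idea behind this is simple to sketch: hide a random subset of input frames and train the model to reconstruct them. A minimal PyTorch sketch with a placeholder architecture (not AnCoGen's):

```python
# Minimal masked-autoencoder training step (placeholder architecture,
# not AnCoGen's): mask random spectrogram frames, reconstruct them.
import torch
import torch.nn as nn

D = 80  # e.g. mel bins per frame
model = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, D))

x = torch.randn(16, 100, D)           # (batch, frames, features)
mask = torch.rand(16, 100, 1) < 0.75  # hide 75% of frames
x_masked = x.masked_fill(mask, 0.0)

recon = model(x_masked)
loss = (((recon - x) ** 2) * mask).mean()  # loss counts only the hidden frames
loss.backward()
print(float(loss))
```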
Chrono can track points in videos using a feature backbone with built-in temporal awareness.
VIRES can repaint, replace, generate, and remove objects in videos using sketches and text.
Diffusion VAS can generate amodal segmentation masks for the occluded parts of objects in videos.
TRG can estimate 6DoF head translations and rotations by leveraging the synergy between facial geometry and head pose.
StdGEN can generate high-quality 3D characters from a single image in just three minutes. It decomposes characters into semantic components like body, clothes, and hair, using a transformer-based model to achieve strong results in 3D anime character generation.
Spark-TTS can generate customizable voices with control over gender, speaking style, pitch, and rate. It also supports zero-shot voice cloning, allowing smooth language transitions without extra training for each voice.
So far it has been tough to imagine the benefits of AI agents; most of what we’ve seen from that domain has focused on NPC simulations or solving text-based goals. 3D-GPT is a new framework that uses LLMs for instruction-driven 3D modeling: it breaks the modeling task into manageable segments and procedurally generates 3D scenes from them. I recently started to dig into Blender, and I pray this gets open-sourced at some point.
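Conceptually, the pipeline boils down to the LLM emitting a structured plan that a procedural backend executes. Here is a toy sketch of that idea against Blender's bpy API, with a hard-coded plan standing in for the LLM output (the plan schema is invented; 3D-GPT's agents and interface are more involved):

```python
# Toy instruction-driven procedural modeling: an LLM would emit a plan like
# this; here it is hard-coded. Run inside Blender's Python console.
# (The plan schema is invented for illustration; it is not 3D-GPT's format.)
import bpy

plan = [  # imagine this came back from the LLM as JSON
    {"op": "cube",   "size": 2.0,   "location": (0, 0, 1)},
    {"op": "sphere", "radius": 0.5, "location": (0, 0, 2.5)},
]

for step in plan:
    if step["op"] == "cube":
        bpy.ops.mesh.primitive_cube_add(size=step["size"], location=step["location"])
    elif step["op"] == "sphere":
        bpy.ops.mesh.primitive_uv_sphere_add(radius=step["radius"], location=step["location"])
```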
VideoMaker can generate personalized videos from a single subject reference image.
Generative Photography can generate consistent images from text with an understanding of camera physics. It controls camera settings such as bokeh and color temperature to produce scene-consistent images across different effects.