AI Toolbox
A curated collection of 692 free cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

TRG can estimate 6DoF head translations and rotations by leveraging the synergy between facial geometry and head pose.
StdGEN can generate high-quality 3D characters from a single image in just three minutes. It decomposes characters into semantic parts like body, clothes, and hair, using a transformer-based model, and delivers strong results in 3D anime character generation.
Spark-TTS can generate customizable voices with control over gender, speaking style, pitch, and rate. It also supports zero-shot voice cloning, allowing smooth language transitions without extra training for each voice.
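To make that control surface concrete, here is a minimal sketch of how such attribute-based requests could be assembled. The class and field names are assumptions for illustration, not Spark-TTS's actual API, and the synthesis call itself is omitted.

```python
# Hypothetical sketch of attribute-controlled TTS in the spirit of Spark-TTS.
# Field names and value ranges are assumptions, not the real interface.
from dataclasses import dataclass, asdict

@dataclass
class VoiceControls:
    gender: str = "female"    # assumed: "male" | "female"
    style: str = "narration"  # assumed speaking-style tag
    pitch: str = "moderate"   # assumed coarse level: low / moderate / high
    rate: str = "moderate"    # assumed coarse level: slow / moderate / fast

def build_request(text: str, controls: VoiceControls) -> dict:
    """Pack the text and control attributes into one generation request."""
    return {"text": text, **asdict(controls)}

req = build_request("The quick brown fox jumps over the lazy dog.",
                    VoiceControls(gender="male", pitch="high", rate="fast"))
print(req)  # this dict would be handed to the (omitted) synthesis model
```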
So far it has been tough to imagine the benefits of AI agents. Most of what we've seen from that domain has focused on NPC simulations or solving text-based goals. 3D-GPT is a new framework that uses LLMs for instruction-driven 3D modeling, breaking modeling tasks into manageable segments and generating 3D scenes procedurally (see the sketch below). I recently started to dig into Blender and I pray this gets open-sourced at some point.
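A minimal sketch of the core idea, assuming a planner that maps instruction fragments to Blender snippets. In the real framework an LLM does both the planning and the code writing; the templates and keyword matching below are invented for illustration.

```python
# Toy sketch of instruction-driven procedural modeling à la 3D-GPT.
# The subtask templates and the naive keyword planner are illustrative
# assumptions; the real framework prompts an LLM to plan and write this code.
SUBTASK_TEMPLATES = {
    "terrain": "bpy.ops.mesh.primitive_plane_add(size=50)",
    "tree":    "bpy.ops.mesh.primitive_cone_add(radius1=1, depth=4)",
    "rock":    "bpy.ops.mesh.primitive_ico_sphere_add(radius=0.5)",
}

def plan(instruction: str) -> list[str]:
    """Stand-in for the LLM planner: pick subtasks mentioned in the text."""
    return [name for name in SUBTASK_TEMPLATES if name in instruction.lower()]

def to_blender_script(instruction: str) -> str:
    """Emit a Blender Python script that builds the requested scene."""
    lines = ["import bpy"] + [SUBTASK_TEMPLATES[t] for t in plan(instruction)]
    return "\n".join(lines)

print(to_blender_script("A rocky terrain with a single tree"))
```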
VideoMaker can generate personalized videos from a single subject reference image.
Generative Photography can generate consistent images from text with an understanding of camera physics. The method controls camera settings like bokeh and color temperature to produce consistent images with different effects.
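As a rough illustration of treating camera settings as model inputs, here is a sketch that normalizes two such settings into conditioning features. The ranges and the encoding are assumptions, not the paper's actual scheme.

```python
# Sketch: encoding physical camera settings as conditioning features for a
# text-to-image model. The normalization ranges here are assumptions.
import math

def encode_camera(f_number: float, color_temp_k: float) -> list[float]:
    """Map camera settings into [0, 1] features a generator could condition on."""
    # Aperture on a log scale: f/1.0 (strong bokeh) .. f/22 (deep focus).
    aperture = math.log2(f_number) / math.log2(22.0)
    # Color temperature: 2000 K (warm) .. 10000 K (cool).
    temp = (color_temp_k - 2000.0) / 8000.0
    return [min(max(aperture, 0.0), 1.0), min(max(temp, 0.0), 1.0)]

print(encode_camera(f_number=1.8, color_temp_k=3200))   # shallow focus, warm
print(encode_camera(f_number=11.0, color_temp_k=6500))  # deep focus, daylight
```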
Dream Engine can generate images by combining different concepts from reference images.
ImageRAG can retrieve images relevant to a text prompt and use them as references to improve image generation. It helps create rare and fine-grained concepts without special training, making it useful across different image models.
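The retrieval step is the heart of the idea. A toy sketch, assuming CLIP-style embeddings; random vectors stand in for a real encoder here.

```python
# Toy sketch of the retrieval step in an ImageRAG-style pipeline: rank a
# reference-image collection by similarity to the text prompt, then pass the
# top hits to the generator as references. Embeddings are random stand-ins
# for real CLIP text/image features (an assumption for brevity).
import numpy as np

rng = np.random.default_rng(0)
image_db = {f"ref_{i:03d}.png": rng.standard_normal(512) for i in range(100)}

def embed_text(prompt: str) -> np.ndarray:
    return rng.standard_normal(512)  # placeholder for a real text encoder

def retrieve(prompt: str, k: int = 3) -> list[str]:
    q = embed_text(prompt)
    q /= np.linalg.norm(q)
    scores = {name: float(v @ q / np.linalg.norm(v))
              for name, v in image_db.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

refs = retrieve("a shoebill stork standing in a marsh")
print(refs)  # these references would be injected into the generation call
```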
InsTaG can generate realistic 3D talking heads from just a few seconds of video.
Phidias can generate high-quality 3D assets from text, images, and 3D references. It uses a method called reference-augmented diffusion to improve quality and speed, achieving results in just a few seconds.
EventEgo3D++ can capture 3D human motion using a monocular event camera with a fisheye lens. It works well in low-light and high-speed conditions, providing real-time 3D pose updates at 140Hz with higher accuracy than RGB-based methods.
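To picture the data flow, here is a toy sketch that bins an asynchronous event stream into fixed-rate windows matching the 140Hz figure above. The event format (timestamp, x, y, polarity) is standard for event cameras; the resolution and the pose estimator are placeholder assumptions.

```python
# Sketch: binning an asynchronous event stream into fixed-rate update windows
# for a pose estimator. Synthetic events stand in for real camera output.
import numpy as np

HZ = 140
WINDOW = 1.0 / HZ
rng = np.random.default_rng(2)

# Synthetic events: (timestamp_s, x, y, polarity)
t = np.sort(rng.uniform(0, 0.05, 5000))
events = np.column_stack([t,
                          rng.integers(0, 640, 5000),
                          rng.integers(0, 480, 5000),
                          rng.choice([-1, 1], 5000)])

start = 0.0
while start < t[-1]:
    sel = (events[:, 0] >= start) & (events[:, 0] < start + WINDOW)
    frame = events[sel]        # events falling in one ~7.1 ms update window
    # pose = estimator(frame)  # placeholder: run the 3D pose network here
    start += WINDOW
print(f"{HZ} pose updates/s -> {WINDOW * 1000:.1f} ms per window")
```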
Cyberpunk braindances are becoming a thing! D-NPC can turn videos into dynamic neural point clouds, aka 4D scenes, which makes it possible to watch a scene from another perspective.
Distill Any Depth can generate depth maps from single images.
GHOST 2.0 is a deepfake method that can transfer heads from one image to another while keeping the skin color and structure intact.
KV-Edit can edit images while keeping the background consistent. It allows users to add, remove, or change objects without needing extra training, ensuring high image quality.
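Conceptually, tokens outside the user's edit mask keep their cached keys/values while only the masked tokens are re-generated. A simplified sketch of that selective update follows; the shapes and the update rule are stand-ins, since the real method works inside a diffusion transformer's attention layers.

```python
# Conceptual sketch of background-preserving editing in the spirit of KV-Edit:
# tokens outside the edit mask keep their cached values, masked tokens update.
import numpy as np

tokens = 16  # latent image tokens (flattened patches)
dim = 8      # feature dimension per token
rng = np.random.default_rng(1)

kv_cache = rng.standard_normal((tokens, dim))  # cache from the source image
edit_mask = np.zeros(tokens, dtype=bool)
edit_mask[5:9] = True                          # user wants to edit these patches

def denoise_step(kv: np.ndarray) -> np.ndarray:
    """Stand-in for one diffusion-transformer denoising step."""
    return kv + 0.1 * rng.standard_normal(kv.shape)

updated = denoise_step(kv_cache)
# Core idea: only masked (edited) tokens take the new values; the rest of the
# cache, i.e. the background, is left untouched across all steps.
kv_cache = np.where(edit_mask[:, None], updated, kv_cache)
print("unchanged background tokens:", int((~edit_mask).sum()))
```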
Any2AnyTryon can generate high-quality virtual try-on results by transferring garments onto person images as well as reconstructing garments from real-world photos.
NotaGen can generate high-quality classical sheet music.
UniCon can handle different image generation tasks using a single framework. It adapts a pretrained image diffusion model with only about 15% extra parameters and supports most base ControlNet transformations.
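A back-of-envelope sketch of that parameter budget: freeze the backbone and train only a small add-on. The module sizes below are made up to land near the ~15% figure quoted above, and the adapter is not actually wired into the backbone here; this only illustrates the counting.

```python
# Sketch of an adapter-style parameter budget: frozen backbone plus a small
# trainable add-on. Layer counts and widths are invented for illustration.
import torch.nn as nn

base = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(20)])    # frozen
adapter = nn.Sequential(*[nn.Linear(1024, 160) for _ in range(20)])  # trainable

for p in base.parameters():
    p.requires_grad = False  # only the adapter would receive gradients

n_base = sum(p.numel() for p in base.parameters())
n_extra = sum(p.numel() for p in adapter.parameters())
print(f"extra/base parameter ratio: {n_extra / n_base:.1%}")  # ~15.6%
```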
MatAnyone can generate stable and high-quality human video matting masks.
SongGen can generate both vocals and accompaniment from text prompts using a single-stage auto-regressive transformer. It allows users to control lyrics, genre, mood, and instrumentation, and offers mixed mode for combined tracks or dual-track mode for separate tracks.
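A hypothetical interface sketch for the two output modes; the names and arguments are assumptions rather than the released API.

```python
# Hypothetical sketch of SongGen's two output modes. Mixed mode yields one
# combined track; dual-track mode yields separate vocal/accompaniment stems.
from enum import Enum

class Mode(Enum):
    MIXED = "mixed"            # single combined track
    DUAL_TRACK = "dual_track"  # separate vocal + accompaniment stems

def build_song_request(lyrics: str, genre: str, mood: str,
                       instruments: list[str], mode: Mode) -> dict:
    """Assemble one generation request; field names are assumptions."""
    return {
        "lyrics": lyrics,
        "description": f"{mood} {genre} with {', '.join(instruments)}",
        "mode": mode.value,
    }

req = build_song_request("City lights are calling out my name",
                         genre="synthpop", mood="wistful",
                         instruments=["analog synth", "drum machine"],
                         mode=Mode.DUAL_TRACK)
print(req)  # this request would be fed to the (omitted) transformer
```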