AI Toolbox
A curated collection of 883 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

Sketch2Anim can turn 2D storyboard sketches into high-quality 3D animations. It uses a motion generator for precise control and a neural mapper to align 2D sketches with 3D motion, allowing for easy editing and animation control.
SyncTalk++ can generate high-quality talking head videos with synchronized lip movements and facial expressions. It uses Gaussian Splatting for consistent subject identity and can render up to 101 frames per second.
MVPaint can generate high-resolution, seamless textures for 3D models. It uses a three-stage process for better texture quality, including multi-view generation and UV refinement to reduce visible seams.
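As a toy illustration of the multi-view fusion step such texturing pipelines perform, here is a visibility-weighted blend of per-view colors for one UV texel. The colors and weights are made-up example values, and MVPaint's actual fusion and UV refinement are learned; this only sketches the general idea.

```python
# Blend per-view color predictions for one texel, weighting by how well
# each view sees the surface (weights/colors are synthetic examples).

def blend(colors, weights):
    """Weighted average of per-view RGB colors for one texel."""
    total = sum(weights)
    return tuple(sum(w * c[k] for c, w in zip(colors, weights)) / total
                 for k in range(3))

# A view with 3x the visibility weight dominates the blended texel:
print(blend([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [3.0, 1.0]))  # -> (0.75, 0.0, 0.25)
```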
Subsurface Scattering for Gaussian Splatting can render and relight translucent objects in real time. It allows for detailed material editing and achieves high visual quality at around 150 FPS.
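Gaussian-splatting renderers reach frame rates like this largely because their core per-pixel operation is simple front-to-back alpha compositing over depth-sorted splats. A minimal one-ray sketch of that standard blending rule (real pipelines run it per pixel on the GPU; the values here are made up):

```python
# Front-to-back alpha compositing over depth-sorted splats.

def composite(splats):
    """splats: list of (color, alpha) sorted near-to-far; returns blended color."""
    color = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for c, a in splats:
        color += transmittance * a * c
        transmittance *= (1.0 - a)
    return color

# A fully opaque near splat hides everything behind it:
print(composite([(0.8, 1.0), (0.2, 0.5)]))  # -> 0.8
```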
Pusa V1.0 can generate high-quality videos from images and text prompts. It achieves a VBench-I2V score of 87.32% with only $500 in training costs and supports features like video transitions and extensions.
Reflect3D can detect 3D reflection symmetry from a single RGB image and use it to improve 3D generation.
GlobalPose can capture human motion in 3D space using 6 IMUs (inertial measurement units). It accurately reconstructs global motions and local poses while estimating 3D contacts and forces.
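To see why reconstructing global motion from sparse IMUs is hard, consider naive dead reckoning: double-integrating accelerometer readings with Euler steps, which drifts rapidly in practice. Learned systems like GlobalPose exist precisely to constrain this with pose and contact priors. A 1-D toy with synthetic values:

```python
# Naive dead reckoning: integrate acceleration -> velocity -> position.
# Any sensor bias compounds quadratically, which is why pure integration drifts.

def integrate(accels, dt):
    """Return (velocity, position) after Euler-integrating 1-D accelerations."""
    v = p = 0.0
    for a in accels:
        v += a * dt  # velocity from acceleration
        p += v * dt  # position from velocity
    return v, p

v, p = integrate([1.0] * 10, dt=0.1)  # 1 m/s^2 applied for 1 s
print(v, p)
```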
PhysX can generate 3D assets annotated with detailed physical properties, labeling each asset across five key areas: scale, material, affordance, kinematics, and function.
ACTalker can generate talking head videos by combining audio and facial motion to control specific facial areas.
SpatialTrackerV2 can track 3D points in videos using a single system for point tracking, depth, and camera position.
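Combining a tracked 2D point with estimated depth and camera intrinsics comes down to standard pinhole unprojection: X = (u - cx)·z / fx, Y = (v - cy)·z / fy, Z = z. The intrinsics below are made-up example values; a system like SpatialTrackerV2 estimates depth and camera parameters jointly rather than assuming them.

```python
# Lift a pixel (u, v) with depth z to a camera-space 3-D point
# using the standard pinhole model (fx, fy, cx, cy are example intrinsics).

def unproject(u, v, z, fx, fy, cx, cy):
    """Pixel (u, v) with depth z -> camera-space 3-D point."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

# A pixel at the principal point lies on the optical axis:
print(unproject(320, 240, 2.0, fx=500, fy=500, cx=320, cy=240))  # -> (0.0, 0.0, 2.0)
```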
CharaConsist built on top of FLUX.1 can generate consistent characters in text-to-image sequences.
UltraZoom can create gigapixel-resolution images of an object by combining a regular full-shot photo with detailed close-up captures.
HOIFH can generate synchronized object motion, full-body human motion, and detailed finger motion. It is designed for manipulating large objects within contextual environments, guided by human-level instructions.
CoDi can generate images that keep the same subject across different poses and layouts.
OSDFace can restore low-quality face images in a single step, making it faster than multi-step restoration methods. It produces high-quality images while keeping the person’s identity consistent.
CODiff can remove severe JPEG artifacts from highly compressed images. It uses a one-step diffusion process and a compression-aware visual embedder (CaVE) to improve image quality.
GeoSplatting can capture detailed 3D geometry together with realistic materials and lighting.
Add-it can add objects to images based on text prompts without extra training. It uses a smart attention system for natural placement and consistency, achieving top results in image insertion tasks.
Tora can generate high-quality videos with precise control over motion trajectories by integrating textual, visual, and trajectory conditions. It achieves high motion fidelity and supports diverse video durations, aspect ratios, and resolutions.
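Trajectory-conditioned generators typically need the user's motion path as a fixed-length sequence of points. As a preprocessing sketch (my own assumption, not Tora's published API), here is how sparse (x, y) waypoints can be resampled to n points by linear interpolation along the polyline:

```python
# Resample a sparse waypoint polyline to n evenly spaced parameter values
# so it can serve as a fixed-length trajectory condition (n >= 2).

def resample(waypoints, n):
    """Resample a polyline of (x, y) waypoints to n points."""
    out = []
    for i in range(n):
        t = i / (n - 1) * (len(waypoints) - 1)  # parameter along the polyline
        j = min(int(t), len(waypoints) - 2)     # segment index
        f = t - j                               # fraction within the segment
        (x0, y0), (x1, y1) = waypoints[j], waypoints[j + 1]
        out.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    return out

print(resample([(0.0, 0.0), (1.0, 1.0)], 5))
```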
Tora2 can generate videos with customized motion and appearance for multiple entities.