AI Toolbox
A curated collection of 917 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

ShoulderShot can generate over-the-shoulder dialogue videos that keep characters looking the same and maintain a smooth flow between shots. It allows for longer conversations and offers more flexibility in how shots are arranged.
SDMatte can extract objects from images using visual prompts like points, boxes, and masks.
Event-Driven Storytelling can generate realistic movements for multiple characters in a 3D scene. It uses a large language model to understand complex interactions, allowing for diverse and scalable behavior planning based on character relationships and their positions.
DPoser-X can generate and complete 3D whole-body human poses using a diffusion-based model.
VideoColorGrading can generate a look-up table (LUT) for matching colors between reference scenes and input videos.
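The paper's exact pipeline isn't reproduced here, but the core idea of color grading with a look-up table can be sketched generically: a 3D LUT maps each quantized RGB value to a graded RGB value. A minimal sketch using nearest-neighbor lookup (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def apply_lut(image, lut):
    """Apply a 3D color LUT to an RGB image via nearest-neighbor lookup.

    image: float array in [0, 1], shape (H, W, 3)
    lut:   array of shape (N, N, N, 3) mapping quantized RGB -> graded RGB
    """
    n = lut.shape[0]
    # Quantize each channel to the nearest LUT grid index
    idx = np.clip((image * (n - 1)).round().astype(int), 0, n - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]

# Identity LUT: output equals input, up to quantization error
n = 17  # a common LUT grid resolution
grid = np.linspace(0.0, 1.0, n)
r, g, b = np.meshgrid(grid, grid, grid, indexing="ij")
identity_lut = np.stack([r, g, b], axis=-1)

img = np.random.rand(4, 4, 3)
out = apply_lut(img, identity_lut)
```

Production LUTs typically use trilinear interpolation between grid points rather than nearest-neighbor lookup; the quantization step above is the simplest correct variant.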
SyncTalk++ can generate high-quality talking head videos with synchronized lip movements and facial expressions. It uses Gaussian Splatting for consistent subject identity and can render up to 101 frames per second.
MVPaint can generate high-resolution, seamless textures for 3D models. It uses a three-stage process for better texture quality, including multi-view generation and UV refinement to reduce visible seams.
Subsurface Scattering for Gaussian Splatting can render and relight translucent objects in real time. It allows for detailed material editing and achieves high visual quality at around 150 FPS.
Pusa V1.0 can generate high-quality videos from images and text prompts. It achieves a VBench-I2V score of 87.32% with only $500 in training costs and supports features like video transitions and extensions.
Reflect3D can detect 3D reflection symmetry from a single RGB image and improve 3D generation.
GlobalPose can capture human motion in 3D space using 6 IMUs (Inertial Measurement Units). It accurately reconstructs global motions and local poses while estimating 3D contacts and forces.
PhysX can generate 3D assets with detailed physical properties, labeling each asset across five key areas: scale, material, affordance, kinematics, and function.
ACTalker can generate talking head videos by combining audio and facial motion to control specific facial areas.
SpatialTrackerV2 can track 3D points in videos using a single system for point tracking, depth, and camera position.
CharaConsist, built on top of FLUX.1, can generate consistent characters in text-to-image sequences.
UltraZoom can create gigapixel-resolution images from regular photos by upscaling them with guidance from detailed close-up shots.
HOIFH can generate synchronized object motion, full-body human motion, and detailed finger motion. It is designed for manipulating large objects within contextual environments, guided by human-level instructions.
CoDi can generate images that keep the same subject across different poses and layouts.
OSDFace can restore low-quality face images in one step, making it faster than traditional methods. It produces high-quality images while keeping the person’s identity consistent.
CODiff can remove severe JPEG artifacts from highly compressed images. It uses a one-step diffusion process and a compression-aware visual embedder (CaVE) to improve image quality.