AI Toolbox
A curated collection of 910 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.





Subsurface Scattering for Gaussian Splatting can render and relight translucent objects in real time. It allows for detailed material editing and achieves high visual quality at around 150 FPS.
Pusa V1.0 can generate high-quality videos from images and text prompts. It achieves a VBench-I2V score of 87.32% with only $500 in training costs and supports features like video transitions and extensions.
Reflect3D can detect 3D reflection symmetry from a single RGB image and improve 3D generation.
GlobalPose can capture human motion in 3D space using six inertial measurement units (IMUs). It accurately reconstructs global motions and local poses while estimating 3D contacts and forces.
PhysX can generate 3D assets with detailed physical properties, labeling them in five key areas: scale, material, affordance, kinematics, and function.
ACTalker can generate talking head videos by combining audio and facial motion to control specific facial areas.
SpatialTrackerV2 can track 3D points in videos using a single system for point tracking, depth, and camera position.
CharaConsist, built on top of FLUX.1, can generate consistent characters in text-to-image sequences.
UltraZoom can create gigapixel-resolution images from regular photos by upscaling them with detailed close-ups.
HOIFH generates synchronized object motion, full-body human motion, and detailed finger motion. It is designed for manipulating large objects within contextual environments, guided by human-level instructions.
CoDi can generate images that keep the same subject across different poses and layouts.
OSDFace can restore low-quality face images in one step, making it faster than traditional methods. It produces high-quality images while keeping the person’s identity consistent.
CODiff can remove severe JPEG artifacts from highly compressed images. It uses a one-step diffusion process and a compression-aware visual embedder (CaVE) to improve image quality.
GeoSplatting can capture detailed 3D shapes and realistic materials and lighting.
Add-it can add objects to images based on text prompts without extra training. It uses a smart attention system for natural placement and consistency, achieving top results in image insertion tasks.
Tora can generate high-quality videos with precise control over motion trajectories by integrating textual, visual, and trajectory conditions. It achieves high motion fidelity and supports diverse video durations, aspect ratios, and resolutions.
Tora2 can generate videos with customized motion and appearance for multiple entities.
Hear-Your-Click can generate specific sounds for objects in videos when users click on them. It strengthens the connection between sound and visuals, producing audio that matches the user-selected objects.
ObjectClear can remove objects from images along with their shadows and reflections. It uses an object-effect attention mechanism to improve foreground removal and background preservation, outperforming other methods, especially in complex scenes.
SketchSeg can segment raster sketches into layers, making it easy for artists to move, copy, or delete objects.