AI Toolbox
A curated collection of 811 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

Ev-DeblurVSR can turn blurry, low-resolution videos into sharp, high-resolution ones.
PosterMaker can generate high-quality product posters by rendering text accurately and keeping the main subject clear.
FramePack aims to make video generation feel like image generation. It can generate individual video frames in about 1.5 seconds with a 13B model on an RTX 4090, and it can also produce full 30-fps video with the same 13B model on a 6GB laptop GPU, though noticeably slower.
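
For intuition, here is a hypothetical sketch of the frame-packing idea (not the authors' code, and the kernel schedule is made up): older context frames are pooled more aggressively, so they contribute far fewer tokens than recent ones and the context stays short even for long videos.

```python
# Hypothetical sketch of frame packing: pool older context frames harder so they
# contribute far fewer tokens than recent ones (kernel schedule is made up).
import torch
import torch.nn.functional as F

def pack_context(frames: torch.Tensor, kernels=(1, 2, 4, 8)) -> torch.Tensor:
    """frames: (T, C, H, W), most recent frame last. Returns (1, N_tokens, C)."""
    tokens = []
    # walk backwards in time; the older the frame, the larger the pooling kernel
    for age, frame in enumerate(reversed(frames.unbind(0))):
        k = kernels[min(age, len(kernels) - 1)]
        pooled = F.avg_pool2d(frame.unsqueeze(0), kernel_size=k)  # (1, C, H/k, W/k)
        tokens.append(pooled.flatten(2).transpose(1, 2))          # (1, H*W/k^2, C)
    return torch.cat(tokens, dim=1)

ctx = pack_context(torch.randn(16, 8, 32, 32))
print(ctx.shape)  # recent frames dominate the token budget
```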
IMAGGarment-1 can generate high-quality garments with control over shape, color, and logo placement.
Cobra can efficiently colorize line art by utilizing over 200 reference images.
UniAnimate-DiT can generate high-quality, pose-driven animations from human images. It builds on the Wan2.1 model with a lightweight pose encoder to create smooth and visually appealing results, and it can also upscale animations from 480p to 720p.
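
The pose-conditioning idea can be sketched generically: a small convolutional encoder maps rendered pose maps down to the backbone's latent resolution and injects them additively. The module below is illustrative only, not UniAnimate-DiT's actual architecture.

```python
# Generic, illustrative pose conditioning: a small conv encoder brings pose maps
# to the latent resolution of a video diffusion backbone (not UniAnimate-DiT's code).
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Encodes rendered pose maps (B, T, 3, H, W) into latent-resolution features."""
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_dim, 3, stride=2, padding=1),
        )

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        b, t, _, _, _ = pose.shape
        feats = self.net(pose.flatten(0, 1))   # (B*T, latent_dim, H/8, W/8)
        return feats.unflatten(0, (b, t))      # (B, T, latent_dim, H/8, W/8)

pose_feats = PoseEncoder()(torch.randn(1, 8, 3, 256, 256))
video_latents = torch.randn(1, 8, 16, 32, 32)
conditioned = video_latents + pose_feats       # injected additively into the backbone
print(conditioned.shape)
```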
CoMotion can detect and track 3D poses of multiple people using just one camera. It works well in crowded places and can keep track of movements over time with high accuracy.
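
The association step such a tracker needs can be illustrated with a generic sketch (not CoMotion's actual method): match each frame's 3D pose detections to existing tracks by minimizing the mean joint distance.

```python
# Generic track-detection association via Hungarian matching on mean joint distance
# (illustrative only, not CoMotion's method).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_poses(tracks: np.ndarray, detections: np.ndarray, max_dist: float = 0.5):
    """tracks: (M, J, 3), detections: (N, J, 3), coordinates in meters.
    Returns a list of (track_idx, detection_idx) pairs."""
    cost = np.linalg.norm(tracks[:, None] - detections[None, :], axis=-1).mean(-1)  # (M, N)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]

tracks = np.random.rand(3, 17, 3)       # 3 tracked people, 17 joints each
detections = tracks[[1, 0]] + 0.01      # two of them re-detected, slightly moved
print(match_poses(tracks, detections))  # -> [(0, 1), (1, 0)]
```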
PartField can segment 3D shapes into parts without relying on templates or text labels.
IP-Composer can generate compositional images by using multiple input images and natural language prompts.
PhysFlow can simulate dynamic interactions in complex scenes. It identifies material types through image queries and enhances realism using video diffusion and a Material Point Method (MPM) simulation for detailed 4D representations.
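
MPM alternates between Lagrangian particles and a background grid. The toy PIC-style transfer below shows only the scaffolding MPM builds on; there is no stress or constitutive model here, and nothing specific to PhysFlow.

```python
# Toy PIC-style particle/grid transfer (the scaffolding MPM builds on).
import numpy as np

n_particles, n_grid, dx, dt = 256, 32, 1.0 / 32, 1e-3
x = np.random.rand(n_particles, 2) * 0.4 + 0.3   # particle positions in [0.3, 0.7]^2
v = np.zeros((n_particles, 2))                    # particle velocities
gravity = np.array([0.0, -9.8])

for step in range(100):
    grid_m = np.zeros((n_grid, n_grid))           # grid mass
    grid_mv = np.zeros((n_grid, n_grid, 2))       # grid momentum
    base = (x / dx).astype(int)                   # nearest-node transfer for simplicity
    for p in range(n_particles):                  # particle-to-grid scatter
        i, j = base[p]
        grid_m[i, j] += 1.0
        grid_mv[i, j] += v[p]
    vel = np.where(grid_m[..., None] > 0,
                   grid_mv / np.maximum(grid_m, 1e-9)[..., None], 0.0)
    vel += dt * gravity                           # grid velocity update (gravity only)
    for p in range(n_particles):                  # grid-to-particle gather + advection
        i, j = base[p]
        v[p] = vel[i, j]
        x[p] = np.clip(x[p] + dt * v[p], 0.0, 1.0 - 1e-3)

print(x.mean(axis=0))  # the particle blob falls under gravity
```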
Hi3DGen can generate high-quality 3D shapes from 2D images. It uses normal maps as an intermediate bridge in a three-stage pipeline to capture fine details, outperforming other methods in realism.
HoloPart can break down 3D shapes into complete and meaningful parts, even if they are hidden. It also supports numerous downstream applications such as Geometry Editing, Geometry Processing, Material Editing and Animation.
AniSDF can reconstruct high-quality 3D shapes with improved surface geometry. It can handle complex objects, including luminous, reflective, and fuzzy ones.
Pixel3DMM can reconstruct 3D human faces from a single RGB image.
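
Under the hood, a 3D morphable model (3DMM) is just a linear model over a mean face. Below is a minimal sketch with made-up dimensions and random stand-in bases; a real pipeline would load learned bases such as FLAME or BFM.

```python
# Minimal 3DMM sketch: vertices = mean + identity basis @ alpha + expression basis @ beta.
# Dimensions and bases are random stand-ins for illustration only.
import numpy as np

n_vertices, n_id, n_expr = 5000, 80, 64
mean_shape = np.random.randn(n_vertices * 3)
id_basis = np.random.randn(n_vertices * 3, n_id)
expr_basis = np.random.randn(n_vertices * 3, n_expr)

def reconstruct(alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Identity coefficients alpha and expression coefficients beta -> mesh vertices."""
    verts = mean_shape + id_basis @ alpha + expr_basis @ beta
    return verts.reshape(n_vertices, 3)

face = reconstruct(np.random.randn(n_id) * 0.1, np.random.randn(n_expr) * 0.1)
print(face.shape)  # (5000, 3)
```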
OmniCaptioner can generate detailed text descriptions for many types of content, including images, math formulas, charts, user interfaces, PDFs, videos, and more.
ReCamMaster can re-capture videos from new camera angles.
GARF can reassemble 3D objects from real-world fractured parts.
NormalCrafter can generate consistent surface normals from video sequences. It uses video diffusion models and Semantic Feature Regularization to ensure accurate normal estimation while keeping details clear across frames.
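
One plausible form of such a feature-alignment regularizer, guessing at the general shape rather than the paper's exact formulation: project intermediate diffusion features into the space of a frozen semantic encoder (e.g. DINO) and penalize low cosine similarity.

```python
# Guessed general form of a semantic feature-alignment loss (not the paper's exact loss).
import torch
import torch.nn.functional as F

def semantic_feature_loss(diff_feats: torch.Tensor, sem_feats: torch.Tensor,
                          proj: torch.nn.Module) -> torch.Tensor:
    """diff_feats: (B, N, C_d) intermediate diffusion features,
    sem_feats:  (B, N, C_s) features from a frozen semantic image encoder."""
    projected = proj(diff_feats)  # map diffusion features into the semantic space
    return 1.0 - F.cosine_similarity(projected, sem_feats, dim=-1).mean()

proj = torch.nn.Linear(320, 768)
loss = semantic_feature_loss(torch.randn(2, 256, 320), torch.randn(2, 256, 768), proj)
print(loss.item())
```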
TTT-Video can create coherent one-minute videos from text storyboards. As the title says, it uses test-time training in place of self-attention layers to keep multi-scene, long-context videos consistent, which is quite the achievement. The paper is worth a read.
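
The core trick of a test-time-training (TTT) layer, in a deliberately simplified sketch: the layer's hidden state is the weight matrix of a tiny inner model, updated by one gradient step on a self-supervised reconstruction loss for every token it processes. The actual TTT layers use learned projections and a richer self-supervised task; this is just the skeleton.

```python
# Deliberately simplified TTT-style layer: the hidden state is a weight matrix W
# updated by gradient descent on a reconstruction loss as tokens stream through.
import torch

def ttt_layer(tokens: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """tokens: (T, D). Returns (T, D). Inner model is a single linear map W."""
    T, D = tokens.shape
    W = torch.zeros(D, D)                   # inner fast weights (the hidden state)
    outputs = []
    for x in tokens:                        # sequential scan over the sequence
        pred = W @ x                        # inner model's reconstruction of x
        grad = torch.outer(pred - x, x)     # gradient of 0.5 * ||W x - x||^2 wrt W
        W = W - lr * grad                   # one inner-loop gradient step
        outputs.append(W @ x)               # output after the update
    return torch.stack(outputs)

out = ttt_layer(torch.randn(16, 32))
print(out.shape)  # (16, 32)
```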
Piece it Together can combine different visual components into complete characters or objects. It uses a lightweight flow-matching model called IP-Prior to improve prompt adherence and enable diverse, context-aware generations.
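
Flow matching itself is easy to write down; the generic training objective below (not specific to IP-Prior or this paper) regresses a velocity field along straight paths from noise to data.

```python
# Generic flow-matching training loss: regress the velocity of a straight
# noise-to-data path. The toy model here is a placeholder for illustration.
import torch

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """x1: (B, D) data samples. model(x_t, t) predicts the velocity field."""
    x0 = torch.randn_like(x1)              # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)         # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1             # linear interpolation path
    target_v = x1 - x0                     # velocity of that path
    return ((model(xt, t) - target_v) ** 2).mean()

model = lambda xt, t: torch.zeros_like(xt)  # placeholder velocity model
print(flow_matching_loss(model, torch.randn(8, 4)).item())
```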