AI Toolbox
A curated collection of 917 free cutting-edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.
FantasyTalking can generate talking portraits from a single image, making them look realistic with accurate lip movements and facial expressions. It uses a two-step process to align audio and video, allowing users to control how expressions and body motions appear.
Textoon can generate diverse 2D cartoon characters in the Live2D format from text descriptions. It allows for real-time editing and controllable appearance generation, making it easy for users to create interactive characters.
GPS-Gaussian+ can render high-resolution 3D scenes in real time from two or more input images.
Step1X-Edit can perform advanced image editing tasks by processing reference images and user instructions.
Describe Anything can generate detailed descriptions for specific areas in images and videos using points, boxes, scribbles, or masks. It produces context-aware captions that highlight subtle details and changes over time, achieving top performance on seven benchmarks for localized captioning.
SwiftBrush v2 can improve the quality of images generated by one-step text-to-image diffusion models. Results look great, and it reportedly outranks all GAN-based and multi-step Stable Diffusion models in benchmarks. No code though 🤷‍♂️
InstantCharacter can generate high-quality images of personalized characters from a single reference image with FLUX. It supports different styles and poses, ensuring identity consistency and allowing for text-based edits.
ID-Patch can generate personalized group photos by matching faces with specific positions. It reduces problems like identity leakage and visual errors, achieving high accuracy at seven times the speed of other methods.
Phantom can generate videos that preserve a subject's identity from reference images while following text prompts.
SkyReels-V2 can generate infinite-length videos by combining a Diffusion Forcing framework with Multi-modal Large Language Models and Reinforcement Learning.
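The Diffusion Forcing idea behind infinite-length rollout can be illustrated with a toy scheduler: each frame in a sliding window carries its own noise level, the oldest frame is emitted once fully denoised, and a fresh fully-noisy frame takes its place. All names and window sizes below are illustrative, not SkyReels-V2's actual configuration.

```python
from collections import deque

MAX_NOISE = 4   # denoising steps per frame (illustrative)
WINDOW = 4      # frames denoised jointly (illustrative)

def rollout(num_frames):
    """Simulate the sliding-window schedule; returns emitted frame ids."""
    emitted, next_id = [], 0
    # seed the window with a "staircase" of noise levels: the oldest
    # frame is almost clean, the newest is pure noise
    window = deque()
    for i in range(WINDOW):
        window.append([next_id, MAX_NOISE - (WINDOW - 1 - i)])
        next_id += 1
    while len(emitted) < num_frames:
        # one joint denoising step: every frame in the window gets cleaner
        for frame in window:
            frame[1] = max(0, frame[1] - 1)
        # emit fully denoised head frames, append fresh noisy frames,
        # so the rollout can continue indefinitely
        while window and window[0][1] == 0:
            emitted.append(window.popleft()[0])
            window.append([next_id, MAX_NOISE])
            next_id += 1
    return emitted[:num_frames]
```

Because frames leave the window as soon as they are clean while new noise enters behind them, the same fixed-size window can keep producing frames forever — which is the property that makes "infinite-length" generation possible.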
SCW-VTON can fit in-shop clothing to a person’s image while keeping their pose consistent. It improves the shape of the clothing and reduces distortions in visible limb areas, making virtual try-on results look more realistic.
Ev-DeblurVSR can turn blurry, low-resolution videos into sharp, high-resolution ones.
PosterMaker can generate high-quality product posters by rendering text accurately and keeping the main subject clear.
FramePack aims to make video generation feel like image generation. It can generate a single video frame in 1.5 seconds with 13B models on an RTX 4090, and it supports full 30-fps generation with 13B models on a 6 GB laptop GPU, though noticeably slower.
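The trick that keeps memory use flat on small GPUs is packing the context: older frames are compressed more aggressively, so the total token budget stays bounded no matter how long the video grows. The sketch below shows the general idea with a geometric decay; the constants are illustrative, not FramePack's actual patchify kernel sizes.

```python
BASE_TOKENS = 1536  # tokens for the most recent frame (illustrative)

def context_tokens(num_past_frames, decay=2):
    """Token count per past frame: geometric decay with distance.

    Frame at distance d contributes roughly BASE_TOKENS / decay**d tokens,
    floored at 1 so very old frames are not dropped entirely.
    """
    return [max(1, BASE_TOKENS // decay**d) for d in range(num_past_frames)]

def total_context(num_past_frames, decay=2):
    """Total context length for a given history size."""
    return sum(context_tokens(num_past_frames, decay))
```

With `decay=2` the geometric series keeps the total near `2 * BASE_TOKENS` regardless of history length, instead of growing linearly with the number of past frames — which is why generation cost stays close to single-image cost.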
IMAGGarment-1 can generate high-quality garments with control over shape, color, and logo placement.
Cobra can efficiently colorize line art using more than 200 reference images.
UniAnimate-DiT can generate high-quality animations from human images. It uses the Wan2.1 model and a lightweight pose encoder to create smooth and visually appealing results, while also upscaling animations from 480p to 720p.
CoMotion can detect and track 3D poses of multiple people using just one camera. It works well in crowded places and can keep track of movements over time with high accuracy.