AI Toolbox
A curated collection of 915 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

MagicQuill enables efficient image editing through a simple interface for inserting elements and changing colors. It uses a multimodal large language model to predict editing intentions in real time, improving the quality of the results.
GarmentDreamer can generate wearable, simulation-ready 3D garment meshes from text prompts. The method produces diverse geometric and texture details, making it possible to create a wide range of clothing items.
JoyVASA can generate high-quality lip-sync videos of human and animal faces from a single image and speech clip.
SPARK can create high-quality 3D face avatars from regular videos and track expressions and poses in real time. It improves the accuracy of 3D face reconstructions for tasks like aging, face swapping, and digital makeup.
CHANGER can integrate an actor’s head onto a target body in digital content. It uses chroma keying for clean backgrounds and improves blending quality with head-shape and long-hair augmentation (H2 augmentation) and a Foreground Predictive Attention Transformer (FPAT).
Scaling Mesh Generation via Compressive Tokenization can generate high-quality meshes with over 8,000 faces by compressing face sequences into compact tokens.
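A rough sketch of the underlying idea, using plain coordinate quantization (the paper's actual compressive scheme is more elaborate; `tokenize_mesh` and its parameters are illustrative, not the paper's API):

```python
# Sketch: turn a mesh into a token sequence by snapping each vertex
# coordinate to a discrete grid, so a face becomes 9 integer tokens.
import numpy as np

def tokenize_mesh(vertices: np.ndarray, faces: np.ndarray, bins: int = 128) -> list[int]:
    """vertices: (V, 3) floats in world space; faces: (F, 3) vertex indices."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    grid = np.round((vertices - lo) / (hi - lo + 1e-8) * (bins - 1)).astype(int)  # (V, 3)
    tokens = []
    for face in faces:          # one triangle -> 9 coordinate tokens
        for v in face:
            tokens.extend(grid[v].tolist())
    return tokens
```

A compressive tokenizer shortens this naive sequence further, for example by sharing repeated vertices or grouping nearby faces, which is what lets an autoregressive model scale to meshes with thousands of faces.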
DAWN can generate talking head videos from a single portrait and audio clip. It produces lip movements and head poses quickly, making it effective for creating long video sequences.
DimensionX can generate photorealistic 3D and 4D scenes from a single image using controllable video diffusion.
SG-I2V can control object and camera motion in image-to-video generation using bounding boxes and trajectories.
RayGauss can render realistic novel views of 3D scenes using Gaussian-based ray casting. It produces high-quality images at 25 frames per second and avoids common rendering artifacts of earlier methods.
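A minimal sketch of the core operation, evaluating one anisotropic 3D Gaussian at sample points along a ray (illustrative NumPy; RayGauss's actual renderer is a fast GPU implementation):

```python
import numpy as np

def gaussian_along_ray(origin, direction, mu, cov, t_samples):
    """origin, direction, mu: (3,); cov: (3, 3); t_samples: (N,) ray depths."""
    pts = origin[None, :] + t_samples[:, None] * direction[None, :]  # (N, 3) sample points
    diff = pts - mu[None, :]
    inv_cov = np.linalg.inv(cov)
    mahalanobis = np.einsum("ni,ij,nj->n", diff, inv_cov, diff)      # (N,) squared distances
    return np.exp(-0.5 * mahalanobis)                                # density weights
```

Accumulating such weighted contributions from every Gaussian a ray crosses, front to back, yields the pixel color, analogous to classical volume rendering but with no neural network in the inner loop.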
CLoSD can control characters in physics-based simulations using text prompts. It can navigate to goals, strike objects, and switch between sitting and standing, all guided by simple instructions.
GIMM is a video frame interpolation method that uses implicit motion modelling to predict the motion between frames.
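A minimal sketch of what an implicit motion model looks like, assuming a plain coordinate MLP (`ImplicitMotionField` is illustrative, not GIMM's actual architecture):

```python
import torch
import torch.nn as nn

class ImplicitMotionField(nn.Module):
    """Maps a pixel coordinate plus a continuous time t in [0, 1] to a flow vector."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),   # input: (x, y, t)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),              # output: flow (dx, dy)
        )

    def forward(self, coords: torch.Tensor, t: float) -> torch.Tensor:
        """coords: (N, 2) normalized pixel positions -> (N, 2) flow at time t."""
        t_col = torch.full((coords.shape[0], 1), t, device=coords.device)
        return self.net(torch.cat([coords, t_col], dim=1))
```

Because t is continuous, the same field can be queried at any intermediate time, so a single model can interpolate at arbitrary frame rates.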
Regional-Prompting-FLUX adds regional prompting capabilities to diffusion transformers like FLUX. It effectively manages complex prompts and works well with tools like LoRA and ControlNet.
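A minimal sketch of the idea behind regional prompting, masking cross-attention so image tokens in a region attend only to the text tokens of that region's prompt (illustrative; the repo's actual FLUX integration differs and typically keeps a base prompt visible to all regions):

```python
import torch

def regional_cross_attention(q, k, v, region_of_pixel, region_of_token):
    """q: (P, d) image queries; k, v: (T, d) text keys/values;
    region_of_pixel: (P,) region id per image token;
    region_of_token: (T,) region id per text token.
    Assumes every pixel's region has at least one text token."""
    scores = q @ k.T / q.shape[-1] ** 0.5                      # (P, T) attention logits
    allowed = region_of_pixel[:, None] == region_of_token[None, :]
    scores = scores.masked_fill(~allowed, float("-inf"))       # block cross-region attention
    return torch.softmax(scores, dim=-1) @ v                   # (P, d)
```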
AutoVFX can automatically create realistic visual effects in videos from a single image and text instructions.
Adaptive Caching can speed up video generation with Diffusion Transformers by caching and reusing computations across denoising steps. It achieves up to 4.7 times faster 720p video generation without losing quality.
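A minimal sketch of the caching idea, reusing a transformer block's output at inference time when its input barely changed since the previous step (`CachedBlock` and the fixed tolerance are illustrative; AdaCache's actual schedule is content-adaptive per video):

```python
import torch

class CachedBlock(torch.nn.Module):
    """Wraps a block and skips recomputation when the input is nearly unchanged."""
    def __init__(self, block: torch.nn.Module, tol: float = 5e-2):
        super().__init__()
        self.block, self.tol = block, tol
        self.prev_in = self.prev_out = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.prev_in is not None:
            change = (x - self.prev_in).norm() / (self.prev_in.norm() + 1e-8)
            if change < self.tol:        # input barely moved between steps:
                return self.prev_out     # reuse the cached output
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out
```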
ZIM can generate precise matte masks from segmentation labels, enabling zero-shot image matting.
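For context, the classical bridge from a coarse segmentation label to matting is a trimap around the mask boundary; this sketch shows that step (illustrative; ZIM itself predicts soft mattes directly with a network):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def mask_to_trimap(mask: np.ndarray, band: int = 10) -> np.ndarray:
    """mask: (H, W) bool. Returns 0=background, 128=unknown, 255=foreground."""
    sure_fg = binary_erosion(mask, iterations=band)    # shrink: certain foreground
    sure_bg = ~binary_dilation(mask, iterations=band)  # grow then invert: certain background
    trimap = np.full(mask.shape, 128, dtype=np.uint8)  # boundary band stays unknown
    trimap[sure_fg] = 255
    trimap[sure_bg] = 0
    return trimap
```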
Face Anon can anonymize faces in images while keeping original facial expressions and head positions. It uses diffusion models to achieve high-quality image results and can also perform face swapping tasks.
CityGaussianV2 can reconstruct large-scale scenes from multi-view RGB images with high accuracy.
Self-Supervised Any-Point Tracking by Contrastive Random Walks can track any point in a video using a self-supervised global matching transformer.
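A minimal sketch of the contrastive random walk objective: points walk from frame A to frame B and back via softmax affinities, and the matcher is trained so the round trip lands back on the starting point (a simplification of the method):

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (N, d) L2-normalized point features from two frames."""
    affinity_ab = F.softmax(feat_a @ feat_b.T / temperature, dim=1)  # walk A -> B
    affinity_ba = F.softmax(feat_b @ feat_a.T / temperature, dim=1)  # walk B -> A
    round_trip = affinity_ab @ affinity_ba                           # A -> B -> A
    targets = torch.arange(feat_a.shape[0], device=feat_a.device)    # identity mapping
    return F.nll_loss((round_trip + 1e-8).log(), targets)
```

No labels are needed: the only supervision is that each point's round trip should return to itself.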
MOFT is a training-free video motion interpreter and controller. It extracts motion information from video diffusion models and guides the motion of generated videos without retraining.
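A minimal sketch of the content-debiasing idea behind such motion features: subtracting the per-pixel temporal mean so the residual varies with motion rather than appearance (illustrative; MOFT additionally selects motion-relevant channels):

```python
import torch

def motion_features(feats: torch.Tensor) -> torch.Tensor:
    """feats: (T, C, H, W) intermediate diffusion features across T frames."""
    content = feats.mean(dim=0, keepdim=True)  # per-pixel average over time (the "content")
    return feats - content                     # content-debiased residual (the "motion")
```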