AI Toolbox
A curated collection of 942 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

InvSR can upscale images in one to five steps and achieves strong results even with a single step, making it efficient for improving images in real-world settings.
DisPose can generate high-quality human image animations from sparse skeleton pose guidance.
Personalized Restoration is a method that can restore degraded face images while preserving the person's identity, guided by reference images. The restored image can also be edited with text prompts, enabling modifications like changing the eye color or making the person smile.
Leffa can generate person images based on reference images, allowing for precise control over appearance and pose.
TryOffAnyone can generate high-quality, laid-flat garment images from photos of people wearing the clothing.
SynCamMaster can generate videos from different viewpoints while keeping appearance and geometry consistent across views. It extends text-to-video models for multi-camera use and allows re-rendering from new angles.
ObjCtrl-2.5D enables object control in image-to-video generation using 3D trajectories from 2D inputs with depth information.
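As a rough illustration of the 2D-plus-depth idea, the sketch below lifts a 2D pixel trajectory into a camera-space 3D trajectory with a pinhole camera model; the intrinsics and trajectory values are made-up placeholders, and this is not ObjCtrl-2.5D's actual API.

```python
# Illustrative sketch: lift a 2D object trajectory to 3D using per-point depth
# and pinhole intrinsics (all values below are placeholders).
import numpy as np

def lift_trajectory_to_3d(points_2d, depths, fx, fy, cx, cy):
    """Unproject 2D pixel positions with per-point depth into camera-space 3D points."""
    points_2d = np.asarray(points_2d, dtype=float)   # (N, 2) pixel coordinates
    depths = np.asarray(depths, dtype=float)         # (N,) depth per trajectory point
    x = (points_2d[:, 0] - cx) / fx * depths
    y = (points_2d[:, 1] - cy) / fy * depths
    return np.stack([x, y, depths], axis=1)          # (N, 3) camera-space trajectory

# Example: a horizontal drag across the image while the object moves away.
traj_2d = [(100, 200), (150, 200), (200, 200), (250, 200)]
depths = [2.0, 2.2, 2.4, 2.6]
print(lift_trajectory_to_3d(traj_2d, depths, fx=500.0, fy=500.0, cx=256.0, cy=256.0))
```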
PRM can create high-quality 3D meshes from a single image using photometric stereo techniques. It improves detail and handles changes in lighting and materials, allowing for features like relighting and material editing.
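For context, the classical Lambertian photometric-stereo formulation recovers per-pixel surface normals from images lit from known directions; the minimal least-squares sketch below illustrates that general idea only and is not PRM's actual pipeline.

```python
# Minimal photometric-stereo sketch (classical Lambertian model):
# solve I = L @ (albedo * n) per pixel by least squares.
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """intensities: (K, H, W) images under K lights; light_dirs: (K, 3) light directions."""
    K, H, W = intensities.shape
    I = intensities.reshape(K, -1)                           # (K, H*W)
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)       # (3, H*W) = albedo * normal
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)
    return normals.T.reshape(H, W, 3), albedo.reshape(H, W)

# Toy example: 4 lights shining on a flat, upward-facing patch.
lights = np.array([[0, 0, 1], [0.3, 0, 0.95], [0, 0.3, 0.95], [-0.3, 0, 0.95]])
images = np.clip(lights @ np.array([0, 0, 1.0]), 0, None)[:, None, None] * np.ones((4, 8, 8))
normals, albedo = photometric_stereo(images, lights)
print(normals[0, 0])  # approximately [0, 0, 1]
```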
3DTrajMaster can control the 3D motions of multiple objects in videos using user-defined 6DoF pose sequences.
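A 6DoF pose sequence of the kind described here can be represented as a list of 4x4 rigid transforms; the sketch below builds such a sequence from a yaw angle and a translation per frame. The format is illustrative and not 3DTrajMaster's actual input specification.

```python
# Illustrative 6DoF pose sequence: one 4x4 homogeneous transform per frame.
import numpy as np
from scipy.spatial.transform import Rotation as R

def make_pose(yaw_deg, translation):
    """Build a 4x4 rigid transform from a yaw rotation (degrees) and a translation."""
    T = np.eye(4)
    T[:3, :3] = R.from_euler("y", yaw_deg, degrees=True).as_matrix()
    T[:3, 3] = translation
    return T

# An object moving 2 units along +x while turning 90 degrees over 8 frames.
num_frames = 8
poses = [
    make_pose(yaw_deg=90 * t / (num_frames - 1),
              translation=[2 * t / (num_frames - 1), 0, 0])
    for t in range(num_frames)
]
print(poses[-1])
```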
FireFlow is a FLUX-dev editing method that can perform fast image inversion and semantic editing in just 8 diffusion steps.
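Conceptually, inversion for a flow-based model like FLUX integrates the model's ODE from the clean image toward noise in a small number of steps; the toy sketch below uses a placeholder velocity predictor and plain Euler steps, not FireFlow's actual solver.

```python
# Conceptual few-step inversion sketch for a rectified-flow model.
# The velocity predictor is a stand-in; a real model predicts dx/dt from x and t.
import numpy as np

def predict_velocity(x, t):
    """Placeholder for the flow model's velocity prediction."""
    return np.zeros_like(x)

def invert(image, num_steps=8):
    """Map an image to its latent noise by stepping the flow ODE from t=0 to t=1."""
    x = image.copy()
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * predict_velocity(x, t0)  # Euler step
    return x  # inverted latent; editing re-integrates back toward t=0

latent = invert(np.random.rand(3, 64, 64), num_steps=8)
```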
Tactile DreamFusion can improve 3D asset generation by combining high-resolution tactile sensing with diffusion-based image priors. It supports both text-to-3D and image-to-3D generation.
Factor Graph Diffusion can generate high-quality images with better prompt adherence. The method enables controllable image creation from conditioning inputs such as segmentation and depth maps.
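For a concrete sense of this kind of conditioning, the snippet below shows generic depth-conditioned generation with a ControlNet pipeline in diffusers; it illustrates controllable image creation in general, not the Factor Graph Diffusion method itself, and the model IDs are commonly used Hub checkpoints.

```python
# Generic depth-conditioned generation with diffusers (ControlNet), shown only
# as an example of map-based control; "depth.png" is a placeholder path.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

depth_map = Image.open("depth.png")  # depth map used as the control signal
image = pipe("a cozy reading nook, warm light", image=depth_map).images[0]
image.save("controlled_output.png")
```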
Customizing Motion can learn motion patterns from input videos and generalize them to new and unseen contexts.
MEMO can generate talking videos from images and audio. It keeps the person’s identity consistent and matches lip movements to the audio, producing natural expressions.
MV-Adapter can generate images from multiple views while keeping them consistent across views. It enhances text-to-image models like Stable Diffusion XL, supporting both text and image inputs, and achieves high-resolution outputs at 768x768.
CAVIS performs instance segmentation on videos. It tracks objects better and improves instance matching accuracy, resulting in more accurate and stable segmentation.
VideoRepair can improve text-to-video generation by finding and fixing small mismatches between text prompts and videos.
Trellis 3D generates high-quality 3D assets in formats like Radiance Fields, 3D Gaussians, and meshes. It supports text and image conditioning, offering flexible output format selection and local 3D editing capabilities.
Anagram-MTL can generate visual anagrams that change appearance with transformations like flipping or rotating.
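One common recipe for visual anagrams is to average a diffusion denoiser's noise predictions across each prompt's transformed view of the same image; the toy sketch below uses a stand-in denoiser and makes no claim about Anagram-MTL's exact method.

```python
# Toy visual-anagram sketch: combine noise estimates across transformed views
# so one image reads as a different prompt under each transformation.
import numpy as np

def predict_noise(x, prompt):
    """Placeholder for a text-conditioned diffusion denoiser."""
    return np.zeros_like(x)

def anagram_step(x_t, prompts, transforms, inverses):
    """One denoising step's combined noise estimate across all views."""
    eps = [inv(predict_noise(tf(x_t), p))
           for p, tf, inv in zip(prompts, transforms, inverses)]
    return np.mean(eps, axis=0)

x_t = np.random.randn(3, 64, 64)
prompts = ["an old man", "a campfire"]
transforms = [lambda im: im, lambda im: np.rot90(im, k=2, axes=(1, 2))]  # identity, 180° rotation
inverses = [lambda im: im, lambda im: np.rot90(im, k=-2, axes=(1, 2))]
eps_combined = anagram_step(x_t, prompts, transforms, inverses)
```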
Dessie can estimate the 3D shape and pose of horses from single images. It also works with other large animals like zebras and cows.