AI Toolbox
A curated collection of 667 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

MEGASAM can estimate camera parameters and depth maps from casual monocular videos.
Cycle3D can generate high-quality and consistent 3D content from a single unposed image. This approach enhances texture consistency and multi-view coherence, significantly improving the quality of the final 3D reconstruction.
DressRecon can create 3D human body models from single videos. It handles loose clothing and handheld objects well, achieving high-quality results by combining generic human body priors with deformations learned from the specific video.
Dora can generate 3D assets from images that are ready for diffusion-based character control in modern 3D engines, such as Unity 3D, in real time.
Magic 1-For-1 can generate one-minute video clips in just one minute.
VD3D enables camera control for video diffusion models and can transfer the camera trajectory from a reference video.
InstantSwap can replace a concept in an image with one from a reference image while keeping the foreground and background consistent. It uses automated bounding box extraction and cross-attention to cut unnecessary computation and speed up the process.
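A minimal sketch of the bounding-box idea, assuming a hypothetical `denoise_step` callable in place of the real model: restricting each denoising update to the box means latents outside it are reused rather than recomputed. InstantSwap's actual pipeline differs in detail.

```python
import numpy as np

def swap_in_bbox(latents, bbox, denoise_step, num_steps=50):
    # Sketch only: `denoise_step` is a hypothetical stand-in for one model pass.
    x0, y0, x1, y1 = bbox
    out = latents.copy()
    for t in range(num_steps):
        crop = out[..., y0:y1, x0:x1]                   # region holding the swapped concept
        out[..., y0:y1, x0:x1] = denoise_step(crop, t)  # only the crop is updated
    return out                                          # background latents stay untouched
```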
Diffusion as Shader can generate high-quality videos from 3D tracking inputs.
MaterialFusion can transfer materials onto objects in images while letting users control how much material is applied.
Lumina-Video can generate high-quality videos with synchronized sound from text prompts.
Light-A-Video can relight videos without flickering.
PeriodWave can generate high-quality speech waveforms by capturing the periodic patterns of audio. Its period-aware flow matching estimator outperforms other models in text-to-speech tasks and Mel-spectrogram reconstruction.
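As a rough illustration of flow-matching inference in general (not PeriodWave's exact sampler), one integrates a learned velocity field from noise toward a waveform; `velocity_net` below is a hypothetical stand-in for the period-aware estimator, which additionally conditions on periodic structure.

```python
import torch

def sample_waveform(velocity_net, mel, num_steps=16, wav_len=32768):
    # Euler integration of the flow ODE from Gaussian noise to audio.
    x = torch.randn(1, wav_len)               # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)          # flow time in [0, 1)
        x = x + velocity_net(x, t, mel) * dt  # step along the predicted velocity
    return x                                  # generated waveform
```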
LayerPano3D can generate immersive 3D scenes from a single text prompt by breaking a 2D panorama into depth layers.
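The layering step can be pictured as depth-banded masking; the sketch below thresholds a depth map at assumed bounds, whereas LayerPano3D's actual decomposition is learned and also completes content hidden behind nearer layers.

```python
import numpy as np

def split_into_depth_layers(panorama, depth, bounds=(2.0, 10.0)):
    # panorama: (H, W, 3) RGB, depth: (H, W); the bounds are illustrative.
    layers, lo = [], 0.0
    for hi in (*bounds, np.inf):
        mask = (depth >= lo) & (depth < hi)                # pixels in this depth band
        layers.append(np.where(mask[..., None], panorama, 0))
        lo = hi
    return layers                                          # ordered near to far
```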
FlashVideo can generate videos from text prompts and upscale them to 1080p.
Semantic Gesticulator can generate realistic co-speech gestures with strong semantic correspondence, which is vital for effective communication.
Video Alchemist can generate personalized videos from text prompts and reference images. It supports multiple subjects and backgrounds without lengthy test-time optimization, achieving high-quality results with better subject fidelity and text alignment.
TeSMo is a method for text-controlled, scene-aware motion generation. It can produce realistic and diverse human-object interactions, such as navigating and sitting, in different scenes with varied object shapes, orientations, initial body positions, and poses.
MotionLab can generate and edit human motion and supports text-based and trajectory-based motion creation.
ControlFace can edit face images with precise control over pose, expression, and lighting. It uses a dual-branch U-Net architecture and is trained on facial videos to ensure high-quality results while keeping the person’s identity intact.
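A dual-branch block can be sketched as ControlNet-style residual injection; the module names below are illustrative assumptions, not ControlFace's actual layers.

```python
import torch.nn as nn

class DualBranchBlock(nn.Module):
    # Sketch: identity features and pose/expression/lighting controls are
    # fused and added to the base U-Net activations. The zero-initialized
    # fuse layer leaves the pretrained model unchanged at the start of training.
    def __init__(self, dim):
        super().__init__()
        self.id_branch = nn.Conv2d(dim, dim, 3, padding=1)    # identity branch
        self.ctrl_branch = nn.Conv2d(dim, dim, 3, padding=1)  # control branch
        self.fuse = nn.Conv2d(dim, dim, 1)
        nn.init.zeros_(self.fuse.weight)
        nn.init.zeros_(self.fuse.bias)

    def forward(self, h, id_feat, ctrl_feat):
        return h + self.fuse(self.id_branch(id_feat) + self.ctrl_branch(ctrl_feat))
```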
OmniPhysGS can generate realistic 3D dynamic scenes by modeling objects with Constitutive 3D Gaussians.