AI Toolbox
A curated collection of 754 free, cutting-edge AI papers with code, and tools for text, image, video, 3D, and audio generation and manipulation.
Step-Video-T2V can generate high-quality videos up to 204 frames long using a 30B-parameter text-to-video model.
MIGE can generate images from text prompts and reference images and edit existing images based on instructions.
Cycle3D can generate high-quality and consistent 3D content from a single unposed image. This approach enhances texture consistency and multi-view coherence, significantly improving the quality of the final 3D reconstruction.
LIFe-GoM can create animatable 3D human avatars from sparse multi-view images in under 1 second. It renders high-quality images at 95.1 frames per second.
DressRecon can create 3D human body models from single videos. It handles loose clothing and objects well, achieving high-quality results by combining general human shapes with specific video movements.
Dora can generate 3D assets from images that are ready for diffusion-based character control in modern 3D engines, such as Unity 3D, in real time.
Magic 1-For-1 can generate one-minute video clips in just one minute.
VD3D enables camera control for video diffusion models and can transfer the camera trajectory from a reference video.
InstantSwap can swap concepts in images from a reference image while keeping the foreground and background consistent. It uses automated bounding box extraction and cross-attention to make the process more efficient by reducing unnecessary calculations.
Diffusion as Shader can generate high-quality videos from 3D tracking inputs.
MaterialFusion can transfer materials onto objects in images while letting users control how much material is applied.
Lumina-Video can generate high-quality videos with synchronized sound from text prompts.
Light-A-Video can relight videos without flickering.
PeriodWave can generate high-quality speech waveforms by capturing repeating sound patterns. It uses a period-aware flow matching estimator to outperform other models in text-to-speech tasks and Mel-spectrogram reconstruction.
LayerPano3D can generate immersive 3D scenes from a single text prompt by breaking a 2D panorama into depth layers.
FlashVideo can generate videos from text prompts and upscale them to 1080p.
Semantic Gesticulator can generate realistic gestures that accompany speech, with the strong semantic correspondence vital for effective communication.
VideoGuide can improve the quality of videos generated by text-to-video models without extra training. It enhances motion smoothness and image clarity, making videos more coherent and visually appealing.
Video Alchemist can generate personalized videos using text prompts and reference images. It supports multiple subjects and backgrounds without long setup times, achieving high-quality results with better subject fidelity and text alignment.
TeSMo is a method for text-controlled, scene-aware motion generation. It can generate realistic and diverse human-object interactions, such as navigation and sitting, across scenes with varied object shapes, orientations, initial body positions, and poses.