AI Toolbox
A curated collection of 867 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

Diffusion as Shader can generate high-quality videos from 3D tracking inputs.
MaterialFusion can transfer materials onto objects in images while letting users control how much material is applied.
Lumina-Video can generate high-quality videos with synchronized sound from text prompts.
Light-A-Video can relight videos without flickering.
PeriodWave can generate high-quality speech waveforms by capturing repeating sound patterns. It uses a period-aware flow matching estimator to outperform other models in text-to-speech tasks and Mel-spectrogram reconstruction.
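For intuition, here is a minimal sketch of the flow matching objective that estimators like PeriodWave’s build on: a network regresses the constant velocity of a straight path from noise to data. The `VelocityNet` module, its dimensions, and the linear path are illustrative assumptions, not the paper’s actual period-aware architecture.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Placeholder velocity estimator conditioned on time t."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 512), nn.SiLU(), nn.Linear(512, dim)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate the scalar time onto each sample as the conditioning.
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def flow_matching_loss(model: VelocityNet, x1: torch.Tensor) -> torch.Tensor:
    """Regress the velocity of a straight noise-to-data path (rectified flow)."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0])                    # uniform time in [0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1  # point on the linear path
    target = x1 - x0                               # the path's constant velocity
    return ((model(x_t, t) - target) ** 2).mean()
```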
LayerPano3D can generate immersive 3D scenes from a single text prompt by breaking a 2D panorama into depth layers.
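As a toy illustration of the layering idea, the sketch below masks a panorama into near, mid, and far RGBA layers by depth range; the `split_into_depth_layers` helper, its thresholds, and the precomputed depth map are hypothetical stand-ins, not LayerPano3D’s actual pipeline.

```python
import numpy as np

def split_into_depth_layers(panorama: np.ndarray,
                            depth: np.ndarray,
                            boundaries=(2.0, 10.0)) -> list[np.ndarray]:
    """Split an HxWx3 uint8 panorama into RGBA layers by depth range."""
    edges = (0.0, *boundaries, np.inf)
    layers = []
    for near, far in zip(edges[:-1], edges[1:]):
        mask = (depth >= near) & (depth < far)       # pixels in this depth band
        alpha = (mask * 255).astype(panorama.dtype)
        layers.append(np.dstack([panorama, alpha]))  # hide everything else
    return layers
```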
FlashVideo can generate videos from text prompts and upscale them to 1080p.
Semantic Gesticulator can generate realistic gestures that accompany speech with the strong semantic correspondence vital for effective communication.
VideoGuide can improve the quality of videos made by text-to-video models without needing extra training. It enhances the smoothness of motion and clarity of images, making the videos more coherent and visually appealing.
Video Alchemist can generate personalized videos using text prompts and reference images. It supports multiple subjects and backgrounds without long setup times, achieving high-quality results with better subject fidelity and text alignment.
TeSMo is a method for text-controlled, scene-aware motion generation that can produce realistic and diverse human-object interactions, such as navigating and sitting, across scenes with varied object shapes, orientations, and initial body positions and poses.
MotionLab can generate and edit human motion and supports text-based and trajectory-based motion creation.
SMF can transfer 2D or 3D keypoint animations to full-body mesh animations without needing template meshes or corrective keyframes.
ControlFace can edit face images with precise control over pose, expression, and lighting. It uses a dual-branch U-Net architecture and is trained on facial videos to ensure high-quality results while keeping the person’s identity intact.
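The dual-branch pattern can be sketched schematically: one branch encodes the reference face, and its features are injected into the branch that does the denoising. Everything below (the `DualBranch` module, plain convolutions, additive fusion) is an assumption-level simplification, not ControlFace’s actual U-Net.

```python
import torch
import torch.nn as nn

class DualBranch(nn.Module):
    """Toy two-branch network: reference features steer the main path."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.reference_branch = nn.Conv2d(3, channels, 3, padding=1)
        self.denoise_branch = nn.Conv2d(3, channels, 3, padding=1)
        self.fuse = nn.Conv2d(channels, channels, 1)  # project identity cues
        self.head = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, noisy: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        ref_feat = self.reference_branch(reference)  # encode the source face
        x = self.denoise_branch(noisy)               # main editing path
        x = x + self.fuse(ref_feat)                  # inject identity features
        return self.head(x)
```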
OmniPhysGS can generate realistic 3D dynamic scenes by modeling objects with Constitutive 3D Gaussians.
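One way to picture a “constitutive” Gaussian is a standard splatting primitive extended with material parameters a physics simulator can consume. The field names below are illustrative guesses, not OmniPhysGS’s actual data structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ConstitutiveGaussian:
    position: np.ndarray    # (3,) center in world space
    covariance: np.ndarray  # (3, 3) anisotropic extent of the splat
    color: np.ndarray       # (3,) RGB
    opacity: float
    youngs_modulus: float   # stiffness under elastic deformation
    poisson_ratio: float    # how much volume is preserved when squeezed
```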
GestureLSM can generate real-time co-speech gestures by modeling how different body parts interact.
Imagine360 can generate high-quality 360° videos from monocular single-view videos.
Wonderland can generate high-quality 3D scenes from a single image using a camera-guided video diffusion model. It allows for easy navigation and exploration of the 3D space and generalizes better than other methods to images it hasn’t seen before.
DiffSplat can generate 3D Gaussian splats from text prompts and single-view images in 1-2 seconds.
Stable Flow can edit images by adding, removing, or changing objects.