AI Toolbox
A curated collection of 849 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

DepthSplat can reconstruct 3D scenes from only a few images by connecting Gaussian splatting and depth estimation.
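A minimal sketch of that coupling, with hypothetical module names (not DepthSplat's actual code): a monocular depth branch supplies features that a feed-forward head turns into per-pixel 3D Gaussian parameters.

```python
# Illustrative sketch of the DepthSplat idea: fuse image features with
# monocular depth features and predict one Gaussian per pixel.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Predicts per-pixel Gaussian parameters from image + depth features."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # 3 (mean offset) + 3 (scale) + 4 (rotation quat) + 1 (opacity) + 3 (color) = 14
        self.proj = nn.Conv2d(feat_dim * 2, 14, kernel_size=1)

    def forward(self, img_feats: torch.Tensor, depth_feats: torch.Tensor):
        # Concatenate the two streams and predict Gaussian parameters.
        return self.proj(torch.cat([img_feats, depth_feats], dim=1))

head = GaussianHead()
img_feats = torch.randn(1, 64, 32, 32)    # from a multi-view image encoder
depth_feats = torch.randn(1, 64, 32, 32)  # from a monocular depth network
gaussians = head(img_feats, depth_feats)
print(gaussians.shape)                    # torch.Size([1, 14, 32, 32])
```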
MonST3R can estimate 3D shapes from videos over time, creating a dynamic point cloud and tracking camera positions. This method improves video depth estimation and separates moving from still objects more effectively than previous techniques.
UniPortrait can customize images of one or more people with high quality. It allows for detailed face editing and uses free-form text descriptions to guide changes.
F5-TTS can generate natural-sounding speech using a fast text-to-speech system. It supports multiple languages, can switch between languages smoothly, and is trained on a 100,000-hour speech dataset.
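If you want to try it, inference roughly follows the shape below; the import path and argument names are assumptions, so check the official SWivid/F5-TTS README before relying on them.

```python
# Hypothetical F5-TTS inference call -- the exact API may differ;
# verify against the official SWivid/F5-TTS documentation.
from f5_tts.api import F5TTS  # assumed import path

tts = F5TTS()  # loads the pretrained flow-matching TTS model
wav, sr, _ = tts.infer(
    ref_file="voice_sample.wav",               # short clip of the target voice
    ref_text="Transcript of the voice sample.",
    gen_text="Hello! This sentence is synthesized in the cloned voice.",
    file_wave="output.wav",                    # write the result to disk
)
```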
MimicTalk can generate personalized 3D talking faces in under 15 minutes. It mimics a person’s talking style using a special audio-to-motion model, resulting in high-quality videos.
GS^3 can relight scenes in real-time using a triple Gaussian splatting process. It achieves high-quality lighting and view synthesis from multiple images, running at 90 fps on a single GPU.
DreamWaltz-G can generate high-quality 3D avatars from text and animate them using SMPL-X motion sequences. It improves avatar consistency with Skeleton-guided Score Distillation and is useful for human video reenactment and creating scenes with multiple subjects.
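Skeleton-guided Score Distillation extends standard score distillation sampling (SDS); the generic SDS step it builds on looks roughly like this (illustrative names, skeleton conditioning omitted):

```python
# Generic score-distillation (SDS) gradient step -- the mechanism
# DreamWaltz-G extends with skeleton guidance. Not the paper's code.
import torch

def sds_grad(unet, x, cond, alphas_cumprod):
    """Gradient pushed into a differentiable render x by a frozen diffusion prior."""
    t = torch.randint(20, 980, (x.shape[0],))       # random diffusion timestep
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * noise     # forward-noise the render
    eps_pred = unet(x_t, t, cond)                   # frozen denoiser's prediction
    return (1 - a) * (eps_pred - noise)             # weighted score difference

unet = lambda x_t, t, c: torch.randn_like(x_t)      # stand-in for a real prior
alphas = torch.linspace(0.9999, 0.01, 1000)
x = torch.randn(1, 3, 64, 64, requires_grad=True)   # differentiable avatar render
x.backward(gradient=sds_grad(unet, x, None, alphas))  # accumulates into x.grad
```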
Tex4D can generate 4D textures for untextured mesh sequences from a text prompt. It combines 3D geometry with video diffusion models to ensure the textures are consistent across different views and frames.
HART is an autoregressive transformer model that can generate high-quality 1024x1024 images from text 3x faster than SD3-Medium.
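For context, the autoregressive paradigm HART accelerates boils down to sampling one visual token at a time; a bare-bones version of that loop (illustrative, and omitting HART's diffusion-based residual refinement):

```python
# Vanilla autoregressive image-token sampling -- the loop HART speeds up.
import torch

def sample_tokens(model, prompt_emb, num_tokens=1024, temperature=1.0):
    tokens = torch.empty(1, 0, dtype=torch.long)
    for _ in range(num_tokens):                       # one visual token per step
        logits = model(tokens, prompt_emb)[:, -1]     # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens                                     # decode to pixels with a VQ decoder

model = lambda toks, emb: torch.randn(1, toks.shape[1] + 1, 4096)  # stand-in
print(sample_tokens(model, None, num_tokens=4).shape)              # torch.Size([1, 4])
```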
EfficientViT can speed up high-resolution diffusion models by compressing inputs at a spatial ratio of up to 128 while keeping good image quality. It achieves a 19.1x speed increase for inference and a 17.9x speed increase for training on ImageNet 512x512 compared to other autoencoders.
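To see why that ratio matters, here is what a 128x spatial compression does to the latent grid the diffusion model actually runs on (illustrative arithmetic):

```python
# Latent size under a 128x spatial compression (illustrative numbers).
h = w = 512              # ImageNet 512x512 input
f = 128                  # autoencoder downsampling factor
print(h // f, w // f)    # 4 4 -> diffusion operates on a tiny 4x4 latent grid,
                         # which is where the large training/inference speedups come from
```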
Depth Any Video can generate high-resolution depth maps for videos. It uses a large dataset of 40,000 annotated clips to improve accuracy and includes a method for better depth inference across sequences of up to 150 frames.
CtrLoRA can adapt a base ControlNet for image generation with just 1,000 data pairs in under one hour of training on a single GPU. It reduces learnable parameters by 90%, making it much easier to create new guidance conditions.
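The mechanism behind that parameter reduction is standard low-rank adaptation: the base weights stay frozen and only a small low-rank pair is trained. A minimal, illustrative version (not CtrLoRA's actual code):

```python
# Minimal LoRA adapter of the kind CtrLoRA attaches to a base ControlNet.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the base layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # start as identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(320, 320))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")  # ~4.7% of all parameters
```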
SceneCraft can generate detailed indoor 3D scenes from user layouts and text descriptions. It is able to turn 3D layouts into 2D maps, producing complex spaces with diverse textures and realistic visuals.
TweedieMix can generate images and videos that combine multiple personalized concepts.
RFNet is a training-free approach that brings better prompt understanding to image generation, adding support for prompt reasoning, conceptual and metaphorical thinking, imaginative scenarios, and more.
FreeLong can generate 128-frame videos from short video diffusion models trained on 16-frame videos, without requiring additional training. It’s not SOTA, but has just the right amount of cursedness 👌
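FreeLong's trick is a frequency-domain blend of attention features: low temporal frequencies come from globally attended features, high frequencies from locally attended ones. A simplified, illustrative version:

```python
# Simplified frequency blend in the spirit of FreeLong's SpectralBlend
# attention (illustrative, not the paper's implementation).
import torch

def spectral_blend(global_feat, local_feat, cutoff=0.25):
    # FFT over the temporal axis (dim=0: frames)
    Fg = torch.fft.fft(global_feat, dim=0)
    Fl = torch.fft.fft(local_feat, dim=0)
    freqs = torch.fft.fftfreq(global_feat.shape[0])
    low = (freqs.abs() <= cutoff).view(-1, *[1] * (global_feat.dim() - 1))
    blended = torch.where(low, Fg, Fl)   # low band from global, rest from local
    return torch.fft.ifft(blended, dim=0).real

g = torch.randn(128, 64)   # globally attended features for 128 frames
l = torch.randn(128, 64)   # locally (windowed) attended features
print(spectral_blend(g, l).shape)   # torch.Size([128, 64])
```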
Animate3D can animate any static multi-view 3D model.
VSTAR is a method that enables text-to-video models to generate longer videos with dynamic visual evolution in a single pass, without any fine-tuning.
Hallo2 can create long, high-resolution (4K) animations of portrait images driven by audio. It allows users to adjust facial expressions with text labels, improving control and reducing issues like appearance drift and temporal artifacts.
Pyramidal Flow Matching can generate high-quality 5 to 10-second videos at 768p resolution and 24 FPS. It uses a unified pyramidal flow matching algorithm to link flows across different stages, making video creation more efficient.
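For reference, a single (non-pyramidal) flow-matching training step looks like the sketch below; Pyramidal Flow links several such stages across resolutions (names illustrative, not the paper's code):

```python
# Plain flow-matching objective -- the building block Pyramidal Flow
# stacks across resolution stages.
import torch

def flow_matching_loss(model, x1, cond=None):
    """x1: clean latents. Train a velocity field carrying noise x0 toward x1."""
    x0 = torch.randn_like(x1)                  # noise endpoint of the flow
    t = torch.rand(x1.shape[0], 1, 1, 1)       # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # straight-line interpolant
    v_target = x1 - x0                         # constant ground-truth velocity
    v_pred = model(xt, t, cond)
    return ((v_pred - v_target) ** 2).mean()

model = lambda xt, t, c: xt                    # stand-in network
print(flow_matching_loss(model, torch.randn(2, 4, 32, 32)))
```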