AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
Phidias can generate high-quality 3D assets from text, images, and 3D references. It uses a method called reference-augmented diffusion to improve quality and speed, achieving results in just a few seconds.
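The core idea of reference-augmented diffusion is easy to sketch even without the paper's code: the denoiser cross-attends to features from a retrieved or user-provided 3D reference on top of its usual conditioning. A minimal toy sketch in PyTorch — all module names and shapes here are illustrative, not Phidias's actual API:

```python
# Toy sketch of reference-augmented conditioning (NOT Phidias's actual code).
# The denoiser attends to tokens from a 3D reference in addition to the noisy latent.
import torch
import torch.nn as nn

class ReferenceAugmentedDenoiser(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # cross-attend to reference
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x, ref_tokens):
        # x: noisy latent tokens (B, N, dim); ref_tokens: features of the 3D reference (B, M, dim)
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.ref_attn(x, ref_tokens, ref_tokens)[0]  # inject reference guidance
        return x + self.mlp(x)

denoiser = ReferenceAugmentedDenoiser()
out = denoiser(torch.randn(1, 64, 256), torch.randn(1, 32, 256))
print(out.shape)  # torch.Size([1, 64, 256])
```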
EventEgo3D++ can capture 3D human motion using a monocular event camera with a fisheye lens. It works well in low-light and high-speed conditions, providing real-time 3D pose updates at 140Hz with higher accuracy than RGB-based methods.
Cyberpunk brain dances are becoming a thing! D-NPC can turn videos into dynamic neural point clouds, aka 4D scenes, making it possible to watch a scene from another perspective.
Distill Any Depth can generate depth maps from images.
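For a sense of what running a monocular depth model looks like in practice, here's a hedged sketch using Hugging Face's depth-estimation pipeline. The pipeline task is real; the checkpoint below is a Depth Anything V2 stand-in, so swap in the actual Distill Any Depth weights from the project's repo:

```python
# Hedged usage sketch: the "depth-estimation" pipeline is a real transformers API,
# but the checkpoint is a stand-in — use the Distill Any Depth weights from its repo.
from transformers import pipeline

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
result = depth("photo.jpg")               # accepts an image path or a PIL.Image
result["depth"].save("photo_depth.png")   # "depth" is a PIL image of the predicted map
```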
GHOST 2.0 is a deepfake method that can transfer heads from one image to another while keeping the skin color and structure intact.
FreeTimeGS can reconstruct dynamic 3D scenes in real-time using Gaussian primitives that can appear at different times and places.
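The distinguishing trick is that each Gaussian carries its own time window and velocity, so primitives can appear and fade anywhere in spacetime rather than being deformed from a single canonical frame. A toy NumPy sketch of that idea (illustrative parameterization, not the paper's code):

```python
# Toy sketch of time-conditioned Gaussian primitives (illustrative, not FreeTimeGS's code).
# Each Gaussian has a canonical time, a lifespan, and a velocity: opacity peaks at its
# canonical time and the center drifts linearly around it.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
position = rng.normal(size=(N, 3))          # canonical 3D centers
velocity = rng.normal(scale=0.1, size=(N, 3))
t_center = rng.uniform(0.0, 1.0, size=N)    # when each primitive is "alive"
lifespan = rng.uniform(0.05, 0.3, size=N)
base_opacity = rng.uniform(0.5, 1.0, size=N)

def primitives_at(t):
    """Return centers and opacities of all Gaussians at time t."""
    pos = position + velocity * (t - t_center)[:, None]                     # linear motion
    alpha = base_opacity * np.exp(-0.5 * ((t - t_center) / lifespan) ** 2)  # temporal opacity window
    return pos, alpha

pos, alpha = primitives_at(0.5)
print(pos.shape, round(float(alpha.max()), 3))
```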
KV-Edit can edit images while keeping the background consistent. It allows users to add, remove, or change objects without needing extra training, ensuring high image quality.
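The training-free trick is in the name: key-value pairs for background tokens are cached during inversion and reused during denoising, so the edited region attends to a frozen background that cannot drift. A conceptual sketch with toy shapes (not the official implementation):

```python
# Conceptual sketch of the KV-cache trick behind KV-Edit (illustrative, not official code).
# Background keys/values are cached at inversion time; edited foreground tokens attend
# to those frozen KVs, keeping the background pixel-consistent.
import torch

def attention(q, k, v):
    w = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return w @ v

tokens = torch.randn(1, 16, 64)            # 16 image tokens
bg_mask = torch.zeros(16, dtype=torch.bool)
bg_mask[:12] = True                         # tokens 0-11 belong to the background

# Inversion pass: cache the background keys/values once.
k_cache = tokens[:, bg_mask].clone()
v_cache = tokens[:, bg_mask].clone()

# Editing pass: only foreground tokens are re-denoised, but they still attend
# to the *cached* background KVs instead of freshly recomputed ones.
fg = torch.randn(1, 4, 64)                  # new foreground tokens being edited
k = torch.cat([k_cache, fg], dim=1)
v = torch.cat([v_cache, fg], dim=1)
print(attention(fg, k, v).shape)            # torch.Size([1, 4, 64])
```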
Any2AnyTryon can generate high-quality virtual try-on results, transferring garments onto images of people as well as reconstructing garments from real-world photos.
NotaGen can generate high-quality classical sheet music.
UniCon can handle different image generation tasks using a single framework. It adapts a pretrained image diffusion model with only about 15% extra parameters and supports most base ControlNet transformations.
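The ~15% figure is just a parameter-count ratio between the trainable add-ons and the frozen backbone. A back-of-envelope sketch with hypothetical layer sizes, not UniCon's real architecture:

```python
# Back-of-envelope adapter sketch (hypothetical sizes, not UniCon's code): freeze the
# pretrained backbone, train only small side modules, and measure the overhead.
import torch.nn as nn

backbone = nn.Sequential(*[nn.Linear(512, 512) for _ in range(20)])  # stands in for a frozen UNet/DiT
for p in backbone.parameters():
    p.requires_grad = False

adapters = nn.ModuleList(
    [nn.Sequential(nn.Linear(512, 38), nn.GELU(), nn.Linear(38, 512)) for _ in range(20)]
)  # trainable, narrow side modules

base = sum(p.numel() for p in backbone.parameters())
extra = sum(p.numel() for p in adapters.parameters())
print(f"adapter overhead: {extra / base:.1%}")  # ~15.0% with these toy sizes
```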
MatAnyone can generate stable and high-quality human video matting masks.
SongGen can generate both vocals and accompaniment from text prompts using a single-stage auto-regressive transformer. It allows users to control lyrics, genre, mood, and instrumentation, and offers mixed mode for combined tracks or dual-track mode for separate tracks.
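The difference between the two modes is purely in the target token stream: mixed mode predicts codec tokens of the pre-mixed audio, while dual-track mode predicts vocal and accompaniment tokens separately, e.g. interleaved into one sequence. A toy illustration (hypothetical tokens, not SongGen's real tokenizer):

```python
# Toy illustration of SongGen's two output modes (hypothetical tokens, shapes only).
vocals = ["v0", "v1", "v2", "v3"]   # codec tokens for the vocal track
accomp = ["a0", "a1", "a2", "a3"]   # codec tokens for the accompaniment

mixed = [f"m{i}" for i in range(4)]  # mixed mode: one stream of pre-mixed audio tokens

# Dual-track mode: both streams, here interleaved into a single autoregressive sequence.
interleaved = [tok for pair in zip(vocals, accomp) for tok in pair]
print(mixed)        # ['m0', 'm1', 'm2', 'm3']
print(interleaved)  # ['v0', 'a0', 'v1', 'a1', 'v2', 'a2', 'v3', 'a3']
```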
MagicArticulate can rig static 3D models and make them ready for animation. Works on both humanoid and non-humanoid objects.
MEGASAM can estimate camera parameters and depth maps from casual monocular videos.
Step-Video-T2V can generate high-quality videos up to 204 frames long using a 30B-parameter text-to-video model.
MIGE can generate images from text prompts and reference images and edit existing images based on instructions.
Cycle3D can generate high-quality, consistent 3D content from a single unposed image, improving texture consistency and multi-view coherence in the final 3D reconstruction.
LIFe-GoM can create animatable 3D human avatars from sparse multi-view images in under 1 second. It renders high-quality images at 95.1 frames per second.
DressRecon can create 3D human body models from single videos. It handles loose clothing and objects well, achieving high-quality results by combining general human-body priors with video-specific deformations.
Google DeepMind has been researching 4DiM, a cascaded diffusion model for 4D novel view synthesis. It can generate 3D scenes with temporal dynamics from a single image and a set of camera poses and timestamps.
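At a shape level, conditioning on camera poses and timestamps just means injecting both as extra embeddings into the denoiser. A hypothetical PyTorch sketch of that conditioning pattern (4DiM's actual cascaded architecture differs and isn't public in this form):

```python
# Shape-level sketch of pose-and-time conditioning for novel view synthesis
# (illustrative pattern only, not 4DiM's architecture).
import torch
import torch.nn as nn

class PoseTimeConditionedDenoiser(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.pose_embed = nn.Linear(12, dim)  # flattened 3x4 camera extrinsics
        self.time_embed = nn.Linear(1, dim)   # timestamp of the target frame
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, latent, pose, timestamp):
        # latent: (B, N, dim) noisy target-view tokens
        cond = self.pose_embed(pose) + self.time_embed(timestamp)  # (B, dim)
        return self.block(latent + cond[:, None, :])               # broadcast over tokens

model = PoseTimeConditionedDenoiser()
out = model(torch.randn(2, 64, 256), torch.randn(2, 12), torch.rand(2, 1))
print(out.shape)  # torch.Size([2, 64, 256])
```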