AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code, plus tools for text, image, video, 3D, and audio generation and manipulation.
MoRAG can generate and retrieve human motion from text by improving motion diffusion models.
FramePainter can edit images using simple sketches and video diffusion methods. It allows for realistic changes, like altering reflections or transforming objects, while requiring less training data and generalizing well across different situations.
Yin-Yang can generate music with a clear structure and control over melodies.
One-Prompt-One-Story can generate consistent images from a single text prompt by combining all prompts into one input for text-to-image models.
Video Depth Anything can estimate depth in long videos while keeping a fast speed of 30 frames per second.
Hunyuan3D 2.0 can generate high-resolution textured 3D assets. It allows users to create and animate detailed 3D models efficiently, with improved geometry detail and texture quality compared to previous models.
X-Dyna can animate a single human image by transferring facial expressions and body movements from a video.
ReF-LDM can restore low-quality face images by using multiple high-quality reference images.
RepVideo can improve video generation by making visuals look better and ensuring smooth transitions.
VISION-XL can deblur and upscale videos using SDXL. It supports different aspect ratios and can produce HD videos in under 2.5 minutes on a single NVIDIA 4090 GPU, using only 13GB of VRAM for 25-frame videos.
Splatter a Video can turn a video into a 3D Gaussian representation, allowing for enhanced video tracking, depth prediction, motion and appearance editing, and stereoscopic video generation.
Go-with-the-Flow can control motion patterns in video diffusion models using real-time warped noise derived from optical flow fields. It allows users to manipulate object movements and camera motions while keeping high image quality, without requiring changes to existing models.
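To make the core idea concrete, here is a minimal NumPy sketch of warping a Gaussian noise field along a dense optical-flow field. This is an illustrative assumption, not Go-with-the-Flow's actual implementation: the `warp_noise` function and its nearest-neighbor backward-sampling scheme are simplifications chosen for clarity.

```python
import numpy as np

def warp_noise(noise: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a 2-D noise field along a dense optical-flow field.

    noise: (H, W) array of Gaussian noise.
    flow:  (H, W, 2) array of per-pixel (dx, dy) displacements.
    Uses nearest-neighbor backward sampling with edge clamping,
    so the noise pattern follows the motion described by the flow.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Backward warp: each output pixel samples the noise at its flow source.
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
# A constant rightward flow of 3 pixels shifts the noise pattern with it.
flow = np.zeros((64, 64, 2))
flow[..., 0] = 3.0
warped = warp_noise(noise, flow)
```

In a diffusion pipeline, noise correlated across frames like this is what lets generated content track the intended motion; the real method uses a more careful warping scheme that preserves the noise distribution.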
Chat2SVG can generate and edit SVG vector graphics from text prompts. It combines Large Language Models and image diffusion models to create detailed SVG templates and allows users to refine them with simple language instructions.
GaussianDreamerPro can generate 3D Gaussian assets from text that can be seamlessly integrated into downstream manipulation pipelines, such as animation, composition, and simulation.
Kinetic Typography Diffusion Model can generate kinetic typography videos with legible and artistic letter motions based on text prompts.
UniVG is yet another video generation system. Its highlight is the ability to use image inputs for guidance while also steering generation with additional text prompts. I haven't seen other video models do this yet.
TryOffDiff can generate high-quality images of clothing from photos of people wearing them.
LLM4GEN enhances the semantic understanding of text-to-image diffusion models by leveraging the semantic representations of LLMs. In practice, this means it handles more complex and dense prompts involving multiple objects, attribute binding, and long descriptions.
TransPixar can generate RGBA videos, enabling the creation of transparent elements like smoke and reflections that blend seamlessly into scenes.
Coin3D can generate and edit 3D assets from a basic input shape. Similar to ControlNet, it enables precise part editing and responsive 3D object previewing within a few seconds.