AI Toolbox
A curated collection of 915 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.





DreamBeast can generate unique 3D animal assets composed of distinct parts. It extracts part-level knowledge from Stable Diffusion 3 to quickly create detailed Part-Affinity maps from various camera views, improving quality while saving computing power.
DrawingSpinUp can animate 3D characters from a single 2D drawing. It removes unnecessary lines and uses a skeleton-based algorithm to allow characters to spin, jump, and dance.
DreamHOI can generate realistic 3D human-object interactions (HOIs) by posing a skinned human model to interact with objects based on text descriptions. It uses text-to-image diffusion models to create diverse interactions without needing large datasets.
TextBoost can enable one-shot personalization of text-to-image models by fine-tuning the text encoder. It generates diverse images from a single reference image while reducing overfitting and memory needs.
ProbTalk3D can generate 3D facial animations that show different emotions based on audio input! It uses a two-stage VQ-VAE model and the 3DMEAD dataset, allowing for diverse facial expressions and accurate lip-syncing.
GVHMR can recover human motion from monocular videos by estimating poses in a Gravity-View coordinate system, which is aligned with both gravity and the camera view.
NeuroPictor can improve fMRI-to-image reconstruction by using fMRI signals to control diffusion models. It is trained on over 67,000 fMRI-image pairs, allowing for better accuracy in generating images that reflect both high-level concepts and fine details.
Text2Place can place any human or object realistically into diverse backgrounds. This enables scene hallucination (generating scenes compatible with a given human pose), text-based editing of the human, and placing multiple persons into a scene.
One-DM can generate handwritten text from a single reference sample, mimicking the style of the input. It captures unique writing patterns and works well across multiple languages.
FlexiClip can generate smooth animations from clipart images while keeping key points in the right place.
Diffusion2GAN is a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. This enables one-step 512px/1024px image generation at an interactive speed of 0.09/0.16 seconds, as well as 4K image upscaling!
LinFusion can generate high-resolution images up to 16K in just one minute using a single GPU. It improves performance on various Stable Diffusion versions and works with pre-trained components like ControlNet and IP-Adapter.
ViewCrafter can generate high-quality 3D views from single or few images using a video diffusion model. It allows for precise camera control and is useful for real-time rendering and turning text into 3D scenes.
CSGO can perform image-driven style transfer and text-driven stylized synthesis. It uses a large dataset with 210k image triplets to improve style control in image generation.
HumanVid can generate videos from a character photo while allowing users to control both human and camera motions. It introduces a large-scale dataset that combines high-quality real-world and synthetic data, achieving state-of-the-art performance in camera-controllable human image animation.
Follow-Your-Canvas can outpaint videos at higher resolutions, from 512x512 to 1152x2048.
TSTMotion can generate human motion sequences aware of their surrounding 3D scene from text prompts.
LogoMotion can turn logos from layered PDF files into content-aware animated HTML canvas animations. Very cool!
KEEP can enhance video face super-resolution by maintaining consistency across frames. It uses Kalman filtering to improve facial details, working well on both synthetic and real-world videos.
tps-inbetween can generate high-quality intermediate frames for animation line art. It effectively connects lines and fills in missing details, even during fast movements, using a method that models keypoint relationships between frames.