AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
MagicMan can generate high-quality novel-view images and normal maps of humans from a single photo.
TurboEdit enables fast text-based image editing in just 3-4 diffusion steps! It improves edit quality and preserves the original image by using a shifted noise schedule and a pseudo-guidance approach, tackling issues like visual artifacts and weak edits.
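A minimal sketch of what a few-step pseudo-guided edit loop can look like, assuming a diffusers-style UNet and scheduler; the `w` weighting of source vs. target predictions is an illustrative simplification of TurboEdit's formulation, and the shifted noise schedule is left to the scheduler's timestep setup:

```python
import torch

@torch.no_grad()
def pseudo_guided_edit(unet, scheduler, latents, src_emb, tgt_emb, w=1.5, num_steps=4):
    """Few-step text-based editing: steer the denoiser from the source
    prompt toward the target prompt with a pseudo-guidance weight `w`.
    (Hedged sketch, not the paper's exact method.)"""
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        eps_src = unet(latents, t, encoder_hidden_states=src_emb).sample
        eps_tgt = unet(latents, t, encoder_hidden_states=tgt_emb).sample
        # Pseudo-guidance: amplify the edit direction relative to the source,
        # which strengthens weak edits while staying close to the original.
        eps = eps_src + w * (eps_tgt - eps_src)
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents
```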
DepthCrafter can generate long, high-quality depth map sequences for open-world videos. It uses a three-stage training strategy built on a pre-trained image-to-video diffusion model, achieving top performance in depth estimation for visual effects and video generation.
DreamBeast can generate unique 3D animal assets composed of distinct parts. It distills part-level knowledge from Stable Diffusion 3 into detailed Part-Affinity maps rendered from various camera views, improving quality while saving computing power.
DrawingSpinUp can generate 3D animations from a single 2D character drawing. It removes contour lines and uses a skeleton-based deformation algorithm so characters can spin, jump, and dance.
DreamHOI can generate realistic 3D human-object interactions (HOIs) by posing a skinned human model to interact with objects based on text descriptions. It uses text-to-image diffusion models to create diverse interactions without needing large datasets.
TextBoost can enable one-shot personalization of text-to-image models by fine-tuning the text encoder. It generates diverse images from a single reference image while reducing overfitting and memory needs.
ProbTalk3D can generate 3D facial animations that show different emotions based on audio input! It uses a two-stage VQ-VAE model and the 3DMEAD dataset, allowing for diverse facial expressions and accurate lip-syncing.
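As a rough illustration of the quantization step inside a VQ-VAE like the one ProbTalk3D stacks in two stages, here is a minimal nearest-codebook lookup in PyTorch; names and shapes are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    """Map each encoder output vector to its nearest codebook entry.
    z: (batch, dim) latents; codebook: (num_codes, dim) embeddings."""
    dists = torch.cdist(z, codebook)   # pairwise L2 distances, (batch, num_codes)
    indices = dists.argmin(dim=1)      # nearest code per latent
    z_q = codebook[indices]            # quantized latents
    # Straight-through estimator: copy gradients past the argmin
    # so the encoder still trains end to end.
    z_q = z + (z_q - z).detach()
    # Commitment loss pulls encoder outputs toward their assigned codes.
    commit_loss = F.mse_loss(z, z_q.detach())
    return z_q, indices, commit_loss
```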
GVHMR can recover human motion from monocular videos by estimating poses in a Gravity-View coordinate system, defined by the gravity direction and the camera view direction.
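A small sketch of how such a gravity-aligned viewing frame can be constructed from a gravity vector and a camera view direction; this is a plausible reading of the Gravity-View frame, not GVHMR's exact implementation:

```python
import numpy as np

def gravity_view_frame(gravity, view_dir):
    """Build an orthonormal frame whose y-axis opposes gravity and whose
    z-axis is the camera view direction projected onto the horizontal plane."""
    y = -gravity / np.linalg.norm(gravity)   # up axis, opposite gravity
    z = view_dir - np.dot(view_dir, y) * y   # drop the vertical component
    z = z / np.linalg.norm(z)                # horizontal forward axis
    x = np.cross(y, z)                       # right axis completes the frame
    return np.stack([x, y, z], axis=1)       # 3x3 rotation, right-handed

R = gravity_view_frame(np.array([0.0, -9.81, 0.0]), np.array([0.2, -0.1, 1.0]))
```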
NeuroPictor can improve fMRI-to-image reconstruction by using fMRI signals to control diffusion models. It is trained on over 67,000 fMRI-image pairs, allowing for better accuracy in generating images that reflect both high-level concepts and fine details.
Text2Place can place any human or object realistically into diverse backgrounds. This enables scene hallucination (generating compatible scenes for a given human pose), text-based editing of the subject, and placing multiple persons into a scene.
One-DM can generate handwritten text from a single reference sample, mimicking the style of the input. It captures unique writing patterns and works well across multiple languages.
FlexiClip can generate smooth animations from clipart images while preserving temporal coherence and keeping keypoints geometrically consistent.
Diffusion2GAN is a method to distill a complex multistep diffusion model into a single-step conditional GAN student, dramatically accelerating inference while preserving image quality. This enables one-step 512px/1024px image generation at an interactive speed of 0.09/0.16 seconds, as well as 4K image upscaling!
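The core idea, in a hedged sketch: train a one-step generator to reproduce, from the same noise, the image a frozen multi-step teacher produces, combining a regression loss with an adversarial loss. Here `F.mse_loss` is a generic stand-in for the paper's perceptual E-LatentLPIPS loss, and all model names are placeholders:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher_sample, disc, opt_g, opt_d, noise, cond):
    """One training step distilling a multi-step diffusion teacher into a
    one-step GAN student on (noise, teacher output) pairs."""
    with torch.no_grad():
        target = teacher_sample(noise, cond)   # full multi-step teacher rollout

    fake = student(noise, cond)                # single forward pass

    # Discriminator update: real = teacher outputs, fake = student outputs
    # (non-saturating GAN loss).
    d_loss = F.softplus(disc(fake.detach(), cond)).mean() + \
             F.softplus(-disc(target, cond)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: regression to the teacher plus adversarial term.
    g_loss = F.mse_loss(fake, target) + F.softplus(-disc(fake, cond)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return g_loss.item(), d_loss.item()
```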
LinFusion can generate high-resolution images up to 16K in just one minute using a single GPU. It improves performance on various Stable Diffusion versions and works with pre-trained components like ControlNet and IP-Adapter.
ViewCrafter can generate high-quality 3D views from single or few images using a video diffusion model. It allows for precise camera control and is useful for real-time rendering and turning text into 3D scenes.
CSGO can perform image-driven style transfer and text-driven stylized synthesis. It uses a large dataset with 210k image triplets to improve style control in image generation.
HumanVid can generate videos from a character photo while allowing users to control both human and camera motions. It introduces a large-scale dataset that combines high-quality real-world and synthetic data, achieving state-of-the-art performance in camera-controllable human image animation.
Follow-Your-Canvas can outpaint videos to higher resolutions, e.g. from 512x512 to 1152x2048.
TSTMotion can generate scene-aware human motion sequences from text prompts, grounded in the surrounding 3D scene.