AI Toolbox
A curated collection of 610 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
PhysAvatar can turn multi-view videos into high-quality 3D avatars with loose-fitting clothes. The resulting avatars can be animated and generalize well to unseen motions and lighting conditions.
InstantDrag can edit images quickly using drag instructions without needing masks or text prompts. It learns motion dynamics with a two-network system, allowing for real-time, photo-realistic editing.
3DTopia-XL can generate high-quality 3D PBR assets from text or image inputs in just 5 seconds.
Exploiting Diffusion Prior for Real-World Image Super-Resolution can restore high-quality images from low-resolution inputs using pre-trained text-to-image diffusion models. It allows users to balance image quality and fidelity through a controllable feature wrapping module and adapts to different image resolutions with a progressive aggregation sampling strategy.
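The quality-fidelity control can be pictured as blending features from the low-resolution-conditioned encoder into the diffusion decoder with a user-set coefficient. A minimal PyTorch sketch of that idea (the module name, fusion layer, and blend formula below are illustrative assumptions, not the paper's official code):

```python
import torch
import torch.nn as nn

class FeatureWrapping(nn.Module):
    """Illustrative sketch of a controllable feature-wrapping idea: blend features
    from the low-res-conditioned encoder into the diffusion decoder, with w
    trading fidelity against perceptual quality."""
    def __init__(self, channels: int):
        super().__init__()
        # hypothetical fusion layer; the real module's architecture may differ
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, dec_feat: torch.Tensor, enc_feat: torch.Tensor, w: float = 0.5):
        # w = 0 keeps the purely generative decoder features (higher perceptual quality);
        # w = 1 leans fully on the low-res-conditioned correction (higher fidelity).
        correction = self.fuse(torch.cat([dec_feat, enc_feat], dim=1))
        return dec_feat + w * correction
```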
Upscale-A-Video can upscale low-resolution videos using text prompts while keeping the video stable. It allows users to adjust noise levels for better quality and performs well in both test and real-world situations.
ExAvatar can animate expressive whole-body 3D human avatars from a short monocular video. It captures facial expressions, hand motions, and body poses in the process.
Disentangled Clothed Avatar Generation from Text Descriptions can create high-quality 3D avatars by separately modeling human bodies and clothing. This method improves texture and geometry quality and aligns well with text prompts, enhancing virtual try-on and character animation.
MagicMan can generate high-quality, consistent multi-view images and normal maps of humans from a single photo.
TurboEdit enables fast text-based image editing in just 3-4 diffusion steps! It improves edit quality and preserves the original image by using a shifted noise schedule and a pseudo-guidance approach, tackling issues like visual artifacts and weak edits.
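One way to picture the pseudo-guidance idea is as an extrapolation from the reconstruction branch toward the edit branch, with a weight that amplifies otherwise weak edits. A hedged sketch (the exact quantity TurboEdit extrapolates and the weighting scheme may differ from this):

```python
import torch

def pseudo_guidance(pred_recon: torch.Tensor, pred_edit: torch.Tensor, w: float = 1.5) -> torch.Tensor:
    # w = 1 reproduces the plain edit prediction; w > 1 strengthens weak edits
    # by pushing further along the reconstruction-to-edit direction.
    return pred_recon + w * (pred_edit - pred_recon)
```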
DepthCrafter can generate long, high-quality depth-map sequences for videos. It uses a three-stage training strategy built on a pre-trained image-to-video diffusion model, achieving top performance in video depth estimation for visual effects and video generation.
DreamBeast can generate unique 3D animal assets composed of distinct parts. It extracts part-level knowledge from Stable Diffusion 3 to quickly create detailed Part-Affinity maps from various camera views, improving quality while reducing compute costs.
DrawingSpinUp can turn a single 2D character drawing into an animatable 3D character. It removes view-dependent contour lines and uses a skeleton-based algorithm so characters can spin, jump, and dance.
DreamHOI can generate realistic 3D human-object interactions (HOIs) by posing a skinned human model to interact with objects based on text descriptions. It uses text-to-image diffusion models to create diverse interactions without needing large datasets.
TextBoost can enable one-shot personalization of text-to-image models by fine-tuning the text encoder. It generates diverse images from a single reference image while reducing overfitting and memory needs.
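In spirit, one-shot personalization here means freezing the diffusion backbone and updating only the text encoder. A minimal sketch with Hugging Face diffusers (the model ID, learning rate, and omitted training loop are assumptions, not TextBoost's released code):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

pipe.unet.requires_grad_(False)          # keep the image backbone frozen
pipe.vae.requires_grad_(False)           # keep the autoencoder frozen
pipe.text_encoder.requires_grad_(True)   # only the text encoder is fine-tuned

optimizer = torch.optim.AdamW(pipe.text_encoder.parameters(), lr=1e-5)
# ...the usual denoising loss on the single reference image would go here...
```

Training only the text encoder keeps the trainable parameter count small, which is what keeps memory needs low and overfitting in check.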
ProbTalk3D can generate 3D facial animations that show different emotions based on audio input! It uses a two-stage VQ-VAE model and the 3DMEAD dataset, allowing for diverse facial expressions and accurate lip-syncing.
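The VQ-VAE component amounts to mapping continuous motion and expression latents onto a learned discrete codebook. A generic quantization step for illustration (not ProbTalk3D's actual implementation):

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Snap each latent vector to its nearest codebook entry.
    z: (N, D) latents, codebook: (K, D) learned embeddings."""
    dists = torch.cdist(z, codebook)      # (N, K) pairwise distances
    indices = dists.argmin(dim=1)         # nearest code per latent
    z_q = codebook[indices]               # quantized latents
    return z + (z_q - z).detach()         # straight-through estimator for gradients
```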
GVHMR can recover human motion from monocular videos by estimating poses in a Gravity-View coordinate system aligned with gravity and the camera.
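The Gravity-View idea can be illustrated by building an orthonormal frame from the gravity direction and the camera's forward vector. A small NumPy sketch (axis conventions and handedness are assumptions, not necessarily GVHMR's exact definition):

```python
import numpy as np

def gravity_view_frame(gravity: np.ndarray, cam_forward: np.ndarray) -> np.ndarray:
    """Build a frame whose up-axis opposes gravity and whose forward axis is the
    camera view direction projected onto the horizontal plane."""
    up = -gravity / np.linalg.norm(gravity)            # world up from gravity
    fwd = cam_forward - np.dot(cam_forward, up) * up   # drop the vertical component
    fwd = fwd / np.linalg.norm(fwd)                    # horizontal forward
    right = np.cross(up, fwd)                          # completes a right-handed frame
    return np.stack([right, up, fwd], axis=1)          # columns are the x, y, z axes

R = gravity_view_frame(np.array([0.0, -9.8, 0.0]), np.array([0.1, -0.2, 1.0]))
```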
NeuroPictor can improve fMRI-to-image reconstruction by using fMRI signals to control diffusion models. It is trained on over 67,000 fMRI-image pairs, allowing for better accuracy in generating images that reflect both high-level concepts and fine details.
Text2Place can place any human or object realistically into diverse backgrounds. This enables scene hallucination (generating a scene compatible with a given human pose), text-based editing of the subject, and placing multiple people into a single scene.
One-DM can generate handwritten text from a single reference sample, mimicking the style of the input. It captures unique writing patterns and works well across multiple languages.
Diffusion2GAN is a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. This enables one-step 512px/1024px image generation at interactive speeds of 0.09/0.16 seconds, as well as 4K image upscaling!
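Conceptually, the student generator is trained to reproduce, in a single forward pass, the output the teacher diffusion model reaches after many sampling steps, with an adversarial term to sharpen results. A simplified sketch of such a distillation objective (the function signatures, the MSE stand-in for the paper's E-LatentLPIPS loss, and the loss weight are assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, discriminator, noise, prompt_emb, teacher_latent):
    """teacher_latent: the multi-step diffusion (ODE) output for this (noise, prompt) pair."""
    fake = student(noise, prompt_emb)              # single forward pass of the student
    recon = F.mse_loss(fake, teacher_latent)       # stand-in for the perceptual E-LatentLPIPS term
    adv = -discriminator(fake, prompt_emb).mean()  # non-saturating conditional GAN term
    return recon + 0.1 * adv                       # relative weighting is assumed
```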