AI Toolbox
A curated collection of 965 free cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
MotionCtrl is a flexible motion controller that can manage both camera and object motion in generated videos and works with VideoCrafter1, AnimateDiff, and Stable Video Diffusion.
DPM-Solver can generate high-quality samples from diffusion probabilistic models in just 10 to 20 function evaluations. It is 4 to 16 times faster than previous methods and works with both discrete-time and continuous-time models without extra training.
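The speedup comes from treating the diffusion ODE as semi-linear and integrating its linear part exactly, in the style of an exponential integrator. The sketch below is a toy 1D analogue I chose for illustration (the ODE `dx/dt = -x + cos(t)` is a stand-in, not the actual diffusion ODE or DPM-Solver code): with very few, very large steps, the exponential-Euler update stays accurate while plain Euler blows up.

```python
import math

def exact(t, x0=0.0):
    # closed-form solution of dx/dt = -x + cos(t), x(0) = x0
    return (x0 - 0.5) * math.exp(-t) + 0.5 * (math.cos(t) + math.sin(t))

def euler(x0, t_end, n):
    # plain explicit Euler: approximates the whole right-hand side
    h = t_end / n
    x, t = x0, 0.0
    for _ in range(n):
        x = x + h * (-x + math.cos(t))
        t += h
    return x

def exp_euler(x0, t_end, n):
    # exponential Euler: the linear drift e^{-h} is integrated exactly,
    # only the non-linear term cos(t) is approximated per step
    h = t_end / n
    x, t = x0, 0.0
    for _ in range(n):
        x = math.exp(-h) * x + (1.0 - math.exp(-h)) * math.cos(t)
        t += h
    return x

t_end, n = 25.0, 10  # only 10 huge steps (h = 2.5)
ref = exact(t_end)
print("euler error:    ", abs(euler(0.0, t_end, n) - ref))
print("exp-euler error:", abs(exp_euler(0.0, t_end, n) - ref))
```

With step size 2.5, plain Euler's amplification factor `|1 - h| = 1.5` makes it diverge, while the exponential update remains stable; that same "solve the linear part exactly" idea is what lets DPM-Solver take so few steps.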
AmbiGen can generate ambigrams by optimizing letter shapes so a word reads clearly from two viewing orientations. It improves word accuracy by over 11.6% and reduces edit distance by 41.9% on the 500 most common English words.
Readout Guidance can control text-to-image diffusion models using lightweight networks called readout heads. It enables pose, depth, and edge-guided generation with fewer parameters and training samples, allowing for easier manipulation and consistent identity generation.
X-Adapter can enable pretrained plugins like ControlNet and LoRA from Stable Diffusion 1.5 to work with the SDXL model without retraining. It adds trainable mapping layers for feature remapping and uses a null-text training strategy to improve compatibility and functionality.
Custom Diffusion can quickly fine-tune text-to-image diffusion models to generate new variations from just a few examples in about 6 minutes on 2 A100 GPUs. It allows for the combination of multiple concepts and requires only 75MB of storage for each additional model, which can be compressed to 5-15MB.
DiffusionMat is a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. The key innovation of the framework is a correction module that adjusts the output at each denoising step, ensuring that the final result is consistent with the input image’s structures.
Given one or more style references, StyleCrafter can generate images and videos based on these referenced styles.
4D-fy can generate high-quality 4D scenes from text prompts. It combines the strengths of text-to-image and text-to-video models to create dynamic scenes with great visual quality and realistic motion.
Material Palette can extract a palette of PBR materials (albedo, normals, and roughness) from a single real-world image. Looks very useful for creating new materials for 3D scenes or even for generating textures for 2D art.
Diffusion Motion Transfer is able to translate videos with a text prompt while maintaining the input video’s motion and scene layout.
Sketch Video Synthesis can turn videos into SVG sketches using frame-wise Bézier curves. It allows for impressive visual effects like resizing, color filling, and adding doodles to original images while maintaining a smooth flow between frames.
LucidDreamer can generate navigable 3D Gaussian Splat scenes from a single text prompt or a single image. Text prompts can also be chained for more control over the output. Can’t wait until they can also be animated.
LiveSketch can automatically add motion to a single-subject sketch from a text prompt describing the desired motion. The outputs are short SVG animations that can be easily edited.
PhysGaussian is a simulation-rendering pipeline that can simulate the physics of 3D Gaussian Splats while simultaneously rendering photorealistic results. The method supports flexible dynamics, a diverse range of materials, and collisions.
Concept Sliders is a method that allows for fine-grained control over textual and visual attributes in Stable Diffusion XL. By using simple text descriptions or a small set of paired images, artists can train concept sliders to represent the direction of desired attributes. At generation time, these sliders can be used to control the strength of the concept in the image, enabling nuanced tweaking.
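At generation time a trained slider is essentially a learned low-rank weight offset blended into the model with a user-chosen strength. A minimal sketch of that mechanic in plain Python — the matrices and the `apply_slider` helper are hypothetical illustrations, not the paper's API:

```python
def matmul(A, B):
    """Naive matrix multiply for small demo matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def apply_slider(W, down, up, strength):
    """Return W + strength * (up @ down): a low-rank 'concept direction'
    added to a weight matrix, scaled by the slider strength."""
    delta = matmul(up, down)
    return [[w + strength * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# hypothetical 2x2 weight matrix and a rank-1 concept direction
W = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 0.0]]   # 1x2 projection
up = [[0.0], [1.0]]   # 2x1 expansion
print(apply_slider(W, down, up, 0.0))  # strength 0 leaves weights unchanged
print(apply_slider(W, down, up, 2.0))  # larger strength = stronger concept
```

Because the offset is low-rank and applied linearly, sliders are cheap to store and can be scaled continuously (or combined) at inference time without retraining the base model.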
LucidDreamer is a text-to-3D generation framework that is able to generate 3D models with high-quality textures and shapes. Higher quality means longer inference. This one takes 35 minutes on an A100 GPU.
It’s been a while since I last doomed the TikTok dancers. MagicDance is gonna doom them some more. This model can combine human motion with reference images to precisely generate appearance-consistent videos. While the results still contain visible artifacts and jittering, give it a few months and I’m sure we can’t tell the difference no more.
The Chosen One can generate consistent characters in text-to-image diffusion models using just a text prompt. It improves character identity and prompt alignment, making it useful for story visualization, game development, and advertising.
3D Paintbrush can automatically add textures to specific areas on 3D models using text descriptions. It produces detailed localization and texture maps, enhancing the quality of graphics in various projects.