AI Toolbox
A curated collection of 959 free cutting-edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.
Custom Diffusion can quickly fine-tune text-to-image diffusion models to generate new variations from just a few examples in about 6 minutes on 2 A100 GPUs. It allows for the combination of multiple concepts and requires only 75MB of storage for each additional model, which can be compressed to 5-15MB.
DiffusionMat is a novel image matting framework that employs a diffusion model to refine coarse alpha mattes into detailed ones. Its key innovation is a correction module that adjusts the output at each denoising step, keeping the final matte consistent with the input image's structures.
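To illustrate the correct-every-step pattern in isolation: the toy loop below applies a stand-in denoising update to an alpha matte and, after each step, snaps the estimate back to the regions a trimap already fixes. The `denoise_step` function and the trimap-based correction are illustrative assumptions, not DiffusionMat's actual modules.

```python
# Toy illustration of "correct at every denoising step" for matting.
import numpy as np

def denoise_step(alpha, t):
    # Stand-in for one reverse-diffusion update of the matte (any model could go here).
    return np.clip(alpha + 0.1 * (0.5 - alpha), 0.0, 1.0)

def correct(alpha, trimap):
    # Enforce the known regions of the trimap: 1 = foreground, 0 = background.
    alpha = np.where(trimap == 1.0, 1.0, alpha)
    alpha = np.where(trimap == 0.0, 0.0, alpha)
    return alpha

alpha = np.random.rand(64, 64)        # coarse initial matte
trimap = np.full((64, 64), 0.5)       # 0.5 marks the unknown band
trimap[:16] = 1.0
trimap[-16:] = 0.0

for t in reversed(range(10)):
    alpha = correct(denoise_step(alpha, t), trimap)
```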
Given one or more style references, StyleCrafter can generate images and videos in those styles.
4D-fy can generate high-quality 4D scenes from text prompts. It combines the strengths of text-to-image and text-to-video models to create dynamic scenes with great visual quality and realistic motion.
Material Palette can extract a palette of PBR materials (albedo, normals, and roughness) from a single real-world image. Looks very useful for creating new materials for 3D scenes or even for generating textures for 2D art.
Diffusion Motion Transfer can translate a video according to a text prompt while preserving the input video's motion and scene layout.
Sketch Video Synthesis can turn videos into SVG sketches using frame-wise Bézier curves. It allows for impressive visual effects like resizing, color filling, and adding doodles to the original footage while keeping a smooth flow between frames.
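For a sense of what "frame-wise Bézier curves" means in practice, here is a minimal sketch that serializes one frame's cubic Bézier strokes into an SVG document. The function and the stroke data are made up for illustration and are not part of the Sketch Video Synthesis code.

```python
# Serialize one frame's cubic Bezier strokes to a standalone SVG document.
def strokes_to_svg(strokes, width=512, height=512, stroke_width=2.0):
    """strokes: list of curves, each a list of 4 (x, y) control points."""
    paths = []
    for pts in strokes:
        (x0, y0), (x1, y1), (x2, y2), (x3, y3) = pts
        d = f"M {x0} {y0} C {x1} {y1}, {x2} {y2}, {x3} {y3}"
        paths.append(
            f'<path d="{d}" fill="none" stroke="black" stroke-width="{stroke_width}"/>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(paths)
        + "</svg>"
    )

# Example: one frame containing a single S-shaped stroke.
frame = [[(50, 400), (150, 100), (350, 500), (450, 120)]]
with open("frame_000.svg", "w") as f:
    f.write(strokes_to_svg(frame))
```

Because each frame is just a set of control points, effects like resizing or recoloring reduce to transforming coordinates or editing SVG attributes.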
LucidDreamer can generate navigable 3D Gaussian Splat scenes from a single text prompt or a single image. Text prompts can also be chained for more control over the output. Can't wait until they can also be animated.
LiveSketch can automatically add motion to a single-subject sketch based on a text prompt describing the desired motion. The outputs are short SVG animations that can be easily edited.
PhysGaussian is a simulation-rendering pipeline that can simulate the physics of 3D Gaussian Splats while simultaneously rendering photorealistic results. The method supports flexible dynamics and a diverse range of materials, as well as collisions.
Concept Sliders is a method that allows for fine-grained control over textual and visual attributes in Stable Diffusion XL. By using simple text descriptions or a small set of paired images, artists can train concept sliders to represent the direction of desired attributes. At generation time, these sliders can be used to control the strength of the concept in the image, enabling nuanced tweaking.
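Conceptually, a slider behaves like a low-rank weight offset whose scale is chosen at generation time. The sketch below shows that idea on a single linear layer; the class and parameter names are illustrative assumptions rather than the Concept Sliders reference implementation.

```python
# Hedged sketch: a learned low-rank offset on top of a frozen layer, with a
# user-set scale ("slider") controlling how strongly the attribute is applied.
import torch
import torch.nn as nn

class SliderLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base                                  # frozen pretrained projection
        out_f, in_f = base.weight.shape
        self.down = nn.Linear(in_f, rank, bias=False)     # learned during slider training
        self.up = nn.Linear(rank, out_f, bias=False)      # learned during slider training
        nn.init.zeros_(self.up.weight)                    # start as a no-op offset
        self.scale = 0.0                                  # the "slider" value at inference

    def forward(self, x):
        # Base output plus the attribute direction, strengthened or weakened by scale.
        return self.base(x) + self.scale * self.up(self.down(x))

layer = SliderLinear(nn.Linear(320, 320))
layer.scale = 1.5   # push the attribute harder; negative values push the other way
y = layer(torch.randn(1, 320))
```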
LucidDreamer is a text-to-3D generation framework that is able to generate 3D models with high-quality textures and shapes. Higher quality means longer inference. This one takes 35 minutes on an A100 GPU.
It’s been a while since I last doomed the TikTok dancers. MagicDance is gonna doom them some more. This model can combine human motion with reference images to precisely generate appearance-consistent videos. While the results still contain visible artifacts and jittering, give it a few months and I’m sure we can’t tell the difference no more.
The Chosen One can generate consistent characters in text-to-image diffusion models using just a text prompt. It improves character identity and prompt alignment, making it useful for story visualization, game development, and advertising.
3D Paintbrush can automatically add textures to specific areas on 3D models using text descriptions. It produces detailed localization and texture maps, enhancing the quality of graphics in various projects.
InterpAny-Clearer is a video frame interpolation method that generates clearer and sharper frames than existing methods. It also introduces the ability to manipulate the interpolation of objects in a video independently, which could be useful for video editing tasks.
I2VGen-XL can generate high-quality videos from static images using a cascaded diffusion model. It achieves a resolution of 1280x720 and improves the flow of movement in videos through a two-stage process that separates detail enhancement from overall coherence.
Consistent4D is an approach for generating 4D dynamic objects from uncalibrated monocular videos. With the speed we're progressing at, it looks like dynamic 3D scenes from single-camera videos will arrive sooner than I expected just a few weeks ago.
Mesh Neural Cellular Automata (MeshNCA) is a method for directly synthesizing dynamic textures on 3D meshes without requiring any UV maps. The model can be trained using different targets such as images, text prompts, and motion vector fields. Additionally, MeshNCA allows several user interactions including texture density/orientation control, a grafting brush, and motion speed/direction control.
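As a rough picture of how a cellular automaton can live directly on mesh vertices instead of a UV grid: each vertex carries a state vector, a perception step aggregates neighbor states over mesh edges, and a small network proposes a residual update. This is a generic NCA-style sketch under those assumptions, not the MeshNCA architecture.

```python
# Generic NCA-style update on a mesh graph: per-vertex states, neighbor aggregation
# over edges, and a learned residual update.
import torch
import torch.nn as nn

class VertexNCA(nn.Module):
    def __init__(self, channels: int = 16, hidden: int = 64):
        super().__init__()
        self.update = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, state, edges):
        # state: (V, C) per-vertex features; edges: (E, 2) directed vertex index pairs
        src, dst = edges[:, 0], edges[:, 1]
        neigh = torch.zeros_like(state)
        deg = torch.zeros(state.shape[0], 1)
        neigh.index_add_(0, dst, state[src])               # sum of neighbor states
        deg.index_add_(0, dst, torch.ones(len(src), 1))    # neighbor counts
        perception = torch.cat([state, neigh / deg.clamp(min=1)], dim=-1)
        return state + self.update(perception)             # residual CA update

V = 100
state = torch.rand(V, 16)
edges = torch.randint(0, V, (300, 2))
state = VertexNCA()(state, edges)   # one CA step; repeat to evolve the texture
```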
VideoDreamer is a framework that can generate videos containing the given subjects while simultaneously conforming to text prompts.