AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
Reduce, Reuse, Recycle can enable compositional generation using energy-based diffusion models and MCMC samplers. It improves tasks like classifier-guided ImageNet modeling and text-to-image generation by introducing new samplers that enhance performance.
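A minimal sketch of the underlying idea: two score functions are summed inside a Langevin (MCMC) sampler so that samples satisfy both concepts at once. The toy scores below are stand-ins, not the paper's annealed or Hamiltonian samplers.

```python
import torch

def score_a(x, t):
    # Stand-in for the score (gradient of the log-density) of "concept A".
    return -(x - 2.0)

def score_b(x, t):
    # Stand-in for the score of "concept B".
    return -(x + 1.0)

x = torch.randn(4)
step = 0.05
for _ in range(200):
    noise = torch.randn_like(x) * (2 * step) ** 0.5
    # Unadjusted Langevin step on the summed score, i.e. the product of the two densities.
    x = x + step * (score_a(x, 0) + score_b(x, 0)) + noise

print(x)  # samples concentrate near 0.5, the mode of the composed density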
Entity-Level Text-Guided Image Manipulation can edit specific parts of an image based on text descriptions while keeping other areas unchanged. It uses a two-step process for aligning meanings and making changes, allowing for flexible and precise editing.
MultiDiffusion can generate high-quality images using a pre-trained text-to-image diffusion model. It allows users to control aspects like image size and includes features for guiding images with segmentation masks and bounding boxes.
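For intuition, the Hugging Face diffusers library includes a panorama pipeline that implements MultiDiffusion-style fusion of overlapping denoising windows; the model ID below is illustrative and a CUDA GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-base"  # assumed base checkpoint
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo of the dolomites",
    height=512,
    width=2048,  # non-square sizes are the point: windows are denoised separately and fused
).images[0]
image.save("panorama.png")
```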
Projected Latent Video Diffusion Models (PVDM) can generate high-resolution and smooth videos in a low-dimensional space. It achieves a top score of 639.7 on the UCF-101 benchmark, greatly surpassing previous methods.
Single Motion Diffusion can generate realistic animations from one input motion sequence. It allows for motion expansion, style transfer, and crowd animation, while using a lightweight design to create diverse motions efficiently.
ControlNet can add control to text-to-image diffusion models. It lets users guide image generation with conditions such as edge maps and depth maps, and it works well with both small and large training datasets.
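A hedged sketch of driving a Canny-edge ControlNet through diffusers (the widely used integration, not the original repository); the checkpoint names and the input file are assumptions.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Canny-edge ControlNet; model IDs are commonly published checkpoints, adjust as needed.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Turn any input photo into an edge map that constrains the layout of the output.
edges = cv2.Canny(np.array(Image.open("input.png").convert("RGB")), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    "a futuristic city at night", image=control_image, num_inference_steps=30
).images[0]
image.save("controlled.png")
```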
Neural Congealing can align similar content across multiple images using a self-supervised method. It uses pre-trained DINO-ViT features to create a shared semantic map, allowing for effective alignment even with different appearances and backgrounds.
Hard Prompts Made Easy can automatically generate and optimize hard text-based prompts for text-to-image and text-to-text applications. It helps users tune models for classification and create image concepts without needing prior prompting knowledge, using efficient gradient-based optimization.
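A toy sketch of the core trick: optimize a continuous prompt with gradients while repeatedly projecting it onto real vocabulary tokens. The embedding table, target feature, and scoring function here are random stand-ins rather than CLIP, so this shows the optimization pattern only.

```python
import torch

vocab_size, dim, prompt_len = 1000, 64, 8
embedding_table = torch.randn(vocab_size, dim)   # stand-in for the model's token embeddings
target_feature = torch.randn(dim)                # stand-in for an image/text feature to match

soft_prompt = torch.randn(prompt_len, dim, requires_grad=True)
optimizer = torch.optim.Adam([soft_prompt], lr=0.1)

def project(embeds):
    # Snap each soft embedding to its nearest vocabulary entry (the discrete "hard" prompt).
    token_ids = torch.cdist(embeds, embedding_table).argmin(dim=-1)
    return token_ids, embedding_table[token_ids]

for _ in range(200):
    _, hard = project(soft_prompt.detach())
    # Score the projected (hard) prompt, but route gradients to the soft prompt.
    scored = hard + (soft_prompt - soft_prompt.detach())
    loss = -torch.cosine_similarity(scored.mean(dim=0), target_feature, dim=0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

token_ids, _ = project(soft_prompt.detach())
print("optimized token ids:", token_ids.tolist())
```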
Pix2Pix-Zero can edit images in real time, such as turning a cat into a dog, without needing extra text prompts or training. It keeps the original image’s structure and uses pre-trained text-to-image diffusion models for better editing results.
TEXTure can generate and edit seamless textures for 3D shapes using text prompts. It uses a depth-to-image diffusion model to create consistent textures from different angles and allows for refinement based on user input.
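As a rough sketch of the building block, a single depth-conditioned diffusion step over one rendered view can be run with diffusers' depth-to-image pipeline; TEXTure itself iterates such steps over many views with its own consistency machinery, and the file names below are hypothetical.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

# Public Stable Diffusion depth model, not the TEXTure codebase itself.
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

render = Image.open("mesh_render_view0.png").convert("RGB")  # hypothetical rendered view
image = pipe(
    prompt="a rusty metal robot, photorealistic texture",
    image=render,
    strength=0.9,             # how much of the render gets repainted
    num_inference_steps=30,
).images[0]
image.save("textured_view0.png")
```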
SceneDreamer can generate endless 3D scenes from 2D image collections. It creates photorealistic images with clear depth and allows for free camera movement in the environments.
Dreamix can edit videos based on a text prompt while keeping colors, sizes, and camera angles consistent. It combines low-resolution video data with high-quality content, allowing for advanced editing of motion and appearance.
SceneScape can generate long videos of different scenes from text prompts and camera angles. It ensures 3D consistency by building a unified mesh of the scene, allowing for realistic walkthroughs in places like spaceships and caves.
Shape-aware Text-driven Layered Video Editing can edit the shape of objects in videos while keeping them consistent across frames. It uses a text-conditioned diffusion model to achieve this, making video editing more effective than other methods.
StyleGAN-T can generate high-quality images at 512x512 resolution in just 2 seconds on a single NVIDIA A100 GPU. It addresses key requirements of large-scale text-to-image synthesis, such as stable training on diverse datasets and strong text alignment.
RecolorNeRF can change colors in 3D scenes while keeping views consistent. It decomposes scenes into pure-colored layers, allowing for easy color adjustments and producing realistic results that outperform other methods.
Msanii can create high-quality music tracks up to 190 seconds long at a sample rate of 44.1 kHz. It uses a diffusion-based method to combine mel spectrograms and neural vocoders, allowing for audio-to-audio style transfer and smooth transitions between audio samples.
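A minimal sketch of the mel-spectrogram domain such a model works in, using torchaudio; the diffusion model and the neural vocoder themselves are assumed and omitted here.

```python
import torch
import torchaudio

sample_rate = 44100
waveform = torch.randn(1, sample_rate * 5)  # stand-in for 5 seconds of audio
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=2048, hop_length=512, n_mels=128
)
mel = to_mel(waveform)
print(mel.shape)  # (1, 128, frames): a 2-D "image" a diffusion model can learn to denoise
# A neural vocoder (e.g. HiFi-GAN) would then map generated mel frames back to a waveform.
```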
Robust Dynamic Radiance Fields can estimate both static and dynamic radiance fields along with camera parameters. It improves view synthesis from difficult videos, achieving better quality and accuracy than current top methods.
Tune-A-Video can generate videos from a single text-video pair by fine-tuning text-to-image diffusion models. It lets users change subjects, backgrounds, and styles while keeping the video content consistent.