AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
Reference-based Image Composition with Sketch via Structure-aware Diffusion Model can edit images by filling in missing parts using a reference image and a sketch. This method improves editability and allows for detailed changes in various scenes.
AvatarCraft can turn a text prompt into a high-quality 3D human avatar. It allows users to control the avatar’s shape and pose, making it easy to animate and reshape without retraining.
vid2vid-zero can edit videos without needing extra training on video data. It uses image diffusion models for text-to-video alignment and keeps the original video’s look and feel, allowing for effective changes to scenes and subjects.
PAIR Diffusion is a generic framework that enables a diffusion model to control the structure and appearance of each object in an image. This allows object-level editing operations on real images such as reference-based appearance editing, free-form shape editing, adding objects, and object variations.
HyperDiffusion can generate high-quality 3D shapes and 4D mesh animations using a unified diffusion model. This method allows for the creation of complex objects and dynamic scenes from a single framework, making it versatile and efficient.
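The core idea behind HyperDiffusion is to overfit a small neural-field MLP per shape, flatten its weights into one vector, and train the diffusion model on a dataset of such vectors. A rough sketch of that representation step; the MLP architecture and sizes here are my own assumptions, not the paper's:

```python
# Rough sketch of HyperDiffusion's representation: flatten the weights of an
# overfitted neural-field MLP into one vector, so a diffusion model can be
# trained over many such vectors. Architecture and sizes are assumptions.
import torch
import torch.nn as nn

def make_field():
    # Tiny occupancy-field MLP: 3D coordinate -> occupancy logit
    return nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, 1))

def flatten_weights(mlp):
    return torch.cat([p.detach().flatten() for p in mlp.parameters()])

def load_weights(mlp, vec):
    i = 0
    for p in mlp.parameters():
        n = p.numel()
        p.data.copy_(vec[i:i + n].view_as(p))
        i += n

field = make_field()
w = flatten_weights(field)    # one training sample for the diffusion model
print(w.shape)                # torch.Size([17153])

field2 = make_field()
load_weights(field2, w)       # a sampled vector can be loaded back and queried
```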
PAniC-3D can reconstruct 3D character heads from single-view anime portraits. It uses a line-filling model and a volumetric radiance field, achieving better results than previous methods and setting a new standard for stylized reconstruction.
Latent Diffusion Models (LDMs) are high-resolution image generators that can inpaint, generate images from text or bounding-box layouts, and perform super-resolution.
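LDMs are the backbone of Stable Diffusion, so the quickest way to try one is Hugging Face's diffusers library. A minimal sketch, assuming a CUDA GPU and the usual runwayml checkpoints (swap in any compatible model IDs):

```python
# Minimal sketch: text-to-image and inpainting with a Latent Diffusion Model
# via diffusers. Assumes a CUDA GPU; model IDs are assumptions.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline
from PIL import Image

# Text-to-image generation
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("txt2img.png")

# Inpainting: fill the white region of the mask using the prompt
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
result = inpaint(
    prompt="a red fox sitting on the grass",
    image=Image.open("scene.png").convert("RGB"),
    mask_image=Image.open("mask.png").convert("L"),
).images[0]
result.save("inpainted.png")
```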
Make-It-3D can create high-quality 3D content from a single image by estimating 3D shapes and adding textures. It uses a two-step process with a trained 2D diffusion model, allowing for text-to-3D creation and detailed texture editing.
eDiff-I can generate high-resolution images from text prompts using an ensemble of expert diffusion models, each specialized for a different stage of the denoising process. It also lets users control composition by selecting words from the prompt and painting where they should appear on a canvas.
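There is no public eDiff-I code, but the routing idea is simple: at each sampling step, pick the expert denoiser whose noise-level range contains the current timestep. An illustrative sketch with made-up experts and boundaries:

```python
# Illustrative sketch of eDiff-I's ensemble-of-experts idea: route each
# denoising step to the model specialized for that noise range. Experts
# and boundaries are made up; the official model is not released.
import torch

def pick_expert(t, experts, boundaries):
    """Return the expert whose noise interval contains timestep t.
    boundaries are descending; experts has one more entry than boundaries."""
    for bound, expert in zip(boundaries, experts):
        if t >= bound:
            return expert
    return experts[-1]

# Dummy experts standing in for specialized denoisers
high_noise = lambda x, t: x * 0.9    # shapes global layout early on
low_noise = lambda x, t: x * 0.99    # refines fine detail late
experts, boundaries = [high_noise, low_noise], [500]

x = torch.randn(1, 4, 64, 64)        # latent being denoised
for t in [900, 700, 300, 100]:       # descending timesteps
    x = pick_expert(t, experts, boundaries)(x, t)
```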
Text2Video-Zero can generate high-quality videos from text prompts using existing text-to-image diffusion models. It adds motion dynamics and cross-frame attention, making it useful for conditional video generation and instruction-guided video editing.
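The anchoring trick is cross-frame attention: every frame keeps its own queries but attends to the keys and values of the first frame, which holds appearance steady across the clip. A minimal PyTorch sketch of the idea; shapes and names are illustrative, not the paper's code:

```python
# Illustrative cross-frame attention, the core trick behind Text2Video-Zero:
# each frame uses its own queries but the keys/values of the first frame,
# anchoring appearance across the clip. Shapes are assumptions.
import torch

def cross_frame_attention(q, k, v):
    """q, k, v: (frames, tokens, dim). All frames attend to frame 0."""
    f, n, d = q.shape
    k0 = k[0:1].expand(f, n, d)      # broadcast first frame's keys
    v0 = v[0:1].expand(f, n, d)      # ...and values to all frames
    attn = torch.softmax(q @ k0.transpose(-2, -1) / d**0.5, dim=-1)
    return attn @ v0

frames = 8
q = torch.randn(frames, 64, 320)
out = cross_frame_attention(q, torch.randn_like(q), torch.randn_like(q))
print(out.shape)  # torch.Size([8, 64, 320])
```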
Vox-E can edit 3D objects by changing their shape and appearance based on text prompts. It uses a special method to keep the edited object connected to the original, allowing for both big and small changes.
MeshDiffusion can generate realistic 3D meshes using a score-based diffusion model with deformable tetrahedral grids. It is great for creating detailed 3D shapes from single images and can also add textures, making it useful for various applications.
Blind Video Deflickering by Neural Filtering with a Flawed Atlas can remove flicker from videos without needing extra guidance. It works well on different types of videos and uses a neural atlas for better consistency, outperforming other methods.
3DFuse can improve 3D scene generation by adding 3D awareness to 2D diffusion models. It builds a rough 3D structure from text prompts and uses depth maps for better realism in reconstructions.
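3DFuse's own pipeline isn't shown here; as a stand-in for the depth-conditioning part, diffusers' ControlNet demonstrates how a depth map can steer a 2D diffusion model. Model IDs are assumptions:

```python
# Stand-in example: depth-conditioned generation with diffusers' ControlNet
# (not 3DFuse's own pipeline), showing how a depth map can steer a 2D
# diffusion model. Assumes a CUDA GPU; model IDs are assumptions.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

depth = Image.open("depth_map.png").convert("RGB")  # coarse scene depth
image = pipe("a cozy cabin in a forest", image=depth).images[0]
image.save("depth_guided.png")
```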
3D Cinemagraphy can turn a single still image into a video by adding motion and depth. It uses 3D space to create realistic animations and fix common issues like artifacts and inconsistent movements.
X-Avatar can capture the full expressiveness of digital humans for lifelike experiences in telepresence and AR/VR. It uses full 3D scans or RGB-D data and outperforms other methods in animation tasks, supported by a new dataset with 35,500 high-quality frames.
Video-P2P can edit videos using advanced techniques like word swap and prompt refinement. It adapts image generation models for video, allowing for the creation of new characters while keeping original poses and scenes.
PriorMDM can generate long human motion sequences of up to 10 minutes using a pre-trained diffusion model. It allows for controlled transitions between prompted intervals and can create two-person motions with just 14 training examples, using techniques like DiffusionBlending for better control.
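A hypothetical sketch of the windowed idea behind those long sequences: denoise overlapping motion windows and crossfade the shared frames so adjacent windows agree. Window sizes and the linear weighting here are made up, not PriorMDM's code:

```python
# Hypothetical sketch of blending two overlapping motion windows, in the
# spirit of PriorMDM's DiffusionBlending: crossfade the shared frames so
# adjacent windows agree. Sizes and the linear weighting are assumptions.
import torch

def blend_windows(a, b, overlap):
    """a, b: (frames, joints, 3) motion windows; the last `overlap` frames
    of `a` cover the same time span as the first `overlap` frames of `b`."""
    w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1)
    mixed = (1 - w) * a[-overlap:] + w * b[:overlap]
    return torch.cat([a[:-overlap], mixed, b[overlap:]], dim=0)

a = torch.randn(120, 22, 3)   # ~4 s at 30 fps, 22 joints
b = torch.randn(120, 22, 3)
long_motion = blend_windows(a, b, overlap=30)
print(long_motion.shape)      # torch.Size([210, 22, 3])
```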
100kb models? Combining multiple individually learned concepts? 1-shot personalization? Key-Locking? Perfusion just might be a new viable Stable Diffusion fine-tuning method by NVIDIA. No way to try it out yet since, as usual, there is no code, but I'm keeping an eye on this one.
Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models can quickly personalize text-to-image models using just one image and only 5 training steps. This method reduces training time from minutes to seconds while maintaining quality through regularized weight-offsets.
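The regularized weight-offsets amount to learning small deltas on top of frozen base weights while penalizing their magnitude, so the personalized model stays close to the prior. A hedged sketch of that pattern; the layer choice, scale, and penalty weight are assumptions, not the paper's code:

```python
# Hedged sketch of regularized weight-offsets: personalize by learning small
# deltas over frozen base weights, with an L2 penalty keeping them small.
# Layer choice, scale, and penalty weight are assumptions.
import torch
import torch.nn as nn

class OffsetLinear(nn.Module):
    def __init__(self, base: nn.Linear, scale: float = 0.1):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained layer
        self.delta = nn.Parameter(torch.zeros_like(base.weight))
        self.scale = scale

    def forward(self, x):
        return nn.functional.linear(
            x, self.base.weight + self.scale * self.delta, self.base.bias)

    def reg_loss(self):
        return self.delta.pow(2).mean()          # keep offsets small

layer = OffsetLinear(nn.Linear(320, 320))
x = torch.randn(4, 320)
loss = layer(x).pow(2).mean() + 0.01 * layer.reg_loss()
loss.backward()
```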