AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
PAniC-3D can reconstruct 3D character heads from single-view anime portraits. It uses a line-filling model and a volumetric radiance field, achieving better results than previous methods and setting a new standard for stylized reconstruction.
Latent Diffusion Models (LDMs) can generate high-resolution images by running the diffusion process in a compressed latent space. They support text-to-image synthesis, inpainting, layout-to-image generation from bounding boxes, and super-resolution.
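Latent diffusion is most easily tried through the open-source diffusers library rather than the original LDM codebase; the minimal sketch below assumes the `StableDiffusionPipeline` wrapper and the `runwayml/stable-diffusion-v1-5` checkpoint as one example of a pretrained latent diffusion model.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly available latent diffusion checkpoint (model id is an example).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Text-to-image: the prompt is encoded, denoising runs in latent space,
# and the VAE decoder maps the final latent back to pixels.
image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    num_inference_steps=30,
).images[0]
image.save("lighthouse.png")
```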
Make-It-3D can create high-quality 3D content from a single image by estimating geometry and adding textures. It uses a two-stage optimization guided by a pretrained 2D diffusion model as a prior, allowing for text-to-3D creation and detailed texture editing.
eDiff-I can generate high-resolution images from text prompts using an ensemble of expert denoisers, each specialized for a different stage of the denoising process. It also lets users control layout by selecting words from the prompt and painting them onto regions of a canvas.
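eDiff-I itself has no public implementation, so the snippet below is only a conceptual sketch of the ensemble-of-expert-denoisers idea: every name (the `experts` list, the `boundaries` thresholds) is hypothetical, and the routing simply picks the denoiser trained for the current noise interval.

```python
def ensemble_denoise(x_t, t, text_emb, experts, boundaries):
    """Route one denoising step to the expert trained for this noise interval.

    experts    -- denoiser networks ordered from high-noise to low-noise stages
    boundaries -- descending timestep thresholds, one fewer than experts
    (all names here are hypothetical; eDiff-I is not open source)
    """
    idx = sum(1 for b in boundaries if t < b)  # later, less noisy steps map to later experts
    return experts[idx](x_t, t, text_emb)      # predicted noise from the selected expert
```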
Text2Video-Zero can generate high-quality videos from text prompts using existing text-to-image diffusion models. It adds motion dynamics and cross-frame attention, making it useful for conditional video generation and instruction-guided video editing.
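Text2Video-Zero is exposed in the diffusers library as `TextToVideoZeroPipeline`; assuming that wrapper and a Stable Diffusion 1.5 checkpoint, a minimal zero-shot text-to-video run might look like this sketch.

```python
import imageio
import torch
from diffusers import TextToVideoZeroPipeline

# Reuse an ordinary text-to-image checkpoint; no video training is required.
pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a panda surfing a wave, cartoon style"
frames = pipe(prompt=prompt, video_length=8).images        # list of HxWx3 arrays in [0, 1]
frames = [(f * 255).astype("uint8") for f in frames]
imageio.mimsave("panda_surfing.mp4", frames, fps=4)         # write the short clip to disk
```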
Vox-E can edit 3D objects by changing their shape and appearance based on text prompts. It uses a volumetric regularization loss to keep the edited result faithful to the original object, allowing for both large and subtle changes.
MeshDiffusion can generate realistic 3D meshes using a score-based diffusion model with deformable tetrahedral grids. It is great for creating detailed 3D shapes from single images and can also add textures, making it useful for various applications.
Blind Video Deflickering by Neural Filtering with a Flawed Atlas can remove flicker from videos without needing extra guidance. It works well on different types of videos and uses a neural atlas for better consistency, outperforming other methods.
3DFuse can improve 3D scene generation by adding 3D awareness to 2D diffusion models. It builds a rough 3D structure from text prompts and uses depth maps for better realism in reconstructions.
3D Cinemagraphy can turn a single still image into a video by adding motion and depth. It uses 3D space to create realistic animations and fix common issues like artifacts and inconsistent movements.
X-Avatar can capture the full expressiveness of digital humans for lifelike experiences in telepresence and AR/VR. It uses full 3D scans or RGB-D data and outperforms other methods in animation tasks, supported by a new dataset with 35,500 high-quality frames.
Video-P2P can edit videos using advanced techniques like word swap and prompt refinement. It adapts image generation models for video, allowing for the creation of new characters while keeping original poses and scenes.
PriorMDM can generate long human motion sequences of up to 10 minutes using a pre-trained diffusion model. It allows for controlled transitions between prompted intervals and can create two-person motions with just 14 training examples, using techniques like DiffusionBlending for better control.
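PriorMDM's code is published, but the sketch below does not reproduce its API; it only illustrates the DiffusionBlending idea as a classifier-free-guidance-style interpolation between two differently conditioned denoising predictions, with all names hypothetical.

```python
def diffusion_blend(model, x_t, t, cond_a, cond_b, w):
    """Blend two controlled denoising predictions at one sampling step.

    cond_a / cond_b -- two control signals (e.g. root trajectory vs. end-effector)
    w               -- blend weight; values beyond [0, 1] extrapolate, as in
                       classifier-free guidance
    (illustrative names only, not the released PriorMDM interface)
    """
    pred_a = model(x_t, t, cond_a)
    pred_b = model(x_t, t, cond_b)
    return pred_a + w * (pred_b - pred_a)
```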
Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models can quickly personalize text-to-image models using just one image and only 5 training steps. This method reduces training time from minutes to seconds while maintaining quality through regularized weight-offsets.
Reduce, Reuse, Recycle can enable compositional generation using energy-based diffusion models and MCMC samplers. It improves tasks like classifier-guided ImageNet modeling and text-to-image generation by introducing new samplers that enhance performance.
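As a rough illustration of the compositional idea (not the paper's implementation), composing concepts in the score/energy view amounts to summing per-concept score estimates before each sampling step; the MCMC correction the paper adds (e.g. ULA or HMC samplers) is omitted here, and all names are illustrative.

```python
def composed_score(score_fns, x_t, t, weights=None):
    """Compose concepts by summing their individual score (noise) estimates.

    score_fns -- per-concept score functions, e.g. one per text prompt
    weights   -- optional per-concept weights (defaults to 1.0 each)
    (the paper further corrects samples from this composed distribution
    with MCMC samplers; that step is not shown in this sketch)
    """
    weights = weights or [1.0] * len(score_fns)
    return sum(w * f(x_t, t) for w, f in zip(weights, score_fns))
```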
Entity-Level Text-Guided Image Manipulation can edit specific parts of an image based on text descriptions while keeping other areas unchanged. It uses a two-step process for aligning meanings and making changes, allowing for flexible and precise editing.
MultiDiffusion can generate high-quality images using a pre-trained text-to-image diffusion model without further training. It lets users control the output resolution and aspect ratio (e.g., panoramas) and supports region-based guidance with segmentation masks and bounding boxes.
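MultiDiffusion's panorama mode has an implementation in diffusers; the sketch below assumes the `StableDiffusionPanoramaPipeline` wrapper with the `stabilityai/stable-diffusion-2-base` checkpoint, mirroring the library's documented usage.

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPanoramaPipeline

model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_ckpt, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# A wide output: MultiDiffusion fuses overlapping diffusion windows into one panorama.
image = pipe("a photo of the dolomites", height=512, width=2048).images[0]
image.save("dolomites_panorama.png")
```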
Projected Latent Video Diffusion Models (PVDM) can generate high-resolution, temporally coherent videos by working in a low-dimensional latent space. It reports an FVD score of 639.7 on the UCF-101 benchmark, greatly surpassing previous methods.
Single Motion Diffusion can generate realistic animations from one input motion sequence. It allows for motion expansion, style transfer, and crowd animation, while using a lightweight design to create diverse motions efficiently.