AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
Latent Consistency Models can generate high-resolution images in just 2-4 steps, making text-to-image generation much faster than traditional methods. They require only 32 A100 GPU hours for training on a 768x768 resolution, which is efficient for high-quality results.
DREAM can reconstruct images seen by a person from their brain activity using an fMRI-to-image method. It decodes important details like color and depth, and it performs better than other models in keeping the appearance and structure consistent.
HumanNorm is a novel approach for high-quality and realistic 3D human generation that leverages normal maps, which enhance the 2D perception of 3D geometry. The results are quite impressive, comparable to PS3-era game graphics.
Ground-A-Video can edit multiple attributes of a video using pre-trained text-to-image models without any training. It maintains consistency across frames and accurately preserves non-target areas, making it more effective than other editing methods.
DA-CLIP is a method that can be used to restore images. Apart from inpainting, it can restore images by dehazing, deblurring, denoising, deraining, and desnowing them, as well as removing unwanted shadows and raindrops or enhancing lighting in low-light images.
PIXART-α can generate high-quality images at resolutions of up to 1024px. It cuts training time to 10.8% of Stable Diffusion v1.5's, costing about $26,000 and emitting 90% less CO2.
LLM-grounded Video Diffusion Models can generate realistic videos from complex text prompts. They first create dynamic scene layouts with a large language model, which helps guide the video creation process, resulting in better accuracy for object movements and actions.
DreamGaussian can generate high-quality textured meshes from a single-view image in just 2 minutes. It uses a 3D Gaussian Splatting model for fast mesh extraction and texture refinement.
AnimeInbet is a method that can generate inbetween frames for cartoon line drawings. Seeing this, we'll hopefully be blessed with higher-framerate anime in the near future.
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation can generate diverse and realistic videos that match natural audio samples. It uses a lightweight adaptor network to improve alignment and visual quality compared to other methods.
Show-1 can generate high-quality videos with accurate text-video alignment. It uses only 15 GB of GPU memory during inference, much less than the 72 GB needed by comparable models.
PGDiff can restore and colorize faces from low-quality images by using details from high-quality images. It effectively fixes issues like scratches and blurriness.
Generative Repainting can paint 3D assets using text prompts. It uses pretrained 2D diffusion models and 3D neural radiance fields to create high-quality textures for various 3D shapes.
TECA can generate realistic 3D avatars from text descriptions. It combines traditional 3D meshes for faces and bodies with neural radiance fields (NeRF) for hair and clothing, allowing for high-quality, editable avatars and easy feature transfer between them.
InstaFlow can generate high-quality images in just one step, achieving an FID of 23.3 on MS COCO 2017-5k. It works very fast at about 0.09 seconds per image, using much less computing power than traditional diffusion models.
ProPainter is a new video inpainting method that is able to remove objects, complete masked videos, remove watermarks and even expand the view of a video.
Another video synthesis model that caught my eye this week is Reuse and Diffuse. The novel framework for text-to-video generation adds the ability to generate more frames from an initial video clip by reusing and iterating over the original latent features. Can’t wait to give this one a try.
SyncDreamer is able to generate multiview-consistent images from a single-view image and thus is able to generate 3D models from 2D designs and hand drawings. It wasn’t able to help me in my quest to turn my PFP into a 3D avatar, but someday I’ll get there!
Hierarchical Masked 3D Diffusion Model for Video Outpainting can fill in missing parts at the edges of video frames while keeping the motion smooth. Its coarse-to-fine hierarchical approach reduces artifacts by conditioning on multiple frames at once.
Total Selfie can generate high-quality full-body selfies from close-up selfies and background images. It uses a diffusion-based approach to combine these inputs, creating realistic images in desired poses and overcoming the limits of traditional selfies.