AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.
PIXART-α can generate high-quality images at resolutions of up to 1024px. Its training takes just 10.8% of the time needed by Stable Diffusion v1.5, costing about $26,000 and emitting 90% less CO2.
LLM-grounded Video Diffusion Models can generate realistic videos from complex text prompts. A large language model first drafts dynamic scene layouts, which then guide the video diffusion process, yielding more accurate object movements and actions.
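The two-stage idea can be pictured as: ask an LLM for per-frame object layouts, then rasterize those layouts into spatial conditioning for a video diffusion model. The JSON format, helper function and mask resolution below are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of the two-stage idea behind LLM-grounded video diffusion
# (hypothetical layout format; the actual model interface differs).
import json
import numpy as np

# Stage 1: an LLM is prompted to return per-frame bounding boxes for each object.
# Here we just hard-code what such a response might look like.
llm_layout = json.loads("""
{
  "prompt": "a cat walking from left to right",
  "frames": [
    {"t": 0,  "boxes": [{"object": "cat", "xyxy": [0.05, 0.55, 0.25, 0.85]}]},
    {"t": 8,  "boxes": [{"object": "cat", "xyxy": [0.40, 0.55, 0.60, 0.85]}]},
    {"t": 15, "boxes": [{"object": "cat", "xyxy": [0.75, 0.55, 0.95, 0.85]}]}
  ]
}
""")

def boxes_to_mask(boxes, h=64, w=64):
    """Rasterize normalized xyxy boxes into a binary layout mask."""
    mask = np.zeros((h, w), dtype=np.float32)
    for box in boxes:
        x0, y0, x1, y1 = box["xyxy"]
        mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

# Stage 2: these masks would condition the video diffusion model so the
# generated object follows the LLM-planned trajectory (model call omitted).
layout_masks = np.stack([boxes_to_mask(f["boxes"]) for f in llm_layout["frames"]])
print(layout_masks.shape)  # (3, 64, 64)
```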
DreamGaussian can generate high-quality textured meshes from a single-view image in just 2 minutes. It uses a 3D Gaussian Splatting model for fast mesh extraction and texture refinement.
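To get a feel for what a 3D Gaussian Splatting representation actually optimizes, here is a rough sketch of the per-Gaussian parameters; the names, counts and initializations are illustrative assumptions, and the real DreamGaussian pipeline (differentiable rendering, diffusion-guided loss, mesh extraction, texture refinement) is far more involved.

```python
# Rough sketch of the learnable parameters in a 3D Gaussian Splatting scene.
import torch

num_gaussians = 5000
gaussians = torch.nn.ParameterDict({
    "positions": torch.nn.Parameter(torch.randn(num_gaussians, 3) * 0.5),   # xyz centers
    "scales":    torch.nn.Parameter(torch.full((num_gaussians, 3), -3.0)),  # log-scale per axis
    "rotations": torch.nn.Parameter(torch.randn(num_gaussians, 4)),         # quaternions
    "colors":    torch.nn.Parameter(torch.rand(num_gaussians, 3)),          # RGB
    "opacities": torch.nn.Parameter(torch.zeros(num_gaussians, 1)),         # pre-sigmoid
})
optimizer = torch.optim.Adam(gaussians.parameters(), lr=1e-2)

# Each optimization step would render the Gaussians from a random camera and
# score the image against the single input view plus a 2D diffusion prior;
# a textured mesh is then extracted from the converged Gaussians and refined.
# Rendering and losses are omitted in this sketch.
```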
AnimeInbet is a method that can generate in-between frames for cartoon line drawings. Seeing this, we’ll hopefully be blessed with higher-framerate anime in the near future.
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation can generate diverse and realistic videos that match natural audio samples. It uses a lightweight adaptor network to improve alignment and visual quality compared to other methods.
Show-1 can generate high-quality videos with accurate text-video alignment. It uses only 15 GB of GPU memory during inference, which is much less than the 72 GB needed by traditional models.
PGDiff can restore and colorize faces from low-quality images by using details from high-quality images. It effectively fixes issues like scratches and blurriness.
Generative Repainting can paint 3D assets using text prompts. It uses pretrained 2D diffusion models and 3D neural radiance fields to create high-quality textures for various 3D shapes.
TECA can generate realistic 3D avatars from text descriptions. It combines traditional 3D meshes for faces and bodies with neural radiance fields (NeRF) for hair and clothing, allowing for high-quality, editable avatars and easy feature transfer between them.
InstaFlow can generate high-quality images in just one step, achieving an FID of 23.3 on MS COCO 2017-5k. It runs at about 0.09 seconds per image and uses far less compute than traditional multi-step diffusion models.
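Conceptually, InstaFlow builds on rectified flow: a distilled network predicts a velocity that carries pure noise to a clean latent in a single Euler step. The toy network and latent size below are placeholders for illustration only, not the actual distilled Stable Diffusion UNet.

```python
# One-step sampling in the rectified-flow spirit: a single Euler step.
import torch

class ToyVelocityNet(torch.nn.Module):
    """Stand-in for the distilled model that predicts the flow velocity."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

dim = 256                       # toy latent size; real SD latents are 4x64x64
velocity = ToyVelocityNet(dim)
z = torch.randn(1, dim)         # start from pure Gaussian noise
x0 = z + velocity(z)            # a single Euler step along the learned flow
# With a properly trained, distilled velocity field, x0 is already a clean
# latent, so sampling costs one network evaluation instead of dozens of
# denoising steps (hence the ~0.09 s per image).
```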
ProPainter is a new video inpainting method that can remove objects, complete masked regions, remove watermarks and even expand a video’s field of view.
Another video synthesis model that caught my eye this week is Reuse and Diffuse. This text-to-video framework can generate additional frames from an initial video clip by reusing and iterating over the original latent features. Can’t wait to give this one a try.
SyncDreamer is able to generate multiview-consistent images from a single-view image and thus is able to generate 3D models from 2D designs and hand drawings. It wasn’t able to help me in my quest to turn my PFP into a 3D avatar, but someday I’ll get there!
Hierarchical Masked 3D Diffusion Model for Video Outpainting can fill in missing parts at the edges of video frames while keeping the motion smooth. It conditions on multiple frames in a coarse-to-fine, hierarchical manner, which reduces error accumulation and improves temporal consistency.
Total Selfie can generate high-quality full-body selfies from close-up selfies and background images. It uses a diffusion-based approach to combine these inputs, creating realistic images in desired poses and overcoming the limits of traditional selfies.
While ZeroScope, Gen-2, PikaLabs and others have brought us high-resolution text- and image-to-video, they all suffer from unsmooth transitions, crude motion and disordered action sequences. The new Dysen-VDM tries to tackle these issues, and while nowhere near perfect, it delivers some promising results.
Scenimefy can turn real-world images and videos into high-quality anime scenes. It uses a semi-supervised image-to-image translation approach that preserves fine scene details and produces cleaner results than other tools.
StableVideo is yet another vid2vid method. It’s not just style transfer, though: the method can differentiate between foreground and background when editing a video, making it possible to reimagine the subject within an entirely different landscape.
CoDeF can process videos consistently by using a canonical content field to gather static content and a temporal deformation field to track changes over time. This allows it to perform tasks like video-to-video translation and track moving objects, such as water and smog, without needing extra training.
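A minimal sketch of the CoDeF idea, assuming toy MLPs in place of the hash-grid-encoded fields the paper uses: a deformation field maps each frame's pixels into one shared canonical image, so an edit made once in canonical space propagates to every frame.

```python
# Canonical content field + temporal deformation field, in miniature.
import torch

class DeformationField(torch.nn.Module):
    """Maps (x, y, t) -> (x', y') coordinates in the canonical image."""
    def __init__(self):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))

    def forward(self, xyt):
        return xyt[..., :2] + self.mlp(xyt)   # predict an offset per pixel

class CanonicalField(torch.nn.Module):
    """Maps canonical (x', y') -> RGB of the shared static content."""
    def __init__(self):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))

    def forward(self, xy):
        return torch.sigmoid(self.mlp(xy))

deform, canonical = DeformationField(), CanonicalField()
xyt = torch.rand(1024, 3)          # sampled (x, y, t) coordinates of one frame
rgb = canonical(deform(xyt))       # reconstructed colors for those pixels
# Training fits both fields to the video; editing the canonical image (e.g. a
# one-off style transfer) then carries over consistently to every frame.
```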
CLE Diffusion can enhance low-light images by letting users control brightness levels and choose specific areas for improvement. It uses an illumination embedding and the Segment-Anything Model (SAM) for precise and natural-looking enhancements.
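Roughly, the pieces fit together like this: a SAM-produced mask selects the region to enhance, and the desired brightness level is embedded and fed to the enhancement network as conditioning. Everything below (the embedding scheme, shapes and placeholder mask) is an illustrative assumption rather than the paper's actual modules.

```python
# Sketch of region-controlled low-light enhancement with a brightness embedding.
import torch

def illumination_embedding(target_brightness: float, dim: int = 32) -> torch.Tensor:
    """Sinusoidal embedding of the desired brightness level."""
    freqs = torch.arange(dim // 2, dtype=torch.float32)
    angles = target_brightness * (2.0 ** freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)])

image = torch.rand(3, 256, 256)          # low-light input image
mask = torch.zeros(1, 256, 256)          # region chosen via SAM (placeholder box here)
mask[:, 64:192, 64:192] = 1.0
cond = illumination_embedding(0.8)       # "make this region this bright"

# The enhancement network would take (image, mask, cond) and brighten only the
# masked region toward the requested level, leaving the rest untouched.
print(image.shape, mask.shape, cond.shape)
```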