AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
While ZeroScope, Gen-2, PikaLabs and others have brought us high-resolution text- and image-to-video, they all suffer from choppy transitions, crude motion, and disordered action sequences. The new Dysen-VDM tries to tackle those issues and, while nowhere near perfect, delivers some promising results.
Scenimefy can turn real-world images and videos into high-quality anime scenes. It uses a semi-supervised image-to-image translation approach that preserves important scene details and produces more faithful results than previous tools.
StableVideo is yet another vid2vid method. This one is not just a style transfer, though: it can differentiate between foreground and background when editing a video, making it possible to reimagine the subject within an entirely different landscape.
CoDeF can process videos consistently by using a canonical content field to aggregate the static content of a video and a temporal deformation field to track how it changes over time. This allows it to perform tasks like video-to-video translation and to track even non-rigid objects, such as water and smog, without needing extra training.
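The core idea is compact enough to sketch: one network stores the static canonical image, another maps each frame's pixel coordinates (plus time) back into that canonical space. Below is a minimal PyTorch illustration of that split, not the authors' implementation; network sizes and names are placeholder choices.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # small coordinate MLP stand-in for the paper's more elaborate field encodings
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

canonical_field = mlp(2, 3)    # canonical (x, y) -> RGB of the static content
deformation_field = mlp(3, 2)  # (x, y, t) -> offset into canonical space

def render_frame(coords_xy, t):
    """coords_xy: (N, 2) pixel coordinates in [0, 1]; t: scalar frame time."""
    t_col = torch.full_like(coords_xy[:, :1], float(t))
    offset = deformation_field(torch.cat([coords_xy, t_col], dim=-1))
    canonical_xy = coords_xy + offset       # warp the pixel into canonical space
    return canonical_field(canonical_xy)    # read the static content there
```

Both fields are fit per video with a reconstruction loss; once trained, an edit applied to the canonical content propagates consistently to every frame through the deformation field.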
CLE Diffusion can enhance low-light images by letting users control brightness levels and choose specific areas for improvement. It uses an illumination embedding and the Segment-Anything Model (SAM) for precise and natural-looking enhancements.
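The SAM part of that pipeline is easy to demonstrate in isolation. The sketch below only shows region selection with the segment-anything API, with a naive brightness gain standing in for CLE Diffusion's learned enhancement; the checkpoint path and gain value are placeholders.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

def enhance_region(image, click_xy, gain=1.8):
    """image: HxWx3 uint8 RGB; click_xy: (x, y) point on the area to brighten."""
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click_xy]),
        point_labels=np.array([1]),        # 1 = foreground click
        multimask_output=False,
    )
    mask = masks[0][..., None]             # HxWx1 boolean region mask
    boosted = np.clip(image.astype(np.float32) * gain, 0, 255).astype(np.uint8)
    return np.where(mask, boosted, image)  # enhance only inside the selected region
```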
Similar to ControlNet and Composer, IP-Adapter is a multi-modal guidance adapter that adds image-prompt support to Stable Diffusion and works with custom models trained on the same base model. The results look amazing.
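If you want to try image prompting without the project's own scripts, recent diffusers releases ship IP-Adapter loading directly; the sketch below assumes such a version plus a GPU, and the scale value is just a starting point.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the image prompt steers generation

reference = load_image("reference.png")  # placeholder image-prompt file
result = pipe(
    prompt="a cozy cabin in the woods",
    ip_adapter_image=reference,          # the image prompt itself
    num_inference_steps=30,
).images[0]
result.save("ip_adapter_result.png")
```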
Semantics2Hands can retarget realistic hand motions between different avatars while keeping the details of the movements. It uses an anatomy-based semantic matrix and a semantics reconstruction network to achieve high-quality hand motion transfer.
PlankAssembly can turn 2D line drawings from three views into 3D CAD models. It effectively handles noisy or incomplete inputs and improves accuracy using shape programs.
AudioLDM 2 can generate high-quality audio in different forms, like text-to-audio and image-to-audio. It uses a shared "language of audio" representation learned with self-supervised pretraining to reach state-of-the-art results on standard benchmarks.
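AudioLDM 2 is available through diffusers, so a hedged text-to-audio example looks roughly like this (model id, step count, and the 16 kHz output rate are assumptions based on the published checkpoints):

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    "gentle rain on a tin roof with distant thunder",
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]

scipy.io.wavfile.write("rain.wav", rate=16000, data=audio)  # assumed sample rate
```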
AudioSep can separate audio events and musical instruments and enhance speech, all driven by natural language queries. It performs well in open-domain audio source separation, significantly surpassing previous models.
3D Gaussian Splatting can create high-quality 3D scenes in real-time at 1080p resolution with over 30 frames per second. It uses 3D Gaussians for efficient scene representation and a fast rendering method, achieving competitive training times while maintaining great visual quality.
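The rendering trick at the heart of the method is plain front-to-back alpha compositing of projected, depth-sorted Gaussians. This toy per-pixel version ignores the tile-based CUDA rasterizer and all the optimization machinery; the field names are made up for illustration.

```python
import numpy as np

def composite_pixel(pixel_xy, gaussians):
    """gaussians: list of dicts sorted near-to-far, each with a 2D 'mean',
    2x2 'cov', scalar 'opacity', and RGB 'color' (hypothetical field names)."""
    color = np.zeros(3)
    transmittance = 1.0
    for g in gaussians:
        d = pixel_xy - g["mean"]
        weight = np.exp(-0.5 * d @ np.linalg.inv(g["cov"]) @ d)  # 2D Gaussian falloff
        alpha = min(g["opacity"] * weight, 0.999)
        color += transmittance * alpha * np.asarray(g["color"])
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:   # early exit once the pixel is effectively opaque
            break
    return color
```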
RIP expensive low-light cameras? It’s amazing how AI can solve problems that so far required better hardware. In this example, the novel LED model is able to denoise low-light images after being trained on only 6 pairs of images. The results are impressive, but the team is not done yet: they’re currently researching a method that works on a wide variety of scenarios trained on only 2 pairs.
LP-MusicCaps can generate high-quality music captions using large language models (LLMs).
DWPose is a pose estimator that uses a two-stage distillation approach to improve pose-estimation accuracy.
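DWPose's exact two-stage recipe is more involved, but the basic heatmap-distillation step it builds on can be sketched in a few lines; the loss weighting and model interfaces here are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, gt_heatmaps, alpha=0.5):
    """One training step: the small student matches both the ground-truth
    keypoint heatmaps and the frozen teacher's predictions."""
    with torch.no_grad():
        teacher_heatmaps = teacher(images)
    student_heatmaps = student(images)
    task_loss = F.mse_loss(student_heatmaps, gt_heatmaps)
    distill_loss = F.mse_loss(student_heatmaps, teacher_heatmaps)
    return alpha * distill_loss + (1 - alpha) * task_loss
```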
WavJourney is a system that uses large language models to generate audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. The demo results, while not perfect, sound great.
Interpolating between Images with Diffusion Models can generate smooth transitions between two images using latent diffusion models. It allows for high-quality results across different styles and subjects while using CLIP to select the best images for interpolation.
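Two ingredients do most of the work: spherical interpolation between the two images' noised latents, and CLIP ranking of the decoded candidates. The slerp helper below is the standard formulation; wiring it into a specific diffusion pipeline and CLIP ranker is left out.

```python
import torch

def slerp(z0, z1, t):
    """Spherical interpolation between two latent tensors at fraction t in [0, 1]."""
    a, b = z0.flatten(), z1.flatten()
    cos_omega = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm()), -1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega.abs() < 1e-6:                 # nearly parallel: fall back to a plain lerp
        return (1 - t) * z0 + t * z1
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * z0 + (torch.sin(t * omega) / so) * z1
```

For each intermediate step the method decodes several candidates and keeps the one CLIP ranks best, which is what the selection mentioned above refers to.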
TokenFlow is a new video-to-video method for temporally coherent, text-driven video editing. We’ve seen a lot of these, but this one looks extremely good, with almost no flickering, and requires no fine-tuning whatsoever.
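Heavily simplified, TokenFlow's consistency trick is to edit a few keyframes and then propagate their diffusion features to the remaining frames along nearest-neighbor correspondences computed on the original video's features. A conceptual sketch of that propagation step (not the authors' code; tensor shapes are assumed):

```python
import torch
import torch.nn.functional as F

def propagate_edited_features(edited_key_feats, source_key_feats, source_frame_feats):
    """All tensors are (num_tokens, dim). Each token of the current frame copies
    the edited feature of its nearest neighbor among the source keyframe tokens."""
    sims = F.normalize(source_frame_feats, dim=1) @ F.normalize(source_key_feats, dim=1).T
    nn_idx = sims.argmax(dim=1)          # correspondences from the *original* video
    return edited_key_feats[nn_idx]      # edited features, propagated consistently
```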
FABRIC can condition diffusion models on feedback images to improve image quality. This method allows users to personalize content through multiple feedback rounds without needing training.
AnimateDiff is a new framework that brings video generation to the Stable Diffusion pipeline. This means you can generate videos with any existing Stable Diffusion model without having to fine-tune or train anything. Pretty amazing. @DigThatData put together a Google Colab notebook in case you want to give it a try.
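Besides the Colab, newer diffusers releases bundle an AnimateDiff pipeline built around a motion adapter; the snippet below assumes such a version plus a GPU, and the checkpoint ids are commonly used community ones rather than anything specific to this project writeup.

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",      # any SD 1.5-based model should work here
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)

frames = pipe(
    prompt="a rocket launching into a starry sky, cinematic lighting",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
).frames[0]
export_to_gif(frames, "rocket.gif")
```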
Text2Cinemagraph can create cinemagraphs from text descriptions, animating elements like flowing rivers and drifting clouds. It combines artistic images with realistic ones to accurately show motion, outperforming other methods in generating cinemagraphs for natural and artistic scenes.