AI Toolbox
A curated collection of 959 free cutting-edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.
Similar to ControlNet and Composer, IP-Adapter is a multi-modal guidance adapter that adds image-prompt support to Stable Diffusion and works with custom models trained from the same base model. The results look amazing.
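As a rough illustration, here is a minimal sketch of using IP-Adapter through its diffusers integration (the repository id, weight file name and adapter scale below are assumptions based on the public model card, not taken from the paper itself):

```python
# Minimal sketch: attaching IP-Adapter to a Stable Diffusion pipeline via
# diffusers. Repo/weight names are assumptions; check the IP-Adapter model card.
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the image-prompt adapter on top of the frozen base model.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the image prompt steers the result

reference = load_image("style_reference.png")  # hypothetical local reference image
image = pipe(
    prompt="a cat sitting on a windowsill",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("ip_adapter_result.png")
```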
Semantics2Hands can retarget realistic hand motions between different avatars while keeping the details of the movements. It uses an anatomy-based semantic matrix and a semantics reconstruction network to achieve high-quality hand motion transfer.
PlankAssembly can turn 2D line drawings from three views into 3D CAD models. It effectively handles noisy or incomplete inputs and improves accuracy using shape programs.
AudioLDM 2 can generate high-quality audio in different forms, like text-to-audio and image-to-audio. It builds on a shared, self-supervised audio representation to achieve state-of-the-art performance on standard benchmarks.
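If you want to try it, AudioLDM 2 ships as a diffusers pipeline; a minimal text-to-audio sketch (checkpoint id assumed from the model card) looks roughly like this:

```python
# Minimal sketch of text-to-audio with the AudioLDM 2 diffusers pipeline.
# The checkpoint id is an assumption; see the AudioLDM 2 model card.
import torch
import scipy.io.wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    prompt="gentle rain on a tin roof with distant thunder",
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]

# The pipeline returns a 16 kHz waveform as a NumPy array.
scipy.io.wavfile.write("rain.wav", rate=16000, data=audio)
```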
AudioSep can separate audio events and musical instruments while enhancing speech using natural language queries. It performs well in open-domain audio source separation, significantly surpassing previous models.
3D Gaussian Splatting can create high-quality 3D scenes in real-time at 1080p resolution with over 30 frames per second. It uses 3D Gaussians for efficient scene representation and a fast rendering method, achieving competitive training times while maintaining great visual quality.
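At the heart of the renderer, each pixel is shaded by depth-sorting the Gaussians that cover it and alpha-compositing them front to back. Here is an illustrative sketch of that compositing step (plain Python, not the authors' CUDA rasterizer; names and values are placeholders):

```python
# Illustrative sketch of front-to-back alpha compositing of depth-sorted
# Gaussian splats for a single pixel. Not the paper's actual implementation.
import numpy as np

def composite_pixel(colors, alphas):
    """colors: (N, 3) RGB of Gaussians sorted near-to-far;
    alphas: (N,) opacity of each Gaussian evaluated at this pixel."""
    pixel = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c   # contribution weighted by remaining visibility
        transmittance *= (1.0 - a)       # light left after passing this splat
        if transmittance < 1e-4:         # early termination once the pixel is opaque
            break
    return pixel

colors = np.array([[1.0, 0.2, 0.2], [0.2, 0.2, 1.0], [0.9, 0.9, 0.9]])
alphas = np.array([0.6, 0.5, 0.8])
print(composite_pixel(colors, alphas))
```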
RIP expensive low-light cameras? It's amazing how AI is able to solve problems that so far were only possible to tackle with better hardware. In this example, the novel LED model is able to denoise low-light images after being trained on only 6 pairs of images. The results are impressive, but the team is not done yet: they're currently researching a method that works across a wide variety of scenarios trained on only 2 pairs.
LP-MusicCaps can generate high-quality music captions using large language models (LLMs).
DWPose is a whole-body pose estimator that uses a two-stage distillation approach to improve pose estimation accuracy.
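The distillation idea in a nutshell: a compact student network learns to reproduce the keypoint heatmaps of a larger teacher in addition to fitting the ground truth. A toy sketch of such a loss (loss weights and tensor shapes are placeholders, not DWPose's actual code):

```python
# Toy sketch of heatmap-level pose distillation: the student matches both the
# ground-truth heatmaps and the teacher's predictions. Not DWPose's real code.
import torch
import torch.nn.functional as F

def distillation_loss(student_heatmaps, teacher_heatmaps, gt_heatmaps, alpha=0.5):
    loss_gt = F.mse_loss(student_heatmaps, gt_heatmaps)       # supervised term
    loss_kd = F.mse_loss(student_heatmaps, teacher_heatmaps)  # mimic the teacher
    return (1 - alpha) * loss_gt + alpha * loss_kd

# Example with random tensors shaped (batch, keypoints, height, width):
s = torch.rand(2, 17, 64, 48)
t = torch.rand(2, 17, 64, 48)
g = torch.rand(2, 17, 64, 48)
print(distillation_loss(s, t, g).item())
```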
WavJourney is a system that uses large language models to generate audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. The demo results, while not perfect, sound great.
Interpolating between Images with Diffusion Models can generate smooth transitions between two images using latent diffusion models. It allows for high-quality results across different styles and subjects while using CLIP to select the best images for interpolation.
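The underlying trick is to interpolate in the diffusion model's latent space rather than in pixel space, then rank candidates with CLIP. A conceptual sketch of the spherical interpolation step (helper names are illustrative, not from the paper's codebase):

```python
# Conceptual sketch: spherical interpolation (slerp) between two image latents.
# In the paper's setup, frames generated from interpolated latents are then
# ranked with CLIP similarity; this only shows the slerp step.
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float, eps: float = 1e-7):
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos_omega = torch.dot(z0_flat / z0_flat.norm(), z1_flat / z1_flat.norm())
    omega = torch.arccos(cos_omega.clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / torch.sin(omega)

z_a, z_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
midpoint = slerp(z_a, z_b, 0.5)  # latent halfway between the two images
```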
TokenFlow is a new video-to-video method for temporally coherent, text-guided video editing. We've seen a lot of them, but this one looks extremely good with almost no flickering and requires no fine-tuning whatsoever.
FABRIC can condition diffusion models on feedback images to improve image quality. This method lets users personalize content through multiple feedback rounds without any additional training.
AnimateDiff is a new framework that brings video generation to the Stable Diffusion pipeline, meaning you can generate videos with any existing Stable Diffusion model without having to fine-tune or train anything. Pretty amazing. @DigThatData put together a Google Colab notebook in case you want to give it a try.
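For reference, AnimateDiff also has a diffusers pipeline; a minimal sketch looks roughly like this (the motion-adapter checkpoint and pipeline names are assumptions based on the diffusers docs, and the Colab above uses the original repository instead):

```python
# Minimal sketch of AnimateDiff via diffusers: a pretrained motion adapter is
# plugged into an ordinary Stable Diffusion checkpoint to produce frames.
# Checkpoint ids are assumptions; see the AnimateDiff model cards.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

frames = pipe(
    prompt="a sailboat drifting across a calm lake at sunset",
    num_frames=16,
    num_inference_steps=25,
).frames[0]
export_to_gif(frames, "sailboat.gif")
```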
Text2Cinemagraph can create cinemagraphs from text descriptions, animating elements like flowing rivers and drifting clouds. It combines artistic images with realistic ones to accurately show motion, outperforming other methods in generating cinemagraphs for natural and artistic scenes.
CSD-Edit is a multi-modal editing approach that, unlike other methods, works well on images larger than the traditional 512x512 limit and can edit 4K or large panorama images. It also offers improved temporal consistency across video frames and improved view consistency when editing or generating 3D scenes.
Similar to ControlNet scribble for images, SketchMetaFace brings sketch guidance to the 3D realm and makes it possible to turn a sketch into a 3D face model. Pretty excited about progress like this, as it brings controllability to 3D generation and makes creating 3D content far more accessible.
NIS-SLAM can reconstruct high-fidelity surfaces and geometry from RGB-D frames. It also learns 3D consistent semantic representations during this process.
DreamDiffusion can generate high-quality images from brain EEG signals without needing to translate thoughts into text. It uses pre-trained text-to-image models and special techniques to handle noise and individual differences, making it a key step towards affordable thoughts-to-image technology.
MotionGPT can generate, caption, and predict human motion by treating it like a language. It achieves top performance in these tasks, making it useful for various motion-related applications.