AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
CSD-Edit is a multi-modal editing approach that, unlike most methods, works well on images beyond the traditional 512x512 limit and can edit 4K or large panorama images. It also offers improved temporal consistency across video frames and improved view consistency when editing or generating 3D scenes.
Similar to ControlNet scribble for images, SketchMetaFace brings sketch guidance to the 3D realm and makes it possible to turn a sketch into a 3D face model. Pretty excited about progress like this, as it will bring controllability to 3D generation and make creating 3D content way more accessible.
NIS-SLAM can reconstruct high-fidelity surfaces and geometry from RGB-D frames. It also learns 3D-consistent semantic representations during this process.
DreamDiffusion can generate high-quality images from brain EEG signals without needing to translate thoughts into text. It uses pre-trained text-to-image models and special techniques to handle noise and individual differences, making it a key step towards affordable thoughts-to-image technology.
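A hedged sketch of the idea, not the authors' code: an EEG encoder (pretrained with masked signal modeling in the paper) compresses the raw signal into a few tokens, which are then projected to the width the diffusion model's cross-attention expects, standing in for CLIP text embeddings. All shapes and module choices below are assumptions.

```python
import torch
import torch.nn as nn

eeg = torch.randn(1, 128, 512)                  # (batch, EEG channels, time samples)

# Stand-in for the pretrained EEG encoder: downsample the signal
# and pool it into a small set of "EEG tokens".
encoder = nn.Sequential(
    nn.Conv1d(128, 256, kernel_size=9, stride=4), nn.GELU(),
    nn.Conv1d(256, 256, kernel_size=9, stride=4), nn.GELU(),
    nn.AdaptiveAvgPool1d(16),                   # pool to 16 tokens
)
project = nn.Linear(256, 1024)                  # match a CLIP-like token width

tokens = project(encoder(eeg).transpose(1, 2))  # (1, 16, 1024) conditioning
print(tokens.shape)
```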
MotionGPT can generate, caption, and predict human motion by treating it like a language. It achieves top performance in these tasks, making it useful for various motion-related applications.
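A minimal sketch of "motion as language", assuming a learned codebook like MotionGPT's motion tokenizer: each pose frame is snapped to its nearest codebook entry, yielding discrete motion tokens a language model can generate and caption like words. Sizes are illustrative.

```python
import torch

codebook = torch.randn(512, 64)                    # 512 motion "words", 64-dim codes
poses = torch.randn(1, 120, 64)                    # 120 frames of pose features

dists = torch.cdist(poses, codebook.unsqueeze(0))  # (1, 120, 512) pairwise distances
motion_tokens = dists.argmin(dim=-1)               # (1, 120) discrete token ids
print(motion_tokens[0, :10])
```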
DiffSketcher is a tool that can turn words into vectorized free-hand sketches. The method also lets you set the level of abstraction, allowing for more abstract or more concrete results.
Diffusion with Forward Models is able to reconstruct 3D scenes from a single input image. It can also add small, short motions to images with people in them.
It’s said that our eyes hold the universe. According to the method in the paper Seeing the World through Your Eyes, they at least hold a 3D scene: it can reconstruct 3D scenes beyond the camera’s line of sight from portrait images containing eye reflections.
We’ve already seen a few attempts at bringing ControlNet to video, but getting temporal coherence right seems to be a tricky issue to solve. ControlVideo is the next attempt, and the results are starting to look extremely promising.
Neuralangelo can reconstruct detailed 3D surfaces from RGB video captures. It uses multi-resolution 3D hash grids and neural surface rendering, achieving high fidelity without needing extra depth inputs.
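To get a feel for the Instant-NGP-style encoding Neuralangelo builds on, here is a toy multi-resolution hash-grid lookup (nearest vertex only, no trilinear interpolation, simplified hash function); this is an illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HashGridEncoding(nn.Module):
    def __init__(self, n_levels=4, table_size=2**14, feat_dim=2, base_res=16):
        super().__init__()
        # One learnable feature table per resolution level.
        self.resolutions = [base_res * 2**i for i in range(n_levels)]
        self.tables = nn.ParameterList(
            [nn.Parameter(torch.randn(table_size, feat_dim) * 1e-2)
             for _ in range(n_levels)]
        )
        self.primes = torch.tensor([1, 2654435761, 805459861])

    def forward(self, xyz):                             # xyz in [0,1]^3, shape (N, 3)
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            idx = (xyz * res).long()                    # nearest grid vertex per point
            h = (idx * self.primes).sum(-1) % len(table)  # spatial hash into the table
            feats.append(table[h])
        return torch.cat(feats, dim=-1)                 # (N, n_levels * feat_dim)

feats = HashGridEncoding()(torch.rand(1024, 3))
print(feats.shape)                                      # torch.Size([1024, 8])
```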
VideoComposer can generate videos with control over how they look and move using text, sketches, and motion vectors. It improves video quality by keeping frames temporally consistent, allowing for flexible video creation and editing.
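A hypothetical sketch of this kind of condition fusion, under the assumption that each control signal is projected into a shared feature space and summed before steering the video diffusion UNet; layer names and tensor sizes are illustrative, not the paper's API.

```python
import torch
import torch.nn as nn

d = 320
sketch = torch.randn(1, 1, 16, 32, 32)   # per-frame sketch maps (B, C, T, H, W)
motion = torch.randn(1, 2, 16, 32, 32)   # motion vectors (2 channels: dx, dy)

sketch_proj = nn.Conv3d(1, d, kernel_size=3, padding=1)
motion_proj = nn.Conv3d(2, d, kernel_size=3, padding=1)

# Fuse the per-modality projections into one spatio-temporal control signal.
cond = sketch_proj(sketch) + motion_proj(motion)
print(cond.shape)                        # torch.Size([1, 320, 16, 32, 32])
```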
Cocktail is a pipeline for guided image generation. Compared to ControlNet, it only requires a single generalized model for multiple modalities such as edge, pose, and mask guidance.
Make-Your-Video can generate customized videos from text and depth information for better control over content. It uses a Latent Diffusion Model to improve video quality and reduce the need for computing power.
Now, motion capture is cool. But what if you want your 3D characters to move in new and unique ways? GenMM can generate a variety of movements from just a single example sequence or a few of them. Unlike other methods, it doesn’t need exhaustive training and can create new motions with complex skeletons in a fraction of a second. It’s also a whiz at jobs motion matching alone can’t handle, like motion completion, guided generation from keyframes, infinite looping, and motion reassembly.
Humans in 4D can track and reconstruct humans in 3D from a single video. It handles unusual poses and poor visibility well, using a transformer-based network called HMR 2.0 to improve action recognition.
There is a new text-to-image player in town called RAPHAEL. The model aims to generate highly artistic images that accurately portray text prompts encompassing multiple nouns, adjectives, and verbs. This is all great, but only if someone actually releases the model for open-source consumption, as the community is craving a model that can achieve Midjourney quality.
Super-Resolution of License Plate Images Using Attention Modules and Sub-Pixel Convolution Layers can enhance low-resolution license plate images. It uses attention and transformer modules to improve details and a special loss function based on Optical Character Recognition to achieve better image quality.
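Here is a minimal sub-pixel convolution head of the kind named in the paper's title: a convolution expands the channels by scale**2, and nn.PixelShuffle rearranges them into spatial resolution. The rest of the network (attention and transformer modules, the OCR-based loss) is omitted, and the channel counts are assumptions.

```python
import torch
import torch.nn as nn

class SubPixelSR(nn.Module):
    def __init__(self, channels=64, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale**2, 3, padding=1),
            nn.PixelShuffle(scale),   # (B, 3*s*s, H, W) -> (B, 3, s*H, s*W)
        )

    def forward(self, x):
        return self.upsample(self.body(x))

lr = torch.randn(1, 3, 24, 48)        # low-resolution plate crop
print(SubPixelSR()(lr).shape)         # torch.Size([1, 3, 48, 96])
```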
Break-A-Scene can extract multiple concepts from a single image using segmentation masks. It allows users to re-synthesize individual concepts or combinations in different contexts, enhancing scene generation with a two-phase customization process.
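An illustrative sketch, not the authors' code: the key trick is restricting the diffusion reconstruction loss to each concept's segmentation mask, so a learned token only has to explain its own region of the shared image. Latent shapes here are assumptions.

```python
import torch

noise_pred = torch.randn(1, 4, 64, 64)   # UNet prediction in latent space
noise      = torch.randn(1, 4, 64, 64)   # true noise added at this timestep
mask       = torch.zeros(1, 1, 64, 64)   # one concept's mask at latent resolution
mask[..., 16:48, 16:48] = 1.0

# Only pixels inside the concept's mask contribute to the loss.
masked_loss = ((noise_pred - noise) ** 2 * mask).sum() / mask.sum()
print(masked_loss)
```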
Voyager can explore the Minecraft world on its own and learn new skills. It uses an automatic curriculum to improve exploration and achieves 3.3 times more unique items and 15.3 times faster tech tree mastery compared to previous methods.
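The automatic-curriculum loop might look something like the sketch below; `llm` and the success check are hypothetical stand-ins for Voyager's GPT-4 calls and in-game code execution, not its real API.

```python
def llm(prompt: str) -> str:
    return "craft a wooden pickaxe"            # placeholder LLM proposal

skills, inventory = [], ["log"]
for _ in range(3):
    # Ask the LLM for the next task given current progress.
    task = llm(f"Inventory: {inventory}. Skills: {skills}. Propose the next task.")
    success = True                             # would execute generated code in Minecraft
    if success:
        skills.append(task)                    # verified behavior joins the skill library
print(skills)
```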
Sin3DM can generate high-quality variations of 3D objects from a single textured shape. It uses a diffusion model to learn how parts of the object fit together, enabling retargeting, outpainting, and local editing.