AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
DiffSketcher is a tool that turns words into vectorized free-hand sketches. The method also lets you define the level of abstraction, allowing for more abstract or more concrete generations.
Diffusion with Forward Models is able to reconstruct 3D scenes from a single input image. Additionally, it can add small, short motions to images with people in them.
It’s said that our eyes hold the universe. According to the method discussed in the paper Seeing the World through Your Eyes, they at least hold a 3D scene: the method can reconstruct 3D scenes beyond the camera’s line of sight from portrait images containing eye reflections.
We’ve already seen a few attempts at bringing ControlNet to video, but getting temporal coherency right seems to be a tricky issue to solve. ControlVideo is the next attempt, and things start to look extremely promising.
Neuralangelo can reconstruct detailed 3D surfaces from RGB video captures. It uses multi-resolution 3D hash grids and neural surface rendering, achieving high fidelity without needing extra depth inputs.
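For a sense of what a multi-resolution hash grid looks like in practice, here is a minimal PyTorch sketch of an Instant-NGP-style encoder: each resolution level keeps a small learnable feature table that is indexed by hashing voxel corners and trilinearly interpolated. The level count, table size, and hashing scheme below are illustrative defaults, not Neuralangelo's exact configuration.

```python
import torch
import torch.nn as nn

class HashGridEncoding(nn.Module):
    # Minimal sketch of a multi-resolution hash grid encoder (Instant-NGP style);
    # hyperparameters and hashing details are illustrative, not Neuralangelo's exact setup.
    def __init__(self, num_levels=8, features_per_level=2,
                 log2_hashmap_size=15, base_res=16, max_res=512):
        super().__init__()
        self.num_levels = num_levels
        self.hashmap_size = 2 ** log2_hashmap_size
        # Geometric progression of grid resolutions from base_res to max_res.
        growth = (max_res / base_res) ** (1 / max(num_levels - 1, 1))
        self.resolutions = [int(base_res * growth ** i) for i in range(num_levels)]
        # One learnable feature table per level.
        self.tables = nn.ParameterList([
            nn.Parameter(torch.randn(self.hashmap_size, features_per_level) * 1e-4)
            for _ in range(num_levels)
        ])
        # Large primes used for spatial hashing of integer grid coordinates.
        self.register_buffer("primes", torch.tensor([1, 2654435761, 805459861]))

    def hash(self, coords):
        # coords: (..., 3) integer grid coordinates -> hash table indices.
        return (coords * self.primes).sum(-1) % self.hashmap_size

    def forward(self, x):
        # x: (N, 3) points in [0, 1]^3 -> (N, num_levels * features_per_level) features.
        feats = []
        for level, res in enumerate(self.resolutions):
            scaled = x * res
            floor = scaled.floor().long()
            frac = scaled - floor.float()
            # Trilinear interpolation over the 8 corners of the enclosing voxel.
            interp = 0
            for corner in range(8):
                offset = torch.tensor([(corner >> i) & 1 for i in range(3)],
                                      device=x.device)
                idx = self.hash(floor + offset)
                w = torch.where(offset.bool(), frac, 1 - frac).prod(-1, keepdim=True)
                interp = interp + w * self.tables[level][idx]
            feats.append(interp)
        return torch.cat(feats, dim=-1)
```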
VideoComposer can generate videos with control over how they look and move using text, sketches, and motion vectors. It improves video quality by keeping frames temporally consistent, allowing for flexible video creation and editing.
Cocktail is a pipeline for guided image generation. Compared to ControlNet, it only requires one generalized model for multiple modalities like edge, pose, and mask guidance.
Make-Your-Video can generate customized videos from text and depth information for better control over content. It uses a Latent Diffusion Model to improve video quality and reduce compute requirements.
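To see why working in latent space reduces compute, here is a hedged sketch of one latent-diffusion training step: frames are compressed by a VAE encoder, noise is added to the much smaller latents, and the denoiser learns to predict that noise. The `vae_encoder`, `denoiser`, and `cond` arguments are hypothetical stand-ins, not Make-Your-Video's actual modules.

```python
import torch
import torch.nn.functional as F

def latent_diffusion_step(vae_encoder, denoiser, frames, cond, num_timesteps=1000):
    # Hedged sketch of a generic latent-diffusion training step; modules and the
    # noise schedule below are illustrative, not the paper's exact implementation.
    with torch.no_grad():
        latents = vae_encoder(frames)          # e.g. 8x spatially downsampled latents
    noise = torch.randn_like(latents)
    t = torch.randint(0, num_timesteps, (latents.shape[0],), device=latents.device)
    # Simple linear-beta schedule; real implementations precompute alphas_cumprod once.
    alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, num_timesteps), dim=0)
    a = alphas_cumprod.to(latents.device)[t].view(-1, *([1] * (latents.dim() - 1)))
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    pred = denoiser(noisy, t, cond)            # conditioned on e.g. text and depth
    return F.mse_loss(pred, noise)
```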
Now, motion capture is cool. But what if you want your 3D characters to move in new and unique ways? GenMM can generate a variety of movements from just one or a few example sequences. Unlike other methods, it doesn’t need exhaustive training and can create new motions for complex skeletons in a fraction of a second. It’s also a whiz at jobs you couldn’t do with motion matching alone, like motion completion, guided generation from keyframes, infinite looping, and motion reassembly.
Humans in 4D can track and reconstruct humans in 3D from a single video. It handles unusual poses and poor visibility well, using a transformer-based network called HMR 2.0 to improve action recognition.
There is a new text-to-image player called RAPHAEL in town. The model aims to generate highly artistic images that accurately portray text prompts encompassing multiple nouns, adjectives, and verbs. This is all great, but only if someone actually releases the model as open source, since the community is craving a model that can achieve Midjourney quality.
Super-Resolution of License Plate Images Using Attention Modules and Sub-Pixel Convolution Layers can enhance low-resolution license plate images. It uses attention and transformer modules to improve details and a special loss function based on Optical Character Recognition to achieve better image quality.
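Sub-pixel convolution is the standard trick behind this kind of upsampling: a convolution produces scale² times as many channels, and PixelShuffle rearranges them into a higher-resolution image. The block below is a minimal ESPCN-style sketch with illustrative channel sizes, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    # Minimal sub-pixel convolution block; channel counts and scale are illustrative.
    def __init__(self, in_channels=64, out_channels=3, scale=2):
        super().__init__()
        # The convolution emits scale^2 * out_channels feature maps,
        # which PixelShuffle rearranges into a (scale x) larger image.
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# Usage: upscale a batch of 64-channel feature maps from 24x12 to 48x24.
feats = torch.randn(1, 64, 24, 12)
print(SubPixelUpsampler()(feats).shape)  # torch.Size([1, 3, 48, 24])
```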
Break-A-Scene can extract multiple concepts from a single image using segmentation masks. It allows users to re-synthesize individual concepts or combinations in different contexts, enhancing scene generation with a two-phase customization process.
Voyager can explore the Minecraft world on its own and learn new skills. It uses an automatic curriculum to improve exploration and achieves 3.3 times more unique items and 15.3 times faster tech tree mastery compared to previous methods.
Sin3DM can generate high-quality variations of 3D objects from a single textured shape. It uses a diffusion model to learn how parts of the object fit together, enabling retargeting, outpainting, and local editing.
Control-A-Video can generate controllable text-to-video content using diffusion models. It allows for fine-tuned customization with edge and depth maps, ensuring high quality and consistency in the videos.
Text2NeRF can generate 3D scenes from text descriptions by combining neural radiance fields (NeRF) with a text-to-image diffusion model. It creates high-quality textures and detailed shapes without needing extra training data, achieving better photo-realism and multi-view consistency than other methods.
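The NeRF side of such a pipeline typically passes 3D coordinates through a frequency positional encoding before an MLP predicts density and color. Below is a minimal sketch of that standard encoding; the frequency count is a common default, not necessarily Text2NeRF's setting.

```python
import torch

def positional_encoding(x, num_freqs=10):
    # NeRF-style frequency encoding for 3D points.
    # x: (N, 3) coordinates -> (N, 3 + 2 * 3 * num_freqs) encoded features.
    freqs = 2.0 ** torch.arange(num_freqs)                            # (F,)
    angles = x[..., None] * freqs                                     # (N, 3, F)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (N, 3, 2F)
    return torch.cat([x, enc.flatten(-2)], dim=-1)
```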
DragGAN can manipulate images by letting users drag points to change the pose, shape, and layout of objects. It produces realistic results even when parts of the image are hidden or deformed.
DragonDiffusion can edit images by moving, resizing, and changing the appearance of objects without needing to retrain the model. It lets users drag points on images for easy and precise editing.
FastComposer can generate personalized images of multiple unseen individuals in various styles and actions without fine-tuning. It is 300x-2500x faster than traditional methods and requires no extra storage for new subjects, using subject embeddings and localized attention to keep identities clear.
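To illustrate the general idea behind localized attention, here is a hedged sketch in which image patches inside a subject's segmentation mask may only attend to that subject's text tokens. The shapes and hard-masking scheme are illustrative and not FastComposer's exact formulation.

```python
import torch
import torch.nn.functional as F

def localized_cross_attention(img_tokens, txt_tokens, subject_masks, subject_token_ids):
    # Illustrative localized cross-attention (not FastComposer's exact mechanism).
    # img_tokens:        (N, d) flattened image patch features
    # txt_tokens:        (T, d) text token embeddings
    # subject_masks:     (S, N) boolean, True where patch n belongs to subject s
    # subject_token_ids: list of S lists of text-token indices for each subject
    d = img_tokens.shape[-1]
    scores = img_tokens @ txt_tokens.T / d ** 0.5                  # (N, T)
    # Start with all tokens allowed, then restrict patches inside subject regions.
    allowed = torch.ones_like(scores, dtype=torch.bool)
    for s, token_ids in enumerate(subject_token_ids):
        region = subject_masks[s]                                   # (N,)
        only_subject = torch.zeros(scores.shape[1], dtype=torch.bool,
                                   device=scores.device)
        only_subject[token_ids] = True
        # Patches in this subject's region may only attend to its text tokens.
        allowed[region] &= only_subject
    scores = scores.masked_fill(~allowed, float("-inf"))
    attn = F.softmax(scores, dim=-1)                                # (N, T)
    return attn @ txt_tokens                                        # (N, d)
```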