AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.
DragAnything can control the motion of any object in videos by letting users draw trajectory lines. It allows for separate motion control of multiple objects, including backgrounds.
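To make the interaction concrete, here is a hypothetical sketch (not the project's actual API) of the kind of input drag-based control implies: one drawn polyline per entity, resampled to one control point per output frame.

```python
import numpy as np

# One trajectory per entity the user wants to move, background included.
trajectories = {
    "car": [(120, 340), (150, 338), (185, 335), (220, 333)],         # drifts right
    "background": [(400, 200), (392, 200), (384, 200), (376, 200)],  # slow pan left
}

def resample(points: list[tuple[float, float]], num_frames: int) -> list[tuple[float, float]]:
    """Linearly resample a drawn polyline to one control point per video frame."""
    pts = np.asarray(points, dtype=float)
    t_src = np.linspace(0.0, 1.0, len(pts))
    t_dst = np.linspace(0.0, 1.0, num_frames)
    return list(zip(np.interp(t_dst, t_src, pts[:, 0]),
                    np.interp(t_dst, t_src, pts[:, 1])))

per_frame_controls = {name: resample(path, num_frames=16)
                      for name, path in trajectories.items()}
```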
DEADiff can synthesize images that combine the style of a reference image with text prompts. It uses a Q-Former mechanism to separate style and meaning.
VideoElevator is a training-free, plug-and-play method that enhances the temporal consistency of text-to-video models and adds more photorealistic detail by leveraging text-to-image models.
ELLA is a lightweight approach that equips existing CLIP-based diffusion models with an LLM, improving prompt understanding and enabling text-to-image models to comprehend long, dense prompts.
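The underlying idea, very roughly: a small learned connector turns frozen-LLM token features into timestep-dependent conditioning for the diffusion U-Net. A hypothetical sketch (module names and shapes are illustrative, not ELLA's actual code):

```python
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    """Illustrative stand-in for ELLA's connector: learned queries, shifted by
    the diffusion timestep embedding, attend over frozen LLM token features to
    produce conditioning tokens for the U-Net's cross-attention."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768, num_tokens: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, cond_dim))
        self.proj = nn.Linear(llm_dim, cond_dim)
        self.time_mlp = nn.Sequential(nn.Linear(cond_dim, cond_dim), nn.SiLU())
        self.attn = nn.MultiheadAttention(cond_dim, num_heads=8, batch_first=True)

    def forward(self, llm_feats: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # llm_feats: (B, T, llm_dim) hidden states from the frozen LLM.
        # t_emb:     (B, cond_dim) diffusion timestep embedding.
        kv = self.proj(llm_feats)
        q = self.queries.unsqueeze(0) + self.time_mlp(t_emb).unsqueeze(1)
        out, _ = self.attn(q, kv, kv)
        return out  # (B, num_tokens, cond_dim), fed to cross-attention
```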
SplattingAvatar can generate photorealistic real-time human avatars using a mix of Gaussian Splatting and triangle mesh geometry. It achieves over 300 FPS on modern GPUs and 30 FPS on mobile devices, allowing for detailed appearance modeling and various animation techniques.
The PixArt model family got a new addition with PixArt-Σ. The model is capable of directly generating images at 4K resolution. Compared to its predecessor, PixArt-α, it offers images of higher fidelity and improved alignment with text prompts.
UniCtrl can improve the quality and consistency of videos made by text-to-video models. It enhances how frames connect and move together without needing extra training, making videos look better and more diverse in motion.
TripoSR can generate high-quality 3D meshes from a single image in under 0.5 seconds.
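The model is open source; usage looks roughly like the sketch below, assuming the `tsr` package from the public repository (checkpoint and method names are assumptions, check the README):

```python
import torch
from PIL import Image
from tsr.system import TSR  # from the open-source TripoSR repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint and file names assumed from the public release.
model = TSR.from_pretrained(
    "stabilityai/TripoSR", config_name="config.yaml", weight_name="model.ckpt"
)
model.to(device)

image = Image.open("chair.png")
scene_codes = model([image], device=device)  # single feed-forward pass
meshes = model.extract_mesh(scene_codes)     # mesh extraction from the implicit field
meshes[0].export("chair_mesh.obj")
```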
ResAdapter can generate images with any resolution and aspect ratio for diffusion models. It works with various personalized models and processes images efficiently, using only 0.5M parameters while keeping the original style.
ViewDiff is a method that can generate high-quality, multi-view consistent images of a real-world 3D object in authentic surroundings from a single text prompt or a single posed image.
While LCM and Turbo have unlocked near real-time image diffusion, the quality is still a bit lacking. TCD, on the other hand, manages to generate images with both clarity and detailed intricacy without compromising on speed.
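Since TCD ships as a scheduler plus a distilled LoRA, it plugs straight into an existing pipeline. A minimal sketch using diffusers (checkpoint ids assumed from the public release):

```python
import torch
from diffusers import StableDiffusionXLPipeline, TCDScheduler

# Load SDXL and swap in the TCD scheduler.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

# Distilled TCD LoRA for SDXL (repo id assumed from the project release).
pipe.load_lora_weights("h1t/TCD-SDXL-LoRA")
pipe.fuse_lora()

# Few-step sampling; eta trades off stochasticity against detail at low step counts.
image = pipe(
    "a portrait of a red fox in a library, highly detailed",
    num_inference_steps=4,
    guidance_scale=0.0,
    eta=0.3,
).images[0]
image.save("tcd_sample.png")
```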
OHTA can create detailed and usable hand avatars from just one image. It allows for text-to-avatar conversion and editing of hand textures and shapes, using data-driven hand priors to improve accuracy with limited input.
SongComposer can generate both lyrics and melodies using symbolic song representations. It aligns lyrics and melodies precisely and outperforms advanced models like GPT-4 in creating songs.
GEM3D is a deep, topology-aware generative model of 3D shapes. The method is able to generate diverse and plausible 3D shapes from user-modeled skeletons, making it possible to draw the rough structure of an object and have the model fill in the rest.
Multi-LoRA Composition focuses on the integration of multiple Low-Rank Adaptations (LoRAs) to create highly customized and detailed images. The approach is able to generate images with multiple elements without fine-tuning and without losing detail or image quality.
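For contrast, the standard baseline simply merges adapters with fixed scales, as in the diffusers adapter API below (model path and LoRA repos are placeholders); the paper's LoRA Switch and LoRA Composite instead keep each LoRA intact and combine their influence step by step during denoising.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Two independently trained LoRAs loaded under distinct adapter names
# (repo ids are placeholders, not from the paper).
pipe.load_lora_weights("path/to/character_lora", adapter_name="character")
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")

# Naive composition: both adapters active at once with fixed weights.
pipe.set_adapters(["character", "style"], adapter_weights=[0.8, 0.7])

image = pipe("a knight in watercolor style", num_inference_steps=30).images[0]
```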
MeshFormer can generate high-quality 3D textured meshes from just a few 2D images in seconds.
SPA-RP can create 3D textured meshes and estimate camera positions from one or a few 2D images. It uses 2D diffusion models to quickly understand 3D space, achieving high-quality results in about 20 seconds.
SCG can be used by musicians to compose and improvise new piano pieces. It allows musicians to guide music generation by using rules like following a simple I-V chord progression in C major. Pretty cool.
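As a toy illustration of the rule-guided idea (everything below is hypothetical, not the paper's algorithm): a symbolic rule scores sampled candidate bars, and the best-scoring candidate is kept, steering generation without any gradients.

```python
# Target chords for a simple I-V progression in C major (MIDI pitches).
C_MAJOR_I = {60, 64, 67}  # C4, E4, G4
C_MAJOR_V = {67, 71, 74}  # G4, B4, D5

def rule_score(bar: set[int], chord: set[int]) -> float:
    """Fraction of the target chord's tones present in a generated bar."""
    return len(bar & chord) / len(chord)

def pick_bar(candidates: list[set[int]], chord: set[int]) -> set[int]:
    """Among sampled candidate bars, keep the one that best fits the rule."""
    return max(candidates, key=lambda bar: rule_score(bar, chord))

progression = [C_MAJOR_I, C_MAJOR_V, C_MAJOR_I, C_MAJOR_V]
candidates = [{60, 64, 67, 72}, {61, 66, 70}, {64, 67}]
best = pick_bar(candidates, progression[0])  # -> {60, 64, 67, 72}
```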
[FlashTex](https://flashtex.github.io) can texture an input 3D mesh given a user-provided text prompt. These generated textures can also be relit properly in different lighting environments.
Visual Style Prompting can generate images with a specific style from a reference image. Compared to other methods like IP-Adapter and LoRAs, Visual Style Prompting is better at retaining the style of the reference image while avoiding style leakage from text prompts.
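Mechanically, the method swaps the keys and values of selected self-attention layers with those computed from the reference image. A minimal sketch of that core operation (tensor names and shapes illustrative):

```python
import torch

def style_swapped_self_attention(q_gen: torch.Tensor,
                                 k_ref: torch.Tensor,
                                 v_ref: torch.Tensor) -> torch.Tensor:
    """Queries from the image being generated attend to the reference image's
    keys and values (shapes: (batch, tokens, dim)), so style carries over
    while content stays driven by the prompt."""
    scale = q_gen.shape[-1] ** -0.5
    attn = torch.softmax(q_gen @ k_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref
```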