AI Toolbox
A curated collection of 811 free, cutting-edge AI papers with code, plus tools for text, image, video, 3D, and audio generation and manipulation.

SceneFactor generates 3D scenes from text using an intermediate 3D semantic map. This map can be edited to add, remove, resize, and replace objects, allowing for easy regeneration of the final 3D scene.
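The editable-semantic-map idea can be illustrated with a toy sketch (this is not SceneFactor's actual API): the intermediate representation is just a coarse 3D grid of class labels, so "editing the scene" reduces to stamping or clearing regions of the grid before regeneration.

```python
import numpy as np

# Hypothetical semantic map: a 3D grid of integer class labels (0 = empty)
# that a user edits before the final scene is regenerated from it.
EMPTY, CHAIR, TABLE = 0, 1, 2

def add_object(sem_map, label, corner, size):
    """Stamp an axis-aligned box of `label` into the semantic map."""
    x, y, z = corner
    dx, dy, dz = size
    sem_map[x:x+dx, y:y+dy, z:z+dz] = label
    return sem_map

def remove_object(sem_map, label):
    """Clear every voxel carrying `label`."""
    sem_map[sem_map == label] = EMPTY
    return sem_map

sem_map = np.zeros((16, 16, 16), dtype=np.int8)
add_object(sem_map, CHAIR, corner=(2, 0, 2), size=(3, 3, 3))
add_object(sem_map, TABLE, corner=(8, 0, 8), size=(4, 2, 4))
remove_object(sem_map, CHAIR)  # edit: delete the chair, keep the table
assert (sem_map == CHAIR).sum() == 0
assert (sem_map == TABLE).sum() == 4 * 2 * 4
```

Resizing or replacing an object is the same pattern: clear the old label, stamp a new box.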
LT3SD can generate large-scale 3D scenes using a coarse-to-fine approach that captures both overall structure and fine detail. It supports flexible output sizes, produces high-quality scenes, and can complete missing parts of a scene.
ReStyle3D can transfer the look of a style image to real-world scenes from different angles. It keeps the structure and details intact, making it great for interior design and virtual staging.
BAGEL is a unified multimodal model that can understand and generate images and text, excelling in tasks like image editing and predicting future frames. Basically the open-source version of GPT-4o.
Uni3C is a video generation method that supports both camera control and human motion control.
4K4DGen can turn a single panorama image into an immersive 4D environment with 360-degree views at 4K resolution. The method is able to animate the scene and optimize a set of 4D Gaussians using efficient splatting techniques for real-time exploration.
PixelHacker can perform image inpainting with strong structural and semantic consistency. It uses a diffusion-based model trained on a dataset of 14 million image-mask pairs, achieving better texture, shape, and color consistency than other methods.
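One reason mask-based diffusion inpainting preserves structure so well is the compositing step common to such pipelines (a generic sketch, not PixelHacker's actual code): generated pixels are only used inside the mask, so everything outside it is kept verbatim.

```python
import numpy as np

# Generic mask-compositing step used by diffusion inpainting pipelines:
# mask = 1 where the model should paint, 0 where the original is kept.
def composite(original, generated, mask):
    mask = mask[..., None]  # broadcast the 2D mask over the channel axis
    return mask * generated + (1 - mask) * original

h, w = 4, 4
original = np.zeros((h, w, 3))   # stand-in for the input image
generated = np.ones((h, w, 3))   # stand-in for the model's output
mask = np.zeros((h, w))
mask[1:3, 1:3] = 1               # inpaint only the central 2x2 region
out = composite(original, generated, mask)
assert out[0, 0].sum() == 0.0    # outside the mask: original pixels kept
assert out[1, 1].sum() == 3.0    # inside the mask: generated pixels used
```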
MVPainter can generate high-quality 3D textures by aligning reference textures with geometry.
MoCha can generate talking character animations from speech and text, allowing for multi-character conversations with turn-based dialogue.
RealisDance-DiT can generate high-quality character animations from images and pose sequences. It handles challenges like character-object interactions and complex gestures while requiring only minimal changes to the Wan-2.1 video model, and it serves as a component of the Uni3C method.
RealCam-I2V can generate high-quality videos from real-world images with consistent, parametric camera control.
HunyuanPortrait can animate characters from a single portrait image by using facial expressions and head poses from video clips. It achieves lifelike animations with high consistency and control, effectively separating appearance and motion.
Custom SVG can generate high-quality SVGs from text prompts with customizable styles.
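A toy sketch (unrelated to the Custom SVG model itself) of why SVG is a convenient generation target: a scene is just structured text, so style attributes such as fill color can be swapped without touching the geometry.

```python
# Build a minimal SVG document programmatically; the style (fill) can be
# edited independently of the shape parameters.
def circle(cx, cy, r, fill):
    return f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="{fill}"/>'

def render(shapes, size=100):
    body = "".join(shapes)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{size}" height="{size}">{body}</svg>')

doc = render([circle(50, 50, 20, "crimson")])
restyled = doc.replace('fill="crimson"', 'fill="navy"')  # style edit, same geometry
assert doc.startswith("<svg") and doc.endswith("</svg>")
assert 'fill="navy"' in restyled and 'r="20"' in restyled
```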
ObjectCarver can segment, reconstruct, and separate 3D objects from a single view using just user-input clicks, eliminating the need for segmentation masks.
Marigold can estimate depth, predict surface normals, and decompose intrinsic images by repurposing a pretrained diffusion model with minimal changes.
MTVCrafter can generate high-quality human image animations from 3D motion sequences.
PA-VDM can generate high-quality videos up to 1 minute long at 24 frames per second.
LegoGPT can generate stable and buildable LEGO designs from text prompts. It uses physics-aware techniques to ensure designs are safe for manual assembly and robotic construction, and it can create colored and textured models.
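A heavily simplified illustration of the "buildable" idea (not LegoGPT's actual physics model, which is far more thorough): every unit-height brick must rest on the ground or overlap at least one brick in the layer directly below it.

```python
# Each brick is (x, y, z, w, d): position, layer index z, and footprint w x d.
def overlaps(a, b):
    """Do two axis-aligned footprints (x, y, w, d) overlap in the plane?"""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad

def is_buildable(bricks):
    """True if every brick sits on the ground or on a brick one layer below."""
    for x, y, z, w, d in bricks:
        if z == 0:
            continue
        below = [(bx, by, bw, bd) for bx, by, bz, bw, bd in bricks if bz == z - 1]
        if not any(overlaps((x, y, w, d), b) for b in below):
            return False
    return True

tower = [(0, 0, 0, 2, 2), (1, 1, 1, 2, 2)]     # offset but overlapping: fine
floating = [(0, 0, 0, 2, 2), (5, 5, 1, 2, 2)]  # second brick has no support
assert is_buildable(tower) is True
assert is_buildable(floating) is False
```

A real stability check would also account for torque, connection strength, and multi-brick load paths; the point here is only that buildability is a checkable constraint on the generated layout.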
SVAD can generate high-quality 3D avatars from a single image. It keeps the person’s identity and details consistent across different poses and angles while allowing for real-time rendering.
PrimitiveAnything can generate high-quality 3D shapes from 3D models, text, and images by breaking complex forms down into simple geometric primitives. It uses a shape-conditioned primitive transformer to keep the resulting shapes accurate and diverse.