AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
Phidias can generate high-quality 3D assets from text, images, and 3D references. It uses a method called reference-augmented diffusion to improve quality and speed, achieving results in just a few seconds.
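The core idea of reference-augmented diffusion is easy to sketch even without the paper's code: the denoiser cross-attends to features from a retrieved or user-provided 3D reference on top of its usual conditioning. A minimal toy sketch in PyTorch — all module names and shapes here are illustrative, not Phidias's actual API:

```python
# Toy sketch of reference-augmented conditioning (NOT Phidias's actual code).
# The denoiser attends to tokens from a 3D reference in addition to the noisy latent.
import torch
import torch.nn as nn

class ReferenceAugmentedDenoiser(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # cross-attend to reference
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x, ref_tokens):
        # x: noisy latent tokens (B, N, dim); ref_tokens: features of the 3D reference (B, M, dim)
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.ref_attn(x, ref_tokens, ref_tokens)[0]  # inject reference guidance
        return x + self.mlp(x)

denoiser = ReferenceAugmentedDenoiser()
out = denoiser(torch.randn(1, 64, 256), torch.randn(1, 32, 256))
print(out.shape)  # torch.Size([1, 64, 256])
```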
EventEgo3D++ can capture 3D human motion using a monocular event camera with a fisheye lens. It works well in low-light and high-speed conditions, providing real-time 3D pose updates at 140Hz with higher accuracy than RGB-based methods.
Cyberpunk brain dances are becoming a thing! D-NPC can turn videos into dynamic neural point clouds, aka 4D scenes, making it possible to watch a scene from another perspective.
Distill Any Depth can generate depth maps from images.
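For a sense of what running a monocular depth model looks like in practice, here's a hedged sketch using Hugging Face's depth-estimation pipeline. The pipeline task is real; the checkpoint below is a Depth Anything V2 stand-in, so swap in the actual Distill Any Depth weights from the project's repo:

```python
# Hedged usage sketch: the "depth-estimation" pipeline is a real transformers API,
# but the checkpoint is a stand-in — use the Distill Any Depth weights from its repo.
from transformers import pipeline

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
result = depth("photo.jpg")               # accepts an image path or a PIL.Image
result["depth"].save("photo_depth.png")   # "depth" is a PIL image of the predicted map
```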
GHOST 2.0 is a deepfake method that can transfer heads from one image to another while keeping the skin color and structure intact.
FreeTimeGS can reconstruct dynamic 3D scenes in real-time using Gaussian primitives that can appear at different times and places.
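The distinguishing trick is that each Gaussian carries its own time window and velocity, so primitives can appear and fade anywhere in spacetime rather than being deformed from a single canonical frame. A toy NumPy sketch of that idea (illustrative parameterization, not the paper's code):

```python
# Toy sketch of time-conditioned Gaussian primitives (illustrative, not FreeTimeGS's code).
# Each Gaussian has a canonical time, a lifespan, and a velocity: opacity peaks at its
# canonical time and the center drifts linearly around it.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
position = rng.normal(size=(N, 3))          # canonical 3D centers
velocity = rng.normal(scale=0.1, size=(N, 3))
t_center = rng.uniform(0.0, 1.0, size=N)    # when each primitive is "alive"
lifespan = rng.uniform(0.05, 0.3, size=N)
base_opacity = rng.uniform(0.5, 1.0, size=N)

def primitives_at(t):
    """Return centers and opacities of all Gaussians at time t."""
    pos = position + velocity * (t - t_center)[:, None]                     # linear motion
    alpha = base_opacity * np.exp(-0.5 * ((t - t_center) / lifespan) ** 2)  # temporal opacity window
    return pos, alpha

pos, alpha = primitives_at(0.5)
print(pos.shape, round(float(alpha.max()), 3))
```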
KV-Edit can edit images while keeping the background consistent. It allows users to add, remove, or change objects without needing extra training, ensuring high image quality.
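The training-free trick is in the name: key-value pairs for background tokens are cached during inversion and reused during denoising, so the edited region attends to a frozen background that cannot drift. A conceptual sketch with toy shapes (not the official implementation):

```python
# Conceptual sketch of the KV-cache trick behind KV-Edit (illustrative, not official code).
# Background keys/values are cached at inversion time; edited foreground tokens attend
# to those frozen KVs, keeping the background pixel-consistent.
import torch

def attention(q, k, v):
    w = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return w @ v

tokens = torch.randn(1, 16, 64)            # 16 image tokens
bg_mask = torch.zeros(16, dtype=torch.bool)
bg_mask[:12] = True                         # tokens 0-11 belong to the background

# Inversion pass: cache the background keys/values once.
k_cache = tokens[:, bg_mask].clone()
v_cache = tokens[:, bg_mask].clone()

# Editing pass: only foreground tokens are re-denoised, but they still attend
# to the *cached* background KVs instead of freshly recomputed ones.
fg = torch.randn(1, 4, 64)                  # new foreground tokens being edited
k = torch.cat([k_cache, fg], dim=1)
v = torch.cat([v_cache, fg], dim=1)
print(attention(fg, k, v).shape)            # torch.Size([1, 4, 64])
```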
Any2AnyTryon can generate high-quality virtual try-on results, transferring garments onto images of people as well as reconstructing garments from real-world photos.
NotaGen can generate high-quality classical sheet music.
UniCon can handle different image generation tasks using a single framework. It adapts a pretrained image diffusion model with only about 15% extra parameters and supports most base ControlNet transformations.
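The ~15% figure is just a parameter-count ratio between the trainable add-ons and the frozen backbone. A back-of-envelope sketch with hypothetical layer sizes, not UniCon's real architecture:

```python
# Back-of-envelope adapter sketch (hypothetical sizes, not UniCon's code): freeze the
# pretrained backbone, train only small side modules, and measure the overhead.
import torch.nn as nn

backbone = nn.Sequential(*[nn.Linear(512, 512) for _ in range(20)])  # stands in for a frozen UNet/DiT
for p in backbone.parameters():
    p.requires_grad = False

adapters = nn.ModuleList(
    [nn.Sequential(nn.Linear(512, 38), nn.GELU(), nn.Linear(38, 512)) for _ in range(20)]
)  # trainable, narrow side modules

base = sum(p.numel() for p in backbone.parameters())
extra = sum(p.numel() for p in adapters.parameters())
print(f"adapter overhead: {extra / base:.1%}")  # ~15.0% with these toy sizes
```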
MatAnyone can generate stable and high-quality human video matting masks.
SongGen can generate both vocals and accompaniment from text prompts using a single-stage auto-regressive transformer. It allows users to control lyrics, genre, mood, and instrumentation, and offers mixed mode for combined tracks or dual-track mode for separate tracks.
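The difference between the two modes is purely in the target token stream: mixed mode predicts codec tokens of the pre-mixed audio, while dual-track mode predicts vocal and accompaniment tokens separately, e.g. interleaved into one sequence. A toy illustration (hypothetical tokens, not SongGen's real tokenizer):

```python
# Toy illustration of SongGen's two output modes (hypothetical tokens, shapes only).
vocals = ["v0", "v1", "v2", "v3"]   # codec tokens for the vocal track
accomp = ["a0", "a1", "a2", "a3"]   # codec tokens for the accompaniment

mixed = [f"m{i}" for i in range(4)]  # mixed mode: one stream of pre-mixed audio tokens

# Dual-track mode: both streams, here interleaved into a single autoregressive sequence.
interleaved = [tok for pair in zip(vocals, accomp) for tok in pair]
print(mixed)        # ['m0', 'm1', 'm2', 'm3']
print(interleaved)  # ['v0', 'a0', 'v1', 'a1', 'v2', 'a2', 'v3', 'a3']
```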
MagicArticulate can rig static 3D models and make them ready for animation. Works on both humanoid and non-humanoid objects.
MEGASAM can estimate camera parameters and depth maps from casual monocular videos.
Step-Video-T2V can generate high-quality videos up to 204 frames long using a 30B-parameter text-to-video model.
MIGE can generate images from text prompts and reference images and edit existing images based on instructions.
Cycle3D can generate high-quality, consistent 3D content from a single unposed image, improving texture consistency and multi-view coherence in the final 3D reconstruction.
LIFe-GoM can create animatable 3D human avatars from sparse multi-view images in under 1 second. It renders high-quality images at 95.1 frames per second.
DressRecon can create 3D human body models from single videos. It handles loose clothing and objects well, achieving high-quality results by combining general human-body priors with video-specific deformations.
Google DeepMind has been researching 4DiM, a cascaded diffusion model for 4D novel view synthesis. It can generate 3D scenes with temporal dynamics from a single image and a set of camera poses and timestamps.
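At a shape level, conditioning on camera poses and timestamps just means injecting both as extra embeddings into the denoiser. A hypothetical PyTorch sketch of that conditioning pattern (4DiM's actual cascaded architecture differs and isn't public in this form):

```python
# Shape-level sketch of pose-and-time conditioning for novel view synthesis
# (illustrative pattern only, not 4DiM's architecture).
import torch
import torch.nn as nn

class PoseTimeConditionedDenoiser(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.pose_embed = nn.Linear(12, dim)  # flattened 3x4 camera extrinsics
        self.time_embed = nn.Linear(1, dim)   # timestamp of the target frame
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, latent, pose, timestamp):
        # latent: (B, N, dim) noisy target-view tokens
        cond = self.pose_embed(pose) + self.time_embed(timestamp)  # (B, dim)
        return self.block(latent + cond[:, None, :])               # broadcast over tokens

model = PoseTimeConditionedDenoiser()
out = model(torch.randn(2, 64, 256), torch.randn(2, 12), torch.rand(2, 1))
print(out.shape)  # torch.Size([2, 64, 256])
```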