AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code, along with tools for text, image, video, 3D, and audio generation and manipulation.
InterActHuman can generate videos with multiple human characters by matching audio to each person.
Assembler can reconstruct complete 3D objects from part meshes and a reference image.
FLUX-IR can restore low-quality images to high quality by optimizing restoration paths with reinforcement learning.
MTV can create high-quality videos that match audio by separating it into speech, effects, and music tracks.
MeshPad can create and edit 3D meshes from 2D sketches. Users can easily add or delete mesh parts through simple sketch changes.
StyleSculptor can generate 3D assets from a content image and style images without needing extra training.
VividFace can swap faces in videos while keeping the original person’s look and expressions. It handles challenges like keeping the face consistent over time and working well with different angles and lighting.
Mask²DiT can generate long videos with multiple scenes by aligning video segments with text descriptions.
PINO can generate realistic interactions among groups of any size by breaking complex group actions down into simple pairwise motions. It uses pretrained diffusion models for two-person interactions and enforces realistic movement with physics-based rules, while allowing control over character speed and position.
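The pairwise decomposition idea can be illustrated with a minimal sketch: each character's group motion is composed from its two-person interactions, then a simple physics-style rule keeps characters apart. All function names and the "approach curve" stand-in for the two-person model are hypothetical, not PINO's actual code or API.

```python
# Hypothetical sketch of pairwise decomposition for group interactions.
# `pairwise_motion` is a toy stand-in for a pretrained two-person model.
import numpy as np

def pairwise_motion(a_pos, b_pos, steps=10):
    """Stand-in for a two-person interaction model: returns a trajectory
    for character `a` reacting to `b` (here, a simple approach curve)."""
    direction = b_pos - a_pos
    return np.array([a_pos + direction * (t / steps) * 0.5 for t in range(steps)])

def compose_group(positions, steps=10, min_dist=0.5):
    """Average each character's pairwise trajectories, then apply a
    physics-style minimum-distance rule between characters."""
    n = len(positions)
    trajs = np.zeros((n, steps, 2))
    for i in range(n):
        pair = [pairwise_motion(positions[i], positions[j], steps)
                for j in range(n) if j != i]
        trajs[i] = np.mean(pair, axis=0)
    # crude collision avoidance: push apart characters closer than min_dist
    for t in range(steps):
        for i in range(n):
            for j in range(i + 1, n):
                d = trajs[i, t] - trajs[j, t]
                dist = np.linalg.norm(d)
                if 0 < dist < min_dist:
                    push = d / dist * (min_dist - dist) / 2
                    trajs[i, t] += push
                    trajs[j, t] -= push
    return trajs

group = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
motions = compose_group(group)
print(motions.shape)  # (3, 10, 2): three characters, ten steps, 2D positions
```

The real system replaces both the toy trajectory model (with two-person diffusion) and the push-apart rule (with proper physics constraints), but the compose-then-constrain structure is the same.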
Motion-2-to-3 can generate realistic 3D human motions from text prompts using 2D motion data from videos. It improves motion diversity and efficiency by predicting consistent joint movements and root dynamics with a multi-view diffusion model.
OmniPart can generate 3D objects from a single image by planning their structure and then creating them.
DiffVSR can upscale and restore videos by improving their resolution while keeping details clear and stable across frames.
IntrinsiX can generate high-quality PBR maps from text descriptions. It helps with re-lighting, material editing, and texture generation, producing detailed and coherent images.
MeshMosaic can generate high-resolution 3D meshes with over 100,000 triangles. It breaks shapes into smaller patches for better detail and accuracy, outperforming other methods that usually handle only 8,000 faces.
Manipulation by Analogy can change audio textures by learning from paired speech examples. It lets users add, remove, or replace sounds, and it generalizes to real-world audio beyond speech.
Bokeh Diffusion can control defocus blur in text-to-image diffusion models by using a physical defocus blur parameter. It allows for flexible blur adjustments while preserving scene structure and supports real image editing through inversion.
Lyra can generate 3D scenes from a single image or video. It uses a method that allows real-time rendering and dynamic scene generation without needing multiple views for training.
RealisMotion can generate human videos with realistic motions by separating four key elements: the subject, background, movement path, and actions. It uses a 3D world coordinate system for better motion editing and employs text-to-video diffusion models for high-quality results.
CapStARE can achieve high accuracy in gaze estimation. It runs in real time at about 8 ms per frame and handles extreme head poses well, making it ideal for interactive systems.
Follow-Your-Click can animate specific regions of an image from a simple user click and a short motion prompt, and it lets users control the speed of the animation.