AI Toolbox
A curated collection of 719 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
FloVD can generate camera-controllable videos, using optical flow maps to represent camera and object motion.
MotionMatcher can customize text-to-video diffusion models using a reference video to transfer motion and camera framing to different scenes.
LayerAnimate can animate single anime frames from text prompts or interpolate between two frames with or without sketch-guidance. It allows users to adjust foreground and background elements separately.
StyleMaster can stylize videos by transferring artistic styles from images while keeping the original content clear.
MagicMotion can animate objects in videos by controlling their paths with masks, bounding boxes, and sparse boxes.
DeepMesh can generate high-quality 3D meshes from point clouds and images.
ARTalk can generate realistic 3D head motions, including lip synchronization, blinking, and facial expressions, from audio in real-time.
InfiniteYou can generate high-quality images with FLUX and retain a person’s identity.
StarVector can generate scalable vector graphics (SVG) code from pixel images.
Diptych Prompting can generate images of new subjects in specific contexts by treating text-to-image generation as an inpainting task.
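The diptych setup can be pictured as a two-panel canvas: the left panel holds the reference subject and stays fixed, while the right panel is masked out for the inpainting model to fill according to the text prompt. Below is a minimal NumPy sketch of that input construction (a hypothetical helper for illustration, not Diptych Prompting's actual code; the function name and mask convention are assumptions):

```python
import numpy as np

def make_diptych_inputs(reference, panel_size=512):
    """Build a diptych canvas and inpainting mask.

    Left panel: the reference subject image (kept fixed).
    Right panel: blank, marked for the inpainting model to repaint
    according to the text prompt.
    """
    h = w = panel_size
    canvas = np.zeros((h, 2 * w, 3), dtype=np.uint8)
    canvas[:, :w] = reference  # left panel stays as-is
    # Mask convention (assumed): 1 = region the model may repaint.
    mask = np.zeros((h, 2 * w), dtype=np.uint8)
    mask[:, w:] = 1
    return canvas, mask

# Usage with a dummy gray reference image:
ref = np.full((512, 512, 3), 128, dtype=np.uint8)
canvas, mask = make_diptych_inputs(ref)
```

The canvas and mask pair is exactly the shape of input a standard inpainting pipeline expects, which is what lets a text-to-image model generate the new subject in context.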
MotionStreamer can generate human motions based on text prompts and supports motion composition and longer motion generation. Also has a Blender plugin.
Thera can upscale images to super-resolution using neural heat fields that model a precise point spread function. This allows for correct anti-aliasing at any output size.
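The underlying idea of point-spread-function-aware resampling can be illustrated with a toy Gaussian PSF matched to the output pixel pitch: each output pixel averages the input under a blur kernel wide enough to suppress frequencies the output grid cannot represent. This is a minimal sketch of that principle only, not Thera's neural heat fields; the function and its `sigma_factor` parameter are assumptions for illustration:

```python
import numpy as np

def gaussian_psf_resample(img, scale, sigma_factor=0.5):
    """Resample a single-channel image with a Gaussian point spread
    function whose width scales with the output pixel pitch, so
    stronger downsampling gets a wider anti-aliasing blur."""
    h, w = img.shape
    out_h, out_w = int(h * scale), int(w * scale)
    sigma = sigma_factor / scale  # wider blur for smaller outputs
    # Output pixel centers mapped back into input coordinates.
    ys = (np.arange(out_h) + 0.5) / scale - 0.5
    xs = (np.arange(out_w) + 0.5) / scale - 0.5
    yy, xx = np.mgrid[0:h, 0:w]
    out = np.empty((out_h, out_w))
    for i, cy in enumerate(ys):
        for j, cx in enumerate(xs):
            wgt = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
            out[i, j] = (wgt * img).sum() / wgt.sum()  # normalized PSF average
    return out
```

Because the kernel width follows the scale factor, the same routine stays alias-free whether the target size is 2x, 4x, or fractional, which is the property Thera's continuous formulation provides.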
DreamRenderer extends FLUX with image content control using bounding boxes or masks.
InterMask can generate high-quality 3D human interactions from text descriptions. It captures complex movements between two people while also allowing for reaction generation without changing the model.
Photometric Inverse Rendering can recover light positions and reflections in images, including challenging shadows. It decomposes surface reflections more accurately than prior methods and works well on both synthetic and real images.
KDTalker can generate high-quality talking portraits from a single image and audio input. It captures fine facial details and achieves excellent lip synchronization using a 3D keypoint-based approach and a spatiotemporal diffusion model.
TreeMeshGPT can generate detailed 3D meshes from point clouds using Autoregressive Tree Sequencing. This technique allows for better mesh detail and achieves a 22% reduction in data size during processing.
Mobius can generate seamlessly looping videos from text descriptions.
DART can generate high-quality human motions in real-time, achieving over 300 frames per second on a single RTX 4090 GPU. It combines text inputs with spatial constraints, allowing for tasks like reaching waypoints and interacting with scenes.
MelQCD can create realistic audio tracks that match silent videos. It achieves high quality and synchronization by breaking down mel-spectrograms into different signal types and using a video-to-all (V2X) predictor.
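For context on the mel-spectrogram representation MelQCD operates on: a mel spectrogram bins the frequency axis on the perceptual mel scale rather than linearly in Hz. The standard HTK mel mapping and the band-edge construction it implies can be sketched as follows (this illustrates the representation only; `mel_band_edges` is a hypothetical helper, not MelQCD code, and MelQCD's actual signal-type decomposition is not shown):

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK mel-scale mapping used when building mel-spectrograms.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min=0.0, f_max=8000.0):
    """Band edges equally spaced on the mel scale: the frequency axis
    a mel filterbank (and hence a mel-spectrogram) is built on."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 2)
    return mel_to_hz(mels)
```

Equal spacing in mel gives narrow bands at low frequencies and wide bands at high frequencies, matching how hearing resolves pitch.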