AI Toolbox
A curated collection of 858 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.





Tora can generate high-quality videos with precise control over motion trajectories by integrating textual, visual, and trajectory conditions. It achieves high motion fidelity and allows for diverse video durations, aspect ratios, and resolutions, making it a versatile tool for video generation.
ObjectClear can remove objects from images along with their shadows and reflections. It uses an object-effect attention mechanism to remove foreground objects more cleanly while preserving the background, and it outperforms other methods, especially in complex scenes.
SketchSeg can segment raster sketches into layers, making it easy for artists to move, copy, or delete objects.
ReFlex can change the high-level features of an image based on a text prompt while keeping its main structure.
LongAnimation can create long-term animations with consistent colors.
Depth Anything at Any Condition can estimate depth from a single image in different lighting and weather conditions.
SketchColour can turn 2D animation sketches into fully colored frames.
Calligrapher can customize text images with artistic typography and a style injection framework.
SMS can stylize images with diffusion models. It targets the long-standing challenge of balancing effective style transfer with content preservation.
METEOR can generate orchestral music while allowing control over the texture of the accompaniment. It achieves high-quality music style transfer and lets users adjust melodies and textures at the bar and track levels.
ReferDINO can segment objects in videos using text descriptions. It improves accuracy with a special mask decoder and enhances understanding of movement over time.
XVerse can create high-quality images with multiple subjects that can be edited. It allows precise control over each subject’s pose, style, and lighting, while also reducing issues like attribute entanglement and artifacts.
Matrix-Game can generate high-quality interactive game worlds in Minecraft.
OmniAvatar can generate lifelike full-body avatar videos from audio. It offers accurate lip-syncing and natural movements, and allows for precise control over emotions and backgrounds.
GaVS can stabilize videos by reconstructing and rendering them in 3D.
Text-Aware Image Restoration can restore images and retain the accuracy of text in them.
ControlMM can generate high-quality motion in real time by using spatial control signals in a motion model. It is 20 times faster than other methods and supports body-part control, timeline control, and obstacle avoidance.
OmniSep can isolate clean soundtracks from mixed audio using text, image, and audio queries.
SwiftEdit can edit images from text prompts in just 0.23 seconds.
LayoutVLM can generate 3D layouts from text instructions. It improves how well layouts match the intended design and works effectively in crowded spaces.