AI Toolbox
A curated collection of 759 free cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

Phantom can generate videos that preserve a subject's identity from reference images while following text prompts.
PosterMaker can generate high-quality product posters by rendering text accurately and keeping the main subject clear.
IP-Composer can generate compositional images by using multiple input images and natural language prompts.
PhysFlow can simulate dynamic interactions in complex scenes. It identifies material types through image queries and enhances realism using video diffusion and a Material Point Method for detailed 4D representations.
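
The Material Point Method mentioned here is a classic hybrid particle/grid simulator. Below is a minimal, generic sketch of its particle-to-grid step in 2D, in the style of textbook MLS-MPM; the grid size, stiffness, and pressure model are illustrative assumptions, not PhysFlow's actual code or parameters.

```python
import numpy as np

n_grid, dx, dt = 32, 1.0 / 32, 1e-4
p_mass, p_vol, E = 1.0, (0.5 / 32) ** 2, 400.0   # particle mass, volume, stiffness
x = np.random.rand(64, 2) * 0.4 + 0.3            # particle positions in [0.3, 0.7]^2
v = np.zeros_like(x)                             # particle velocities
J = np.ones(len(x))                              # per-particle volume ratio

grid_v = np.zeros((n_grid, n_grid, 2))           # grid momentum (then velocity)
grid_m = np.zeros((n_grid, n_grid))              # grid mass

# particle-to-grid transfer (the affine velocity term is omitted for brevity)
for p in range(len(x)):
    base = (x[p] / dx - 0.5).astype(int)         # lower corner of the 3x3 stencil
    fx = x[p] / dx - base
    # quadratic B-spline weights for the three nodes per axis
    w = [0.5 * (1.5 - fx) ** 2, 0.75 - (fx - 1.0) ** 2, 0.5 * (fx - 0.5) ** 2]
    stress = -dt * 4 * E * p_vol * (J[p] - 1) / dx**2   # simple pressure model
    for i in range(3):
        for j in range(3):
            dpos = (np.array([i, j]) - fx) * dx
            weight = w[i][0] * w[j][1]
            node = (base[0] + i, base[1] + j)
            grid_v[node] += weight * (p_mass * v[p] + stress * dpos)
            grid_m[node] += weight * p_mass

# grid update: momentum -> velocity, then gravity
nz = grid_m > 0
grid_v[nz] /= grid_m[nz][:, None]
grid_v[nz, 1] -= dt * 9.8
# (grid-to-particle transfer and particle advection would follow here)
```
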
Hi3DGen can generate high-quality 3D shapes from 2D images. It uses a three-stage pipeline that bridges 2D images and 3D geometry through estimated normal maps, capturing fine details more faithfully than direct image-to-3D methods.
AniSDF can reconstruct high-quality 3D shapes with improved surface geometry. It can handle complex objects, including luminous, reflective, and fuzzy ones.
OmniCaptioner can generate detailed text descriptions for various types of content, such as images, math formulas, charts, user interfaces, PDFs, and videos.
ReCamMaster can re-capture videos from new camera angles.
GARF can reassemble 3D objects from real-world fractured parts.
TTT-Video can create coherent one-minute videos from text storyboards. As the title suggests, it relies on test-time-training layers rather than full self-attention over the long context to keep multi-scene stories consistent, which is quite the achievement. The paper is worth a read.
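
In a test-time-training layer, the "hidden state" is itself a small model that is updated by gradient descent on a self-supervised loss as each token streams in. Here is a minimal sketch of that idea with a linear inner model; the projections and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch

def ttt_linear(tokens, Wq, Wk, Wv, lr=0.1):
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                 # inner model = the layer's hidden state
    outputs = []
    for x in tokens:                      # process one token at a time
        k, v, q = Wk @ x, Wv @ x, Wq @ x
        # inner self-supervised loss: reconstruct the "value" view from the "key" view
        err = W @ k - v                   # gradient of 0.5 * ||W k - v||^2 w.r.t. W
        W = W - lr * torch.outer(err, k)  # one gradient step = state update
        outputs.append(W @ q)             # read out with the updated state
    return torch.stack(outputs)

d = 16
toks = torch.randn(8, d)
Wq, Wk, Wv = (torch.randn(d, d) / d**0.5 for _ in range(3))
out = ttt_linear(toks, Wq, Wk, Wv)        # (8, 16): one output per token
```
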
Piece it Together can combine different visual components into complete characters or objects. It uses a lightweight flow-matching model called IP-Prior to improve prompt adherence and enable diverse, context-aware generations.
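
For reference, a flow-matching prior like IP-Prior is trained with the standard conditional flow-matching objective: predict the velocity that carries noise toward data along a straight path. The sketch below shows that generic loss; the toy network and dimensions are assumptions, not Piece it Together's code.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))

def flow_matching_loss(x1):               # x1: batch of data samples, shape (B, 64)
    x0 = torch.randn_like(x1)             # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)        # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1            # point on the straight path
    target_v = x1 - x0                    # constant velocity along that path
    pred_v = model(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

loss = flow_matching_loss(torch.randn(32, 64))
loss.backward()                           # train the model with any optimizer
```
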
HORT can create detailed 3D point clouds of hand-held objects from just one photo.
AnyTop can generate motions for different characters using only their skeletal structure.
LoRA-MDM can generate stylized human motions, such as a “Chicken” style, by training low-rank adapters on a motion diffusion model from a few reference samples. It supports style blending and motion editing while balancing text fidelity and style consistency.
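
The LoRA idea underneath is simple: keep the pretrained weight frozen and learn a low-rank residual per style, so blending styles amounts to mixing adapters. A minimal sketch, with dimensions and scaling chosen for illustration rather than taken from LoRA-MDM:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False       # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op
        self.alpha = alpha

    def forward(self, x):
        # frozen path plus scaled low-rank residual (the learned "style")
        return self.base(x) + self.alpha * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), rank=4)
out = layer(torch.randn(2, 64))           # (2, 64)
```
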
UNO brings subject transfer and preservation from reference images to FLUX with a single model.
TokenHSI can enable physics-based characters to interact with their environment using a unified transformer-based policy. It adapts to new situations with variable-length inputs and improves knowledge sharing across tasks, making interactions more versatile.
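
Variable-length inputs to a transformer policy are typically handled by padding per-task token sequences to a common length and masking the padding in attention. A minimal sketch of that pattern; the token contents and sizes are illustrative assumptions, not TokenHSI's actual interface.

```python
import torch
import torch.nn as nn

d = 32
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)

# e.g. proprioception tokens plus a task-dependent number of object/goal tokens
seqs = [torch.randn(5, d), torch.randn(9, d)]   # two samples, different lengths
max_len = max(s.shape[0] for s in seqs)
batch = torch.zeros(len(seqs), max_len, d)
pad_mask = torch.ones(len(seqs), max_len, dtype=torch.bool)  # True = ignore
for i, s in enumerate(seqs):
    batch[i, : s.shape[0]] = s
    pad_mask[i, : s.shape[0]] = False

h = encoder(batch, src_key_padding_mask=pad_mask)  # (2, max_len, 32)
action_features = h[:, 0]                          # read the policy head from one token
```
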
LVSM can generate high-quality 3D views of objects and scenes from a few input images.
VideoScene can generate 3D scenes from sparse video views in one step.
AudioX can generate high-quality audio and music from text, video, images, and existing audio.
AnimeGamer can generate dynamic anime life simulations where players interact with characters using open-ended language instructions. It uses multimodal LLMs to create consistent game states and high-quality animations.