AI Toolbox
A curated collection of 811 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

HORT can create detailed 3D point clouds of hand-held objects from just one photo.
AnyTop can generate motions for different characters using only their skeletal structure.
LoRA-MDM can generate stylized human motions, like “Chicken,” from just a few reference samples by learning a LoRA on a motion diffusion model. It allows for style blending and motion editing while keeping a good balance between text fidelity and style consistency.
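To make the style-blending idea concrete, here is a minimal, hypothetical sketch of mixing two LoRA low-rank updates before applying them to a base layer. The function name, shapes, and the `alpha` knob are illustrative assumptions, not LoRA-MDM's actual API.

```python
import torch

def blend_lora_styles(base_weight: torch.Tensor,
                      a_down: torch.Tensor, a_up: torch.Tensor,
                      b_down: torch.Tensor, b_up: torch.Tensor,
                      alpha: float = 0.5, scale: float = 1.0) -> torch.Tensor:
    """Linearly interpolate two LoRA deltas and apply them to a base layer.

    Each LoRA contributes a low-rank update up @ down; mixing the two
    updates with weight alpha blends the styles they encode.
    """
    delta_a = a_up @ a_down  # low-rank update for style A
    delta_b = b_up @ b_down  # low-rank update for style B
    return base_weight + scale * (alpha * delta_a + (1.0 - alpha) * delta_b)

# Toy shapes: a 64x64 layer with two rank-4 style LoRAs, blended 70/30.
W = torch.randn(64, 64)
a_down, a_up = torch.randn(4, 64), torch.randn(64, 4)
b_down, b_up = torch.randn(4, 64), torch.randn(64, 4)
W_styled = blend_lora_styles(W, a_down, a_up, b_down, b_up, alpha=0.7)
```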
UNO brings subject transfer and preservation from a reference image to FLUX with a single model.
TokenHSI can enable physics-based characters to interact with their environment using a unified transformer-based policy. It adapts to new situations with variable length inputs and improves knowledge sharing across tasks, making interactions more versatile.
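As a rough illustration of how one transformer policy can consume variable-length observation tokens, here is a toy PyTorch sketch using padding plus a key-padding mask. It shows only the standard masking mechanism such unified policies build on; the dimensions, pooling, and action head are assumptions, not TokenHSI's architecture.

```python
import torch
import torch.nn as nn

class TokenPolicy(nn.Module):
    """Toy transformer policy over a variable-length set of task tokens."""
    def __init__(self, d_model: int = 128, n_actions: int = 32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, tokens: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # pad_mask: True marks padding positions that attention should ignore.
        h = self.encoder(tokens, src_key_padding_mask=pad_mask)
        # Mean-pool only over real (non-padded) tokens, then predict an action.
        valid = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)

policy = TokenPolicy()
tokens = torch.randn(2, 10, 128)        # batch of 2, up to 10 tokens each
pad_mask = torch.zeros(2, 10, dtype=torch.bool)
pad_mask[1, 6:] = True                  # second sample only has 6 real tokens
actions = policy(tokens, pad_mask)      # -> (2, 32)
```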
LVSM can generate high-quality 3D views of objects and scenes from a few input images.
VideoScene can generate 3D scenes from sparse video views in one step.
AudioX can generate high-quality audio and music from text, video, images, and existing audio.
AnimeGamer can generate dynamic anime life simulations where players interact with characters using open-ended language instructions. It uses multimodal LLMs to create consistent game states and high-quality animations.
GeometryCrafter can recover detailed 3D point maps from open-world videos.
DiffPortrait360 can create high-quality 360-degree views of human heads from single images.
VACE basically adds ControlNet support to video models like Wan and LTX. It handles various video tasks like generating videos from references, video inpainting, pose control, sketch-to-video, and more.
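For intuition, here is a minimal, hypothetical sketch of ControlNet-style injection in a single denoiser block: a control branch encodes the conditioning signal (pose maps, sketches, reference frames) and its features are added to the main branch, scaled by a strength knob. All names and shapes here are illustrative assumptions, not VACE's actual design.

```python
import torch
import torch.nn as nn

class ControlledDenoiserBlock(nn.Module):
    """Illustrative ControlNet-style injection for one video-denoiser block."""
    def __init__(self, channels: int):
        super().__init__()
        self.main = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.control = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, hidden: torch.Tensor, control_signal: torch.Tensor,
                strength: float = 1.0) -> torch.Tensor:
        # hidden / control_signal: (batch, channels, frames, height, width)
        return self.main(hidden) + strength * self.control(control_signal)

block = ControlledDenoiserBlock(channels=8)
hidden = torch.randn(1, 8, 16, 32, 32)   # latent video features
pose = torch.randn(1, 8, 16, 32, 32)     # encoded pose-control signal
out = block(hidden, pose, strength=0.8)
```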
Perception-as-Control can achieve fine-grained motion control for image animation by creating a 3D motion representation from a reference image.
MVGenMaster can generate up to 100 new views from a single image using a multi-view diffusion model.
SegAnyMo can segment moving objects in videos without needing human labels.
DiffuseKronA is another personalization method, one that avoids LoRAs and works directly from input images. It generates high-quality images with accurate text-image correspondence and improved color distribution from diverse and complex input images and prompts.
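The “KronA” in the name refers to Kronecker-product adapters: instead of LoRA's low-rank factors, the weight update is the Kronecker product of two small matrices. A minimal sketch of that idea (illustrative shapes and names, not the paper's code):

```python
import torch

def krona_update(base_weight: torch.Tensor,
                 a: torch.Tensor, b: torch.Tensor,
                 scale: float = 1.0) -> torch.Tensor:
    """Apply a Kronecker-product adapter update: delta_W = kron(a, b).

    Shapes must satisfy base_weight.shape == (a_rows * b_rows, a_cols * b_cols),
    so two small factors can parameterize a full-size update cheaply.
    """
    delta = torch.kron(a, b)
    assert delta.shape == base_weight.shape
    return base_weight + scale * delta

# A 64x64 layer adapted with two 8x8 factors: 128 parameters instead of 4096.
W = torch.randn(64, 64)
a_factor = torch.randn(8, 8) * 0.01
b_factor = torch.randn(8, 8) * 0.01
W_personalized = krona_update(W, a_factor, b_factor)
```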
SparseFlex can generate high-resolution 3D meshes with complex shapes and surfaces.
LeX-Art can generate high-quality text-image pairs with better text rendering and design. It uses a prompt enrichment model called LeX-Enhancer and two optimized models, LeX-FLUX and LeX-Lumina, to improve color, position, and font accuracy.
TexGaussian can generate high-quality PBR materials for 3D meshes in one step. It produces albedo, roughness, and metallic maps quickly and with great visual quality, ensuring better consistency with the input geometry.
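Once such per-channel maps exist, they are often packed for game engines. As a small, generic example (not TexGaussian code; the occlusion map here is a placeholder assumption), roughness and metallic maps can be combined with an ambient-occlusion map into a single glTF-style ORM texture:

```python
import numpy as np
from PIL import Image

def pack_orm(occlusion: np.ndarray, roughness: np.ndarray,
             metallic: np.ndarray) -> Image.Image:
    """Pack grayscale maps in [0, 1] into one glTF-style ORM texture.

    glTF stores occlusion in R, roughness in G, and metallic in B.
    """
    orm = np.stack([occlusion, roughness, metallic], axis=-1)
    return Image.fromarray((orm * 255.0).clip(0, 255).astype(np.uint8))

# Toy 256x256 maps; real ones would come from the texture generator.
h = w = 256
ao = np.ones((h, w), dtype=np.float32)            # no occlusion
rough = np.full((h, w), 0.6, dtype=np.float32)    # fairly rough surface
metal = np.zeros((h, w), dtype=np.float32)        # dielectric
pack_orm(ao, rough, metal).save("material_orm.png")
```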
AccVideo can speed up video diffusion models by reducing the number of steps needed for video creation. It achieves an 8.5x faster generation speed compared to HunyuanVideo, producing high-quality videos at 720x1280 resolution and 24fps, which makes text-to-video generation way more efficient.
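The speedup comes from running fewer denoising steps, each of which costs one network evaluation. This generic, illustrative sampling loop (not AccVideo's code; the Euler update and stand-in denoiser are assumptions) shows why wall-clock time scales roughly linearly with step count, so cutting the number of steps cuts generation time almost proportionally:

```python
import torch

@torch.no_grad()
def sample_video(denoiser, shape, num_steps: int) -> torch.Tensor:
    """Generic diffusion-style sampling loop; cost grows linearly with num_steps."""
    x = torch.randn(shape)                          # start from pure noise
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = denoiser(x, t)                          # one network call per step
        x = x + (t_next - t) * v                    # simple Euler update
    return x

# Stand-in velocity field that just transports samples toward zero; a real
# distilled video model would replace it.
denoiser = lambda x, t: x / t.clamp(min=1e-3)
latent = sample_video(denoiser, shape=(1, 4, 16, 32, 32), num_steps=6)
```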