AI Toolbox
A curated collection of 970 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
NormalCrafter can generate consistent surface normals from video sequences. It uses video diffusion models and Semantic Feature Regularization to ensure accurate normal estimation while keeping details clear across frames.
TTT-Video can create coherent one-minute videos from text storyboards. As the paper's title suggests, it uses test-time training instead of self-attention layers to produce consistent multi-context scenes, which is quite an achievement. The paper is worth a read.
Piece it Together can combine different visual components into complete characters or objects. It uses a lightweight flow-matching model called IP-Prior to improve prompt adherence and enable diverse, context-aware generations.
HORT can create detailed 3D point clouds of hand-held objects from just one photo.
AnyTop can generate motions for different characters using only their skeletal structure.
LoRA-MDM can generate stylized human motions in different styles, like “Chicken,” by using a few reference samples with a motion diffusion model. It allows for style blending and motion editing while keeping a good balance between text fidelity and style consistency.
UNO brings subject transfer and preservation from reference images to FLUX with a single model.
TokenHSI can enable physics-based characters to interact with their environment using a unified transformer-based policy. It adapts to new situations with variable-length inputs and improves knowledge sharing across tasks, making interactions more versatile.
LVSM can generate high-quality 3D views of objects and scenes from a few input images.
VideoScene can generate 3D scenes from sparse video views in one step.
AudioX can generate high-quality audio and music from text, video, images, and existing audio.
AnimeGamer can generate dynamic anime life simulations where players interact with characters using open-ended language instructions. It uses multimodal LLMs to create consistent game states and high-quality animations.
GeometryCrafter can recover detailed 3D point maps from open-world videos.
DiffPortrait360 can create high-quality 360-degree views of human heads from single images.
VACE basically adds ControlNet support to video models like Wan and LTX. It handles various video tasks, such as generating videos from references, video inpainting, pose control, sketch-to-video, and more.
Perception-as-Control can achieve fine-grained motion control for image animation by creating a 3D motion representation from a reference image.
MVGenMaster can generate up to 100 new views from a single image using a multi-view diffusion model.
SegAnyMo can segment moving objects in videos without needing human labels.
DiffuseKronA is a personalization method that tries to avoid LoRAs and personalize from input images alone. It generates high-quality images with accurate text-image correspondence and improved color distribution from diverse and complex input images and prompts.
SparseFlex can generate high-resolution 3D meshes with complex shapes and surfaces.