AI Toolbox
A curated collection of 849 free, cutting-edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.

Face Anon can anonymize faces in images while preserving the original facial expressions and head poses. It uses diffusion models to achieve high-quality results and can also handle face swapping.
CityGaussianV2 can reconstruct large-scale scenes from multi-view RGB images with high accuracy.
Self-Supervised Any-Point Tracking by Contrastive Random Walks can track any point in a video using a self-supervised global matching transformer.
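
The contrastive random walk idea behind this tracker is compact enough to sketch: treat patches as nodes in a space-time graph, walk forward across frames using feature similarity, walk back, and train so that every patch returns to where it started. Below is a minimal, illustrative version of that cycle-consistency loss; the feature extractor and patch sampling are assumed, and this is not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def transition(a, b, tau=0.07):
    """Row-stochastic transition matrix between node features of two frames."""
    return F.softmax(a @ b.T / tau, dim=-1)

def cycle_consistency_loss(feats, tau=0.07):
    """Contrastive random walk loss.

    feats: (T, N, D) per-frame node embeddings, e.g. patch features from
    T video frames with N patches each. A walker steps forward through
    all frames and back again; it is penalized whenever it fails to
    return to its starting node.
    """
    T, N, _ = feats.shape
    feats = F.normalize(feats, dim=-1)
    P = torch.eye(N, device=feats.device)
    for t in range(T - 1):                    # forward walk
        P = P @ transition(feats[t], feats[t + 1], tau)
    for t in reversed(range(T - 1)):          # backward walk
        P = P @ transition(feats[t + 1], feats[t], tau)
    # Each row of P is a distribution over end nodes; the target is the diagonal.
    return -torch.log(P.diagonal() + 1e-8).mean()
```
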
MOFT is a training-free video motion interpreter and controller. It extracts motion information from video diffusion models and uses it to guide the motion of generated videos, no retraining required.
PF3plat can synthesize photorealistic novel views and estimate accurate camera poses from uncalibrated image collections.
ScalingConcept can enhance or suppress concepts that already exist in images and audio without adding new elements. Example uses include pose generation, better object stitching, and reducing fuzziness in anime productions.
NoPoSplat can reconstruct 3D Gaussian scenes from multi-view images. It achieves real-time reconstruction and high-quality renderings, especially with sparse input views.
ControlAR adds spatial controls like edges, depth maps, and segmentation masks to autoregressive models like LlamaGen.
State-of-the-art diffusion models are typically trained on square images. FiT is a transformer architecture designed specifically for generating images at unrestricted resolutions and aspect ratios (similar to what Sora does). By treating an image as a variable-length sequence of patch tokens, it adapts to diverse aspect ratios during both training and inference, improving resolution generalization and eliminating the biases introduced by image cropping; a sketch of the idea follows.
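
The core trick, tokenizing an image as a variable-length patch sequence instead of a fixed square grid, is easy to illustrate. This is a simplified sketch of that tokenization plus the padding mask needed to batch mixed aspect ratios; it is not FiT's actual implementation.

```python
import torch

def patchify(img, p=16):
    """Turn an image of any resolution into a variable-length token sequence.
    img: (C, H, W) with H and W divisible by p.
    Returns (N, C*p*p) tokens with N = (H//p) * (W//p), so no square crop is needed.
    """
    C, H, W = img.shape
    t = img.unfold(1, p, p).unfold(2, p, p)          # (C, H//p, W//p, p, p)
    return t.permute(1, 2, 0, 3, 4).reshape(-1, C * p * p)

def pad_batch(seqs, pad_value=0.0):
    """Pad variable-length token sequences and build an attention mask,
    so one batch can mix aspect ratios instead of cropping to squares."""
    n = max(s.shape[0] for s in seqs)
    batch = torch.full((len(seqs), n, seqs[0].shape[1]), pad_value)
    mask = torch.zeros(len(seqs), n, dtype=torch.bool)
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s
        mask[i, : s.shape[0]] = True
    return batch, mask

# A 256x512 and a 512x256 image land in one batch without cropping:
tokens, mask = pad_batch([patchify(torch.randn(3, 256, 512)),
                          patchify(torch.randn(3, 512, 256))])
```
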
From Text to Pose to Image can generate high-quality images from text prompts by first creating poses and then using them to guide image generation. This method improves control over human poses and enhances image fidelity in diffusion models.
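
The two-stage structure is straightforward to wire up. In the sketch below, `generate_pose` is a hypothetical stand-in for the paper's text-to-pose model, while the second stage uses an off-the-shelf OpenPose ControlNet from diffusers as the pose-conditioned image generator; treat this as an outline of the pipeline shape, not the authors' code.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

def generate_pose(prompt):
    """Hypothetical stage 1: text -> OpenPose-style skeleton image.
    Plug in the paper's text-to-pose model (or any pose generator) here."""
    raise NotImplementedError

prompt = "a chef juggling three oranges in a busy kitchen"
pose_image = generate_pose(prompt)  # stage 1: text -> pose

# Stage 2: pose + text -> image via a pose-conditioned diffusion model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
image = pipe(prompt, image=pose_image).images[0]
```
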
GANs aren’t dead yet. SphereHead generates stable and high-quality 3D full-head human faces from all angles with significantly fewer artifacts compared to previous methods. Best one I’ve seen so far.
MoGe can turn images and videos into 3D point maps.
FreCaS can generate high-resolution images quickly by splitting sampling into cascaded stages of increasing detail, as sketched below. It is about 2.86× to 6.07× faster than other methods at generating 2048×2048 images while significantly improving image quality.
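
The cascaded idea generalizes beyond FreCaS: spend most denoising steps at low resolution, then upsample and re-noise so each later stage only refines high-frequency detail. The sketch below shows that loop with a placeholder `denoise` callable and an illustrative noise schedule, not FreCaS's actual frequency-aware sampling.

```python
import torch
import torch.nn.functional as F

def cascade_sample(denoise, sizes=(512, 1024, 2048), steps=(30, 15, 8)):
    """Coarse-to-fine sampling: most steps at low resolution, fewer as
    resolution grows. `denoise(x, num_steps)` is any image/latent denoiser."""
    x = torch.randn(1, 3, sizes[0], sizes[0])
    for size, n in zip(sizes, steps):
        x = F.interpolate(x, size=(size, size), mode="bilinear",
                          align_corners=False)
        x = x + 0.3 * torch.randn_like(x)  # re-noise before refining details
        x = denoise(x, num_steps=n)
    return x

# Runs end to end with a dummy identity denoiser:
out = cascade_sample(lambda x, num_steps: x)
```
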
Stable-Hair can robustly transfer a diverse range of real-world hairstyles onto user-provided faces for virtual hair try-on. It employs a two-stage pipeline that includes a Bald Converter for hair removal and specialized modules for high-fidelity hairstyle transfer.
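
The described two-stage pipeline reduces to a simple composition. In this sketch, `bald_converter` and `hair_transfer` are hypothetical stand-ins for the paper's modules.

```python
def try_on_hairstyle(face_img, reference_hair_img, bald_converter, hair_transfer):
    """Stage 1: remove the subject's existing hair (the "Bald Converter").
    Stage 2: paint the reference hairstyle onto the bald proxy."""
    bald_proxy = bald_converter(face_img)
    return hair_transfer(bald_proxy, reference_hair_img)
```
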
TANGO can generate high-quality body-gesture videos that match speech audio from a single video. It improves realism and synchronization by fixing audio-motion misalignment and using a diffusion model for smooth transitions.
MagicTailor can personalize specific visual components from reference images in text-to-image diffusion models. It improves image quality and preserves the subject's identity while reducing semantic pollution.
DisEnvisioner can generate customized images from a single visual prompt and extra text instructions. It filters out irrelevant details and provides better image quality and speed without needing extra tuning.
GenAu is a new scalable transformer-based audio generation architecture that is able to generate high-quality ambient sounds and effects.
HeadStudio is another text-to-3D approach that can generate animatable head avatars. The method produces high-fidelity avatars with smooth expression deformation and real-time rendering.
ReWaS can generate sound effects from text and video. The method is able to estimate the structural information of audio from the video while receiving key content cues from a user prompt.