AI Toolbox
A curated collection of 915 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

InstantStyle can separate style and content from reference images in text-to-image generation without any fine-tuning. It improves visual style by using features from reference images while keeping text control and preventing style leaks.
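A minimal sketch of the core trick, subtracting a content embedding from the style image's embedding in CLIP space before injecting it into the diffusion model. The model name, the placeholder image, and the plain subtraction are assumptions for illustration, not InstantStyle's actual code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

style_image = Image.new("RGB", (224, 224))   # stand-in for a real style reference image
content_text = "a cat"                       # the content we do not want to leak into the result

img_inputs = processor(images=style_image, return_tensors="pt")
txt_inputs = processor(text=[content_text], return_tensors="pt", padding=True)

with torch.no_grad():
    img_feat = model.get_image_features(**img_inputs)   # (1, d) reference image embedding
    txt_feat = model.get_text_features(**txt_inputs)    # (1, d) content text embedding

# Subtract the content embedding from the image embedding so that what gets
# injected into the diffusion model carries style but not subject identity.
style_feat = img_feat - txt_feat
```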
CameraCtrl can control camera angles and movements in text-to-video generation. It improves video storytelling by adding a camera module to existing video diffusion models, making it easier to create dynamic scenes from text and camera inputs.
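A hedged sketch of the kind of camera representation such a camera module consumes: per-pixel Plücker ray embeddings computed from intrinsics and extrinsics. Function name, conventions, and shapes are assumptions, not CameraCtrl's actual implementation:

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Return an (H, W, 6) array of per-pixel (direction, moment) ray coordinates."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3) homogeneous pixel coords

    dirs_cam = pix @ np.linalg.inv(K).T                     # back-project pixels to camera rays
    dirs = dirs_cam @ R                                     # rotate into world space (R is world-to-camera)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    origin = -R.T @ t                                       # camera center in world coordinates
    moments = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moments], axis=-1)

# Toy pinhole camera at the world origin looking down the z-axis.
K = np.array([[500.0, 0.0, 128.0], [0.0, 500.0, 128.0], [0.0, 0.0, 1.0]])
emb = plucker_embedding(K, np.eye(3), np.zeros(3), H=256, W=256)
print(emb.shape)   # (256, 256, 6)
```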
EDTalk can create talking face videos with control over mouth shapes, head poses, and emotions. It uses an Efficient Disentanglement framework to enhance realism by separating facial movements into three distinct latent spaces for mouth shape, head pose, and emotional expression.
CosmicMan can generate high-quality, photo-realistic human images that match text descriptions closely. It uses a data annotation pipeline called Annotate Anyone and a training framework called Decomposed-Attention-Refocusing (Daring) to improve the connection between text and images.
Following spatial instructions in text-to-image prompts is hard! SPRIGHT-T2I can finally follow them reliably, resulting in more coherent and spatially accurate compositions.
ProbTalk is a method for generating lifelike holistic co-speech motions for 3D avatars. The method is able to generate a wide range of motions and ensures a harmonious alignment among facial expressions, hand gestures, and body poses.
ID2Reflectance can generate high-quality facial reflectance maps from a single image.
Motion Inversion can be used to customize the motion of videos by matching the motion of a reference video.
DSTA is a method for video-based human pose estimation that directly maps video input to output joint coordinates.
GaussianCube is an image-to-3D model that generates high-quality 3D objects from multi-view images. It uses 3D Gaussian Splatting, converts the unstructured splat representation into a structured voxel grid, and then trains a 3D diffusion model to generate new objects.
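A hedged sketch of that structuring step: take an unordered set of Gaussian centers and assign them one-to-one to a fixed voxel grid so a standard 3D diffusion model can operate on a regular tensor. GaussianCube formulates this as optimal transport; the toy Hungarian matching and grid size below are only illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

res = 4                                                   # toy grid; the paper uses far higher resolutions
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, res)] * 3, indexing="ij"), -1).reshape(-1, 3)
centers = np.random.uniform(-1, 1, size=(res ** 3, 3))    # stand-in for fitted Gaussian centers

# Cost matrix of squared distances, then a one-to-one Gaussian-to-voxel assignment.
cost = ((centers[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
g_idx, v_idx = linear_sum_assignment(cost)

# Scatter each Gaussian into "its" voxel -> a regular (res, res, res, C) tensor
# that a standard 3D diffusion model can learn to generate.
structured = np.zeros((res ** 3, 3))
structured[v_idx] = centers[g_idx]
structured = structured.reshape(res, res, res, 3)
print(structured.shape)   # (4, 4, 4, 3)
```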
Garment3DGen can stylize garment geometry and textures from a 2D image and a base 3D mesh! The resulting garments can be fitted on top of parametric bodies and simulated. Could be used for hand-garment interaction in VR or to turn sketches into 3D garments.
MonoHair can create high-quality 3D hair from a single video. It uses a two-step process for detailed hair reconstruction and achieves top performance across various hairstyles.
Learning Inclusion Matching for Animation Paint Bucket Colorization can colorize line art in animations by allowing artists to colorize just one frame. The algorithm then automatically applies the color to the rest of the frames, using a learning-based inclusion matching pipeline for more accurate results.
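A hedged sketch of reference-based paint-bucket propagation, reduced to nearest-centroid matching between line-enclosed segments. The paper's learned inclusion matching handles occlusion and topology changes far more robustly; names and shapes here are assumptions:

```python
import numpy as np

def propagate_colors(ref_centroids, ref_colors, new_centroids):
    """ref_centroids: (N, 2), ref_colors: (N, 3), new_centroids: (M, 2) -> (M, 3)."""
    # Squared distance from every new segment centroid to every reference centroid.
    d = ((new_centroids[:, None, :] - ref_centroids[None, :, :]) ** 2).sum(-1)
    # Each new segment inherits the color of its closest reference segment.
    return ref_colors[d.argmin(axis=1)]

ref_c = np.array([[10.0, 10.0], [40.0, 40.0]])     # segment centroids in the colorized frame
ref_col = np.array([[255, 0, 0], [0, 0, 255]])     # colors painted by the artist
new_c = np.array([[12.0, 11.0], [38.0, 43.0]])     # segment centroids in the next frame
print(propagate_colors(ref_c, ref_col, new_c))     # -> red, blue
```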
AiOS can estimate human poses and shapes in one step, combining body, hand, and facial expression recovery.
PAID is a method that enables smooth, high-consistency image interpolation with diffusion models. GANs have been the king in that field so far, but this method shows promising results for diffusion-based interpolation.
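A minimal sketch of the noise-space interpolation that diffusion-based image morphing builds on: plain spherical interpolation between two initial latents. PAID's attention-injection details are not reproduced here, and the latent shapes are assumptions:

```python
import torch

def slerp(z0, z1, alpha):
    """Spherically interpolate between latents z0 and z1 at ratio alpha in [0, 1]."""
    a, b = z0.flatten(), z1.flatten()
    omega = torch.arccos(torch.clamp(torch.dot(a, b) / (a.norm() * b.norm()), -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < 1e-6:                      # nearly parallel latents: fall back to lerp
        return (1 - alpha) * z0 + alpha * z1
    return (torch.sin((1 - alpha) * omega) / so) * z0 + (torch.sin(alpha * omega) / so) * z1

# Nine intermediate latents between two random starting noises.
z_a, z_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
frames = [slerp(z_a, z_b, t) for t in torch.linspace(0, 1, 9)]
```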
TC4D can animate 3D scenes generated from text along arbitrary trajectories. I can see this being useful for generating 3D effects for movies or games.
TRAM can reconstruct human motion and camera movement from videos in dynamic settings. It reduces global motion errors by 60% and uses a video transformer model to accurately track body motion.
Attribute Control enables fine-grained control over attributes of specific subjects in text-to-image models. This lets you modify attributes like age, width, makeup, smile and more for each subject independently.
FlashFace can personalize photos by using one or a few reference face images and a text prompt. It keeps important details like scars and tattoos while balancing text and image guidance, making it useful for face swapping and turning virtual characters into real people.
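A hedged sketch of how separate guidance scales for the text prompt and the reference face can be combined at each denoising step, in the spirit of that text/image balancing. The function, tensor shapes, and scale values are assumptions, not FlashFace's actual API:

```python
import torch

def guided_noise(eps_uncond, eps_text, eps_text_and_face, w_text=7.5, w_face=2.0):
    """Combine three UNet noise predictions into one guided prediction."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)          # push toward the text prompt
            + w_face * (eps_text_and_face - eps_text))   # push toward the reference identity

# Toy tensors standing in for UNet outputs at one denoising step.
shape = (1, 4, 64, 64)
eps_u, eps_t, eps_tf = (torch.randn(shape) for _ in range(3))
eps = guided_noise(eps_u, eps_t, eps_tf)
```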
TRIP is a new approach to image-to-video generation with better temporal coherence.