AI Toolbox
A curated collection of 611 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
SpatialTracker can track 2D pixels in 3D space, even when objects are occluded or rotating. It uses monocular depth estimators and a triplane representation to achieve top performance in difficult situations.
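The lifting step can be pictured as unprojecting each tracked pixel with an estimated depth map. A minimal sketch, assuming known intrinsics and one monocular depth map per frame; the triplane encoding and occlusion handling are omitted, and the names are illustrative rather than SpatialTracker's actual API:

```python
import numpy as np

def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with its estimated depth into camera-space 3D."""
    z = depth[int(v), int(u)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Track a point by unprojecting it in every frame of a depth sequence.
depth_seq = np.random.rand(10, 480, 640) * 5.0  # placeholder depth maps (meters)
track_3d = [unproject(320, 240, d, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
            for d in depth_seq]
```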
MuDI can generate high-quality images of multiple subjects without mixing their identities. It achieves a 2x higher success rate for multi-subject personalization and is preferred by over 70% of users in evaluations.
NeRF2Physics can predict the physical properties (mass, friction, hardness, thermal conductivity and Young’s modulus) of objects from a collection of images. This makes it possible to simulate the physical behavior of digital twins in a 3D scene.
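One way to picture this kind of language-grounded property prediction: score each 3D point's fused image feature against a small library of candidate materials and look up the winner's properties. The material table, feature dimensions, and similarity model below are placeholder assumptions, not the paper's actual pipeline:

```python
import numpy as np

MATERIALS = {             # name: (density kg/m^3, hardness (Mohs), friction)
    "wood":   (600.0, 3.0, 0.4),
    "steel":  (7800.0, 6.0, 0.6),
    "rubber": (1100.0, 0.5, 0.9),
}

def predict_properties(point_feats, material_feats):
    """point_feats: (N, D) per-point image features; material_feats: (M, D)."""
    sims = point_feats @ material_feats.T      # (N, M) similarity scores
    best = sims.argmax(axis=1)                 # most similar material per point
    table = np.array(list(MATERIALS.values()))
    return table[best]                         # (N, 3) property estimates per point

pts = np.random.randn(100, 512)                # stand-in per-point features
mats = np.random.randn(len(MATERIALS), 512)    # stand-in material text features
props = predict_properties(pts, mats)
```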
InstructHumans can edit existing 3D human textures using text prompts. It maintains avatar consistency pretty well and enables easy animation.
LCM-Lookahead is another attempted LoRA killer, using an LCM-based lookahead approach for identity transfer in text-to-image generation.
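The "lookahead" idea: while the slow base model is denoising, a fast LCM jumps from the current noisy latent straight to an approximate final image, on which an identity loss can be computed. A hedged sketch with toy stand-ins for the real models:

```python
import torch

def lookahead_identity_loss(noisy_latent, t, lcm_step, decode, id_embed, target_id):
    """Preview the final image with a one-step LCM jump, then score identity."""
    x0_pred = lcm_step(noisy_latent, t)   # fast one-step denoise to a clean latent
    image = decode(x0_pred)               # latent -> RGB preview
    sim = torch.cosine_similarity(id_embed(image), target_id, dim=-1)
    return (1 - sim).mean()

# Toy stand-ins so the sketch runs; real models would replace these.
lcm_step = lambda z, t: z
decode = lambda z: z
id_embed = lambda x: x.flatten(1)
loss = lookahead_identity_loss(torch.randn(1, 4, 8, 8), 500,
                               lcm_step, decode, id_embed, torch.randn(1, 256))
```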
InstantStyle can separate style and content from images in text-to-image generation without tuning. It improves visual style by using features from reference images while keeping text control and preventing style leaks.
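Its simplest trick can be sketched in a few lines: since CLIP embeds images and text in a shared space, subtracting the text embedding of the style image's content from its image embedding leaves a roughly content-free style feature. Names and dimensions below are illustrative, and the injection into style-specific attention blocks is omitted:

```python
import torch
import torch.nn.functional as F

def content_free_style(style_img_emb, content_txt_emb):
    """Subtract the content direction from a style reference embedding."""
    s = F.normalize(style_img_emb, dim=-1)
    c = F.normalize(content_txt_emb, dim=-1)
    return s - c  # what remains is dominated by style, not content

style_emb = torch.randn(1, 768)    # e.g. CLIP image embedding of the style ref
content_emb = torch.randn(1, 768)  # CLIP text embedding describing its content
style_only = content_free_style(style_emb, content_emb)
```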
CameraCtrl can control camera angles and movements in text-to-video generation. It improves video storytelling by adding a camera module to existing video diffusion models, making it easier to create dynamic scenes from text and camera inputs.
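CameraCtrl feeds camera poses to the model as per-pixel Plücker embeddings of the viewing rays. A rough sketch of computing such an embedding, assuming a camera-to-world rotation R and camera center t; conventions and shapes are assumptions, not the repo's exact code:

```python
import numpy as np

def plucker_embedding(K_inv, R, t, H, W):
    """Per-pixel 6D Plücker coordinates (moment, direction) for one camera."""
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    dirs = (R @ (K_inv @ pix)).T                  # world-space ray directions
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    origins = np.broadcast_to(t, dirs.shape)      # camera center for every ray
    moment = np.cross(origins, dirs)
    return np.concatenate([moment, dirs], axis=-1).reshape(H, W, 6)

K_inv = np.linalg.inv(np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1]], float))
emb = plucker_embedding(K_inv, np.eye(3), np.zeros(3), H=480, W=640)
```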
EDTalk can create talking face videos with control over mouth shapes, head poses, and emotions. It uses an Efficient Disentanglement framework to enhance realism by manipulating facial movements through three separate areas.
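At a high level, the disentanglement means mouth shape, head pose, and emotion each get their own latent code, combined into a single motion code so that editing one channel leaves the others untouched. A toy sketch of that structure; the module names are made up, not EDTalk's architecture:

```python
import torch
import torch.nn as nn

class MotionComposer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mouth_enc = nn.Linear(dim, dim)    # driven by audio or video
        self.pose_enc = nn.Linear(dim, dim)
        self.emotion_enc = nn.Linear(dim, dim)

    def forward(self, mouth, pose, emotion):
        # Each component is encoded independently, then combined additively,
        # so editing one channel leaves the others untouched.
        return (self.mouth_enc(mouth) + self.pose_enc(pose)
                + self.emotion_enc(emotion))

composer = MotionComposer()
code = composer(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 256))
```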
CosmicMan can generate high-quality, photo-realistic human images that match text descriptions closely. It uses a unique method called Annotate Anyone and a training framework called Decomposed-Attention-Refocusing (Daring) to improve the connection between text and images.
Following spatial instructions in text-to-image prompts is hard! SPRIGHT-T2I can finally do it though, resulting in more coherent and accurate compositions.
ProbTalk is a method for generating lifelike holistic co-speech motions for 3D avatars. It produces a wide range of motions while keeping facial expressions, hand gestures, and body poses harmoniously aligned.
ID2Reflectance can generate high-quality facial reflectance maps from a single image.
Motion Inversion can customize the motion of generated videos by embedding and transferring the motion of a reference video.
DSTA is a method for video-based human pose estimation that directly regresses joint coordinates from the input video instead of decoding intermediate heatmaps.
GaussianCube is an image-to-3D model that generates high-quality 3D objects from multi-view images. This one also uses 3D Gaussian Splatting, converts the unstructured representation into a structured voxel grid, and then trains a 3D diffusion model to generate new objects.
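The structuring step is the interesting part: fitted Gaussians are assigned to cells of a regular voxel grid so that an off-the-shelf 3D diffusion backbone can process them. GaussianCube uses optimal transport for this assignment; the greedy nearest-voxel toy below is only meant to convey the idea:

```python
import numpy as np

def structure_gaussians(means, feats, res=32):
    """Scatter per-Gaussian features into a res^3 voxel grid."""
    grid = np.zeros((res, res, res, feats.shape[1]))
    idx = np.clip(((means + 1) / 2 * res).astype(int), 0, res - 1)
    for (i, j, k), f in zip(idx, feats):
        grid[i, j, k] = f  # last write wins in this toy version
    return grid

means = np.random.uniform(-1, 1, (1000, 3))  # Gaussian centers in [-1, 1]^3
feats = np.random.rand(1000, 14)             # e.g. scale/rotation/color/opacity
voxels = structure_gaussians(means, feats)   # input for a 3D diffusion model
```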
Garment3DGen can stylize garment geometry and textures from a 2D image and a base 3D mesh! The results can be fitted on top of parametric bodies and simulated. Could be used for hand-garment interaction in VR or to turn sketches into 3D garments.
MonoHair can create high-quality 3D hair from a single video. It uses a two-step process for detailed hair reconstruction and achieves top performance across various hairstyles.
Learning Inclusion Matching for Animation Paint Bucket Colorization can colorize line art in animations from a single colorized frame. The algorithm automatically propagates the colors to the remaining frames, using a learning-based inclusion matching pipeline for more accurate results.
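A much-simplified sketch of the propagation step: match each line-art segment in a target frame to the most similar reference segment and copy its color. The real method learns the matching (including segments contained inside larger ones); this nearest-neighbour version only conveys the structure:

```python
import numpy as np

def propagate_colors(ref_desc, ref_colors, tgt_desc):
    """ref_desc: (N, D) segment descriptors; ref_colors: (N, 3) RGB."""
    # Cosine similarity between target and reference segment descriptors.
    ref = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    tgt = tgt_desc / np.linalg.norm(tgt_desc, axis=1, keepdims=True)
    match = (tgt @ ref.T).argmax(axis=1)  # best reference segment per target
    return ref_colors[match]

ref_desc, ref_colors = np.random.rand(20, 64), np.random.rand(20, 3)
tgt_desc = np.random.rand(18, 64)
colors = propagate_colors(ref_desc, ref_colors, tgt_desc)  # (18, 3)
```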
AiOS can estimate human poses and shapes in one step, combining body, hand, and facial expression recovery.
PAID is a method that enables smooth, highly consistent image interpolation with diffusion models. GANs have been king in that field so far, but this method shows promising results.
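For context, the standard diffusion-interpolation recipe that methods like PAID improve on: spherically interpolate the initial noise latents of the two endpoints (and blend their prompt embeddings), then denoise each blend. PAID's attention-level interpolation is omitted here:

```python
import torch

def slerp(a, b, t, eps=1e-7):
    """Spherical interpolation between two noise tensors."""
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos(((a_n * b_n).sum()).clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

z0, z1 = torch.randn(4, 64, 64), torch.randn(4, 64, 64)   # endpoint latents
frames = [slerp(z0, z1, t) for t in torch.linspace(0, 1, 8)]
# Each latent in `frames` would be denoised with a correspondingly blended prompt.
```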