AI Toolbox
A curated collection of 915 free cutting edge AI papers with code and tools for text, image, video, 3D and audio generation and manipulation.





Anywhere can place any object from an input image into any suitable and diverse location in an output image. Perfect for product placement.
Make-it-Real can recognize and describe materials using GPT-4V, helping to build a detailed material library. It aligns materials with 3D object parts and creates SVBRDF materials from albedo maps, improving the realism of 3D assets.
ConsistentID can generate diverse personalized ID images from text prompts using just one reference image. It improves identity preservation with a facial prompt generator and an ID-preservation network, ensuring high quality and variety in the generated images.
And on the pose reconstruction front we have had TokenHMR, which can extract human poses and shapes from a single image.
SVA can generate sound effects and background music for videos based on a single key frame and a text prompt.
MaGGIe can efficiently predict high-quality human instance mattes from coarse binary masks for both image and video input. The method is able to output all instance mattes simultaneously without exploding memory and latency, making it suitable for real-time applications.
Similar to ConsistentID, PuLID is a tuning-free ID customization method for text-to-image generation. This one can also be used to edit images generated by diffusion models by adding or changing the text prompt.
CharacterFactory can generate endless characters that look the same across different images and videos. It uses GANs and word embeddings from celebrity names to ensure characters stay consistent, making it easy to integrate with other models.
Parts2Whole can generate customized human portraits from multiple reference images, including pose images and various aspects of human appearance. The method is able to generate human images conditioned on selected parts from different humans as control conditions, allowing you to create images with specific combinations of facial features, hair, clothes, etc.
PhysDreamer is a physics-based approach that enables you to poke, push, pull and throw objects in a virtual 3D environment and they will react in a physically plausible manner.
TF-GPH can blend images with disparate visual elements together stylistically!
FlowSAM can discover and segment moving objects in videos by combining the Segment Anything Model (SAM) with optical flow. It outperforms previous methods, achieving better object identity and sequence-level segmentation for both single and multi-object scenarios.
DG-Mesh is able to reconstruct high-quality and time-consistent 3D meshes from a single video. The method is also able to track the mesh vertices over time, which enables texture editing on dynamic objects.
AniClipart can turn static clipart images into high-quality animations. It uses Bézier curves for smooth motion and aligns movements with text prompts, improving how well the animation matches the text and maintains visual style.
CustomDiffusion360 brings camera viewpoint control to text-to-image models. Only caveat: it requires a 360 degree multi-view dataset of around 50 images per object to work.
StyleBooth is a unified style editing method supporting text-based, exemplar-based and compositional style editing. So basically, you can take an image and change its style by either giving it a text prompt or an example image.
InFusion can inpaint 3D Gaussian point clouds to restore missing 3D points for better visuals. It lets users change textures and add new objects, achieving high quality and efficiency.
IntrinsicAnything is able to recover object materials from any images and enable single-view image relighting.
VQ-Diffusion can generate high-quality images from text prompts using a vector quantized variational autoencoder and a conditional denoising diffusion model. It is up to fifteen times faster than traditional methods and handles complex scenes effectively.
MOWA is a multiple-in-one image warping model that can be used for various tasks such as rectangling panoramic images, unrolling shutter images, rotating images, fisheye images, and image retargeting.