AI Toolbox
A curated collection of 915 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

NOVA-3D can generate 3D anime characters from non-overlapped front and back views.
Images that Sound can generate spectrograms that look like natural images and produce matching audio when played. It uses pre-trained diffusion models to create these spectrograms based on specific audio and visual prompts.
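To get a feel for why a spectrogram that looks like an image is directly playable, here is a minimal sketch (not the paper's diffusion pipeline) that treats a grayscale image as a magnitude spectrogram and inverts it to a waveform with Griffin-Lim; the file name and audio settings are placeholders:

```python
# Minimal sketch: treat a grayscale image as a magnitude spectrogram and
# invert it to audio with Griffin-Lim. This is NOT the Images that Sound
# pipeline; it only illustrates why a spectrogram that "looks like an image"
# is directly playable.
import numpy as np
import torch
import torchaudio
from PIL import Image

N_FFT = 1022          # gives 512 frequency bins, matching a 512px image
HOP = 256
SR = 16000            # assumed sample rate

# Load any grayscale image and scale it into a plausible magnitude range.
img = Image.open("spectrogram_like_image.png").convert("L").resize((512, 512))
mag = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)
mag = torch.flipud(mag)        # images put the origin at the top, spectrograms at the bottom
mag = mag.unsqueeze(0) ** 2.0  # (1, freq, time); squaring yields a power spectrogram

# Griffin-Lim iteratively estimates the missing phase from magnitudes alone.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=N_FFT, hop_length=HOP, n_iter=64)
waveform = griffin_lim(mag)
torchaudio.save("played_image.wav", waveform, SR)
```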
Slicedit can edit videos with a simple text prompt, retaining the structure and motion of the original video while adhering to the target text.
ViViD can transfer a clothing item onto the video of a target person. The method captures garment details and human posture, resulting in more coherent and lifelike videos.
FIFO-Diffusion can generate infinitely long videos from text without extra training. It denoises a fixed-length queue of frames held at increasing noise levels, dequeuing each finished frame and enqueuing fresh noise, so memory use stays constant no matter the video length, and it parallelizes well across multiple GPUs.
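The constant-memory idea is easiest to see as a queue. Below is a hedged sketch of that loop, with denoise_step standing in as a hypothetical placeholder for the actual video diffusion model:

```python
# Sketch of FIFO-style diagonal denoising, assuming a hypothetical
# `denoise_step(frames, noise_levels)` video diffusion call. The queue holds
# a fixed number of frames at strictly increasing noise levels, so memory
# stays constant no matter how long the output video gets.
from collections import deque
import torch

QUEUE_LEN = 16                      # also the number of diffusion steps
frame_shape = (3, 64, 64)           # latent frame shape (assumption)

def denoise_step(frames, noise_levels):
    """Placeholder for one partial-denoising call of a video diffusion model."""
    return frames - 0.01 * torch.randn_like(frames)  # dummy update

# Initialize the queue with frames at noise levels 1..QUEUE_LEN.
queue = deque(torch.randn(frame_shape) for _ in range(QUEUE_LEN))
levels = list(range(1, QUEUE_LEN + 1))

video = []
for _ in range(200):                         # 200 output frames; any length works
    frames = torch.stack(list(queue))
    frames = denoise_step(frames, levels)    # every frame gets one step cleaner
    video.append(frames[0])                  # head of the queue is now fully denoised
    queue = deque(frames[1:])                # dequeue the finished frame...
    queue.append(torch.randn(frame_shape))   # ...and enqueue a fresh noisy one
```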
CondMDI can generate precise and diverse motions that conform to flexible user-specified spatial constraints and text descriptions. This enables the creation of high-quality animations from just text prompts and inpainting between keyframes.
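A common way to get this kind of keyframe conditioning in diffusion models is inpainting: re-impose the observed keyframes, at the right noise level, after every denoising step. Here is a minimal sketch in that spirit, with model and q_sample as hypothetical stand-ins rather than CondMDI's exact formulation:

```python
# Sketch of keyframe inpainting with a motion diffusion model. `model` and
# `q_sample` are hypothetical placeholders: the idea is simply to overwrite
# the observed keyframe poses (noised to the current level) after every
# denoising step, so the generated in-between frames stay consistent with them.
import torch

T, J = 120, 22                       # frames, joints (assumptions)
motion = torch.randn(T, J, 3)        # start from pure noise

keyframe_idx = torch.tensor([0, 60, 119])
keyframes = torch.randn(3, J, 3)     # user-specified poses (placeholder data)

def model(x, t):                     # hypothetical denoiser: predicts x_{t-1}
    return x - 0.001 * x

def q_sample(x0, t):                 # hypothetical forward process: noise x0 to level t
    return x0 + 0.1 * t * torch.randn_like(x0)

for t in reversed(range(1000)):
    motion = model(motion, t)                      # one reverse-diffusion step
    motion[keyframe_idx] = q_sample(keyframes, t)  # re-impose observed keyframes
```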
SignLLM is the first multilingual Sign Language Production (SLP) model. It can generate sign language gestures from input text or prompts and achieve state-of-the-art performance on SLP tasks across eight sign languages.
Toon3D can generate 3D scenes from two or more cartoon drawings. It’s far from perfect, but still pretty cool!
Analogist can enhance images by colorizing, deblurring, denoising, improving low-light quality, and transferring styles using a text-to-image diffusion model. It uses both visual and text prompts without needing extra training, making it a flexible tool for learning with few examples.
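To picture the few-shot setup, here is a rough sketch of a visual in-context layout: paste the example pair (A, A') and the new input B into a 2×2 grid, then let a diffusion inpainting model fill in the missing corner B'. The file names and grid arrangement are assumptions about the setup, not the exact Analogist pipeline:

```python
# Sketch of a visual in-context prompt: the example pair (A -> A') and the
# new input B go into a 2x2 grid, and the bottom-right cell B' is left blank
# for a diffusion inpainting model to fill. File names are placeholders.
from PIL import Image

S = 512
grid = Image.new("RGB", (2 * S, 2 * S), "white")
for name, pos in [("A.png", (0, 0)), ("A_prime.png", (S, 0)), ("B.png", (0, S))]:
    grid.paste(Image.open(name).resize((S, S)), pos)

# The bottom-right cell stays blank: mask it and hand `grid`, the mask, and a
# text prompt to any off-the-shelf diffusion inpainting model to synthesize B'.
grid.save("incontext_grid.png")
```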
Dual3D is yet another text-to-3D method that can generate high-quality 3D assets from text prompts in only 1 minute.
StableMoFusion is a diffusion-based method for human motion generation that eliminates foot-skating and produces stable, efficient animations. It is suited to real-time scenarios such as virtual characters and humanoid robots.
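For intuition, foot-skating is when a planted foot slides between frames. A naive post-hoc cleanup, shown below purely to illustrate the artifact (StableMoFusion's fix happens inside the diffusion pipeline), just pins a foot that is low and nearly stationary:

```python
# Naive foot-skate cleanup: when a foot joint is near the ground and barely
# moving, treat it as "in contact" and lock it to its previous position.
# Thresholds and the trajectory data are placeholders.
import torch

T = 120
foot = torch.randn(T, 3).cumsum(0) * 0.01      # placeholder foot trajectory (x, y, z)
HEIGHT_EPS, VEL_EPS = 0.05, 0.01

for t in range(1, T):
    vel = (foot[t] - foot[t - 1]).norm()
    if foot[t, 1] < HEIGHT_EPS and vel < VEL_EPS:
        foot[t] = foot[t - 1]                   # pin the contact foot in place
```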
SwapTalk can transfer the facial features of a user's avatar onto a target video while lip-syncing it to chosen audio. It improves video quality and lip-sync accuracy, making the results more consistent than other methods.
An Empty Room is All We Want can remove furniture from indoor panorama images, leaving rooms so clean even Jordan Peterson would be proud. Perfect for seeing how your apartment, or the one you’re looking at, would look without all the clutter.
DreamScene4D can generate dynamic 4D scenes from single videos. It tracks object motion and handles complex movements, enabling accurate 2D point tracking by projecting the recovered 3D trajectories back into 2D.
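That 3D-to-2D step is plain pinhole projection. A minimal sketch, with the intrinsics and trajectory data as placeholders:

```python
# Project per-frame 3D point trajectories into the image plane with a
# pinhole camera model. Focal lengths, principal point, and the trajectories
# are placeholders, not values from the paper.
import torch

fx = fy = 500.0                      # assumed focal lengths (pixels)
cx, cy = 320.0, 240.0                # assumed principal point

traj_3d = torch.randn(100, 50, 3) + torch.tensor([0.0, 0.0, 5.0])  # (frames, points, xyz)

x, y, z = traj_3d.unbind(-1)
u = fx * x / z + cx                  # perspective division
v = fy * y / z + cy
tracks_2d = torch.stack([u, v], dim=-1)   # (frames, points, 2) pixel tracks
```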
Pair Customization can customize text-to-image models by learning style differences from a single image pair. It separates style and content into different weight spaces, allowing for effective style application without overfitting to specific images.
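A hedged sketch of the separate-weight-spaces idea: attach two independent LoRA-style deltas to one frozen base layer, let one absorb content and the other absorb style while fine-tuning on the image pair, then apply only the style delta at inference. The names and training split here are assumptions, not the paper's exact recipe:

```python
# Two independent low-rank deltas on one frozen base layer: one for content
# (fit on the original image), one for style (fit on the stylized image).
# At inference, applying only the style delta transfers style without
# overfitting to the pair's content.
import torch
import torch.nn as nn

class PairLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)               # base weights stay frozen
        d_out, d_in = base.weight.shape
        self.content_down = nn.Linear(d_in, rank, bias=False)
        self.content_up = nn.Linear(rank, d_out, bias=False)
        self.style_down = nn.Linear(d_in, rank, bias=False)
        self.style_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.content_up.weight)    # both deltas start as no-ops
        nn.init.zeros_(self.style_up.weight)

    def forward(self, x, use_content=True, use_style=True):
        out = self.base(x)
        if use_content:
            out = out + self.content_up(self.content_down(x))
        if use_style:
            out = out + self.style_up(self.style_down(x))
        return out

layer = PairLoRALinear(nn.Linear(320, 320))
x = torch.randn(1, 320)
stylized_only = layer(x, use_content=False)        # inference: style delta only
```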
StoryDiffusion can generate long-range image and video sequences that maintain consistent content across frames. The method can convert a text-based story into a video with smooth transitions and consistent subjects.
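One popular training-free way to get that consistency is shared self-attention across the batch of frames. A rough sketch of what the augmentation looks like, with shapes and the sampling rate as assumptions:

```python
# Shared self-attention across a batch of frames: each frame's attention also
# sees keys/values sampled from the other frames, so subjects stay consistent
# across the sequence. The real method wraps this around a pretrained
# diffusion model's attention layers.
import torch
import torch.nn.functional as F

B, N, D = 4, 256, 64                 # frames in the batch, tokens, channels
q = torch.randn(B, N, D)
k = torch.randn(B, N, D)
v = torch.randn(B, N, D)

# Sample a subset of tokens from every frame and share the pool batch-wide.
idx = torch.randperm(N)[: N // 2]
shared_k = k[:, idx].reshape(1, -1, D).expand(B, -1, -1)
shared_v = v[:, idx].reshape(1, -1, D).expand(B, -1, -1)

# Each frame attends over its own tokens plus the shared pool.
k_aug = torch.cat([k, shared_k], dim=1)
v_aug = torch.cat([v, shared_v], dim=1)
out = F.scaled_dot_product_attention(q, k_aug, v_aug)
```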
X-Oscar can generate high-quality 3D avatars from text prompts. It uses a step-by-step process for geometry, texture, and animation, while addressing issues like low quality and oversaturation through advanced techniques.
Invisible Stitch can inpaint missing depth information in a 3D scene, resulting in improved geometric coherence and smoother transitions between frames.
VimTS can extract text from images and videos, with improved generalization across different types of media.
DGE is a Gaussian Splatting method that can be used to edit 3D objects and scenes based on text prompts.