AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
Point-E can generate 3D point clouds from text prompts in 1-2 minutes on a single GPU. It uses a text-to-image diffusion model to create a view and then a second diffusion model to produce the point cloud, offering a faster option for 3D object generation.
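The two-stage cascade described above can be sketched in miniature. This is a toy illustration with random arrays standing in for learned diffusion models (conditioning is omitted), not Point-E's actual API; all names here are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_sample(shape, steps=10):
    """Stand-in for a diffusion sampler: start from noise and repeatedly
    'denoise' by attenuating it (a real model predicts the noise)."""
    x = rng.standard_normal(shape)
    for _ in range(steps):
        x *= 0.9
    return x

def text_to_view(prompt, size=64):
    # Stage 1: a text-conditioned image diffusion model renders one view.
    # (The prompt conditioning is omitted in this toy.)
    return toy_sample((size, size, 3))

def view_to_points(view, n_points=1024):
    # Stage 2: an image-conditioned diffusion model emits an RGB point cloud.
    return toy_sample((n_points, 6))  # xyz + rgb per point

view = text_to_view("a red chair")
cloud = view_to_points(view)
print(view.shape, cloud.shape)  # (64, 64, 3) (1024, 6)
```

The key design point is that the expensive 3D reasoning is split across two cheap conditional samplers instead of one monolithic text-to-3D model.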
MAGVIT can perform video synthesis tasks such as inpainting, outpainting, and generating animations from single images. It runs roughly 100 times faster than diffusion models and 60 times faster than autoregressive models, while also achieving the best results on multiple benchmarks.
CLIPascene can convert scene images into sketches with different levels of detail and simplicity. Users can create a range of sketches, from detailed to simple, allowing for personalized artistic expression.
3D Neural Field Generation using Triplane Diffusion can create high-quality 3D models from 2D images. It uses a diffusion model to turn ShapeNet meshes into continuous occupancy fields, achieving top results in 3D generation for various object types.
TextureDreamer can transfer detailed textures from just 3 to 5 images to any 3D shape. It uses a method called geometry-aware score distillation to improve texture quality beyond previous techniques.
Latent-NeRF can generate 3D shapes and textures by combining text and shape guidance. It uses latent score distillation to apply this guidance directly on 3D meshes, allowing for high-quality textures on specific geometries.
VectorFusion can generate SVG-exportable vector graphics from text prompts. It uses a text-conditioned diffusion model to create high-quality outputs in various styles, like pixel art and sketches, without needing large datasets of captioned SVGs.
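To make "SVG-exportable" concrete, here is a minimal sketch of the representation such methods optimize: each shape is a closed cubic-Bézier path with a fill color, and the control points and colors are the parameters being tuned. This is illustrative serialization code only, not VectorFusion's pipeline; the function names are invented for the sketch.

```python
def path_to_svg(points, fill):
    """Serialize one closed cubic-Bezier path as an SVG <path> element.
    `points` is [start, ctrl1, ctrl2, end, ctrl1, ctrl2, end, ...]."""
    x0, y0 = points[0]
    curves = " ".join(
        f"C {a} {b} {c} {d} {e} {f}"
        for (a, b), (c, d), (e, f) in
        zip(points[1::3], points[2::3], points[3::3])
    )
    return f'<path d="M {x0} {y0} {curves} Z" fill="{fill}"/>'

def scene_to_svg(paths, size=256):
    """Wrap a list of (points, fill) paths in an <svg> document."""
    body = "\n".join(path_to_svg(p, c) for p, c in paths)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{size}" height="{size}">\n{body}\n</svg>')

svg = scene_to_svg([([(10, 10), (50, 0), (100, 50), (120, 120)], "#e63946")])
print(svg)
```

Because the output is a handful of parametric paths rather than pixels, it scales losslessly and stays editable in any vector tool.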
InstructPix2Pix can edit images based on written instructions. It allows users to add or remove objects, change colors, and transform styles quickly, using a conditional diffusion model trained on a large dataset.
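At sampling time, InstructPix2Pix applies classifier-free guidance over two conditions, the input image and the text instruction, each with its own scale. A minimal sketch of that combination step (the scale values shown are typical defaults, not prescriptions):

```python
import numpy as np

def dual_cfg(e_uncond, e_img, e_full, s_img=1.5, s_txt=7.5):
    """Combine three noise predictions from the diffusion model:
    e_uncond: no conditioning, e_img: image only, e_full: image + text.
    s_img controls faithfulness to the input image, s_txt to the edit."""
    return (e_uncond
            + s_img * (e_img - e_uncond)
            + s_txt * (e_full - e_img))

# Sanity check: if all three predictions agree, guidance is a no-op.
e = np.ones((4, 4))
out = dual_cfg(e, e, e)
print(np.allclose(out, e))  # True
```

Raising `s_txt` makes edits follow the instruction more aggressively, while raising `s_img` pulls the result back toward the original photo.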
MinD-Vis can create realistic images from brain recordings using a method that combines Sparse Masked Brain Modeling and a Double-Conditioned Latent Diffusion Model. It achieves top performance in understanding thoughts and generating images, surpassing previous results by 66% in semantic mapping and 41% in image quality, while needing very few paired examples.
I Hear Your True Colors: Image Guided Audio Generation can generate audio that matches images using a two-stage Transformer model. It produces high-quality sound and introduces the ImageHear dataset for testing future image-to-audio models.
One-2-3-45 can generate a complete 360-degree 3D textured mesh from a single image in just 45 seconds. It uses a view-conditioned 2D diffusion model to create multiple images, resulting in better geometry and consistency than other methods.
MotionBERT can recover 3D human motion from noisy 2D observations. It excels in 3D pose estimation, action recognition, and motion prediction, achieving the lowest pose estimation error when trained from scratch.
EVA3D can generate high-quality 3D human models from 2D image collections. It uses a method called compositional NeRF for detailed shapes and textures, and it improves learning with pose-guided sampling.
VToonify can create high-quality artistic portrait videos from images. It allows for controllable style transfer on non-aligned faces and produces smooth, coherent videos with flexible controls on color and intensity.
AudioLM can generate high-quality audio by treating it like a language task. It produces coherent speech and piano music continuations while keeping the speaker’s voice and style consistent, even for new speakers.
Splatter Image can reconstruct a 3D object from a single image at 38 frames per second and render it at 588 frames per second. It maps the input image to one 3D Gaussian per pixel, so the reconstruction itself is stored as an image of Gaussian Splat parameters.
ARF: Artistic Radiance Fields can transfer the style of a 2D image to a 3D scene by stylizing radiance fields. It captures style details while ensuring that different views of the scene look consistent, resulting in high-quality 3D content that closely matches the original style image.
MCVD can generate videos and predict future and past frames using a masked conditional score-based diffusion model. It achieves high quality and diversity in generated frames, excelling in various video synthesis tasks.
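The trick that lets one model cover generation, future prediction, past prediction, and interpolation is randomly masking the past and future conditioning frames during training. A toy sketch of that task-sampling logic (variable names are my own, not from the paper):

```python
import random

def sample_task(p_mask=0.5):
    """Randomly drop past/future conditioning frames; each mask
    combination corresponds to a different video synthesis task."""
    use_past = random.random() >= p_mask
    use_future = random.random() >= p_mask
    if use_past and use_future:
        return "interpolation"
    if use_past:
        return "future prediction"
    if use_future:
        return "past prediction"
    return "unconditional generation"

random.seed(0)
tasks = {sample_task() for _ in range(100)}
print(sorted(tasks))  # all four tasks appear over many draws
```

At inference, choosing which frames to supply selects the task, with no retraining needed.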
Adobe is entering the image-to-3D game. LRM can create high-fidelity 3D object meshes from a single image in just 5 seconds. The model is trained on massive multi-view data containing around 1 million objects. The results are impressive, and the method generalizes well to real-world photos and images from generative models.
Even though Gaussian Splats have seen a lot of love, NeRFs haven’t been abandoned. This week we got three different NeRF editing papers. The first two, InseRF and GO-NeRF, are both methods for inserting 3D objects into existing NeRF scenes.