AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
VectorFusion can generate SVG-exportable vector graphics from text prompts. It uses a text-conditioned diffusion model to create high-quality outputs in various styles, like pixel art and sketches, without needing large datasets of captioned SVGs.
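For the curious, here is a rough sketch of the score-distillation step this family of methods builds on: the current SVG is rasterized differentiably, noised, and the pretrained diffusion model's noise prediction is turned into a gradient on the path parameters. The `unet`/`scheduler` interfaces mimic the diffusers library, and the VAE encoding and timestep weighting of the real method are omitted, so treat this as illustration only.

```python
# Conceptual sketch of a Score Distillation Sampling (SDS) update on a
# differentiably rasterized SVG. Interfaces mimic diffusers; this is not VectorFusion's code.
import torch

def sds_step(raster, text_emb, unet, scheduler, w=1.0):
    """raster: differentiably rendered image that depends on the SVG path parameters."""
    t = torch.randint(50, 950, (1,), device=raster.device)          # random diffusion timestep
    noise = torch.randn_like(raster)
    noisy = scheduler.add_noise(raster, noise, t)                    # forward-diffuse the render
    with torch.no_grad():
        eps = unet(noisy, t, encoder_hidden_states=text_emb).sample  # text-conditioned noise prediction
    grad = w * (eps - noise)                                         # SDS gradient, no backprop through the U-Net
    raster.backward(gradient=grad)                                   # gradient flows back into the vector paths
```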
InstructPix2Pix can edit images based on written instructions. It allows users to add or remove objects, change colors, and transform styles quickly, using a conditional diffusion model trained on a large dataset.
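The model is easy to try via the Hugging Face diffusers port. A minimal sketch, assuming the timbrooks/instruct-pix2pix checkpoint and a CUDA GPU; the file names and parameter values are only examples:

```python
# Instruction-based image editing with the diffusers port of InstructPix2Pix.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("kitchen.jpg").convert("RGB")       # any input photo
edited = pipe(
    "make the walls bright yellow",                    # the written instruction
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,                          # how closely to stick to the input image
    guidance_scale=7.5,                                # how strongly to follow the instruction
).images[0]
edited.save("kitchen_yellow.jpg")
```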
MinD-Vis can create realistic images from brain recordings using a method that combines Sparse Masked Brain Modeling and a Double-Conditioned Latent Diffusion Model. It achieves top performance in understanding thoughts and generating images, surpassing previous results by 66% in semantic mapping and 41% in image quality, while needing very few paired examples.
I Hear Your True Colors: Image Guided Audio Generation can generate audio that matches images using a two-stage Transformer model. It produces high-quality sound and introduces the ImageHear dataset for testing future image-to-audio models.
One-2-3-45 can generate a complete 360-degree 3D textured mesh from a single image in just 45 seconds. It uses a view-conditioned 2D diffusion model to create multiple images, resulting in better geometry and consistency than other methods.
MotionBERT can recover 3D human motion from noisy 2D observations. It excels in 3D pose estimation, action recognition, and motion prediction, achieving the lowest pose estimation error when trained from scratch.
EVA3D can generate high-quality 3D human models from 2D image collections. It uses a method called compositional NeRF for detailed shapes and textures, and it improves learning with pose-guided sampling.
VToonify can create high-quality artistic portrait videos from images. It allows for controllable style transfer on non-aligned faces and produces smooth, coherent videos with flexible controls on color and intensity.
AudioLM can generate high-quality audio by treating it like a language task. It produces coherent speech and piano music continuations while keeping the speaker’s voice and style consistent, even for new speakers.
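AudioLM itself isn't public, but the framing is easy to illustrate: quantize audio into discrete tokens and train a causal Transformer to predict the next one. A toy stand-in follows; the tokenizer, vocabulary, and model sizes are made-up placeholders, not Google's components.

```python
# Toy "audio as language" model: next-token prediction over discrete audio codes.
import torch
import torch.nn as nn

class TinyAudioLM(nn.Module):
    def __init__(self, vocab=1024, dim=256, layers=4, heads=4, ctx=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)                      # discrete audio codes -> vectors
        self.pos = nn.Embedding(ctx, dim)                          # learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)                          # next-token logits

    def forward(self, tokens):                                     # tokens: (batch, time) integer codes
        T = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=causal))              # predict the next code at each step
```

The real system chains several such stages, predicting coarse semantic tokens before fine acoustic ones, which is what keeps long continuations coherent.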
Splatter Image can reconstruct a 3D object from a single image at 38 frames per second and render it at 588 frames per second.
ARF: Artistic Radiance Fields can transfer the style of a 2D image to a 3D scene by stylizing radiance fields. It captures style details while ensuring that different views of the scene look consistent, resulting in high-quality 3D content that closely matches the original style image.
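The heart of the method is a nearest-neighbor feature matching loss: every VGG feature of a rendered view is pulled toward its closest feature in the style image, and the gradient flows back into the radiance field. A simplified sketch, with feature extraction and weighting omitted:

```python
# Simplified nearest-neighbor feature matching (NNFM) style loss.
import torch
import torch.nn.functional as F

def nnfm_loss(render_feats, style_feats):
    """render_feats: (N, C) VGG features of a rendered view; style_feats: (M, C) of the style image."""
    render_feats = F.normalize(render_feats, dim=-1)
    style_feats = F.normalize(style_feats, dim=-1)
    sim = render_feats @ style_feats.t()          # (N, M) cosine similarities
    best = sim.max(dim=1).values                  # closest style feature for each rendered feature
    return (1.0 - best).mean()                    # minimize cosine distance to the nearest neighbor
```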
MCVD can generate videos and predict future and past frames using a masked conditional score-based diffusion model. It achieves high quality and diversity in generated frames, excelling in various video synthesis tasks.
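The trick behind the "masked conditional" part: during training, past and future conditioning frames are randomly blanked out, so a single denoiser learns unconditional generation, forward prediction, backward prediction, and interpolation at once. A rough sketch of that masking; the shapes and concatenation layout are assumptions, not the paper's exact setup.

```python
# MCVD-style masked conditioning: randomly drop past/future frames during training.
import torch

def build_condition(past, future, p_mask=0.5):
    """past, future: (batch, frames, C, H, W) clips used to condition the denoiser."""
    if torch.rand(()) < p_mask:
        past = torch.zeros_like(past)        # no past -> unconditional generation / backward prediction
    if torch.rand(()) < p_mask:
        future = torch.zeros_like(future)    # no future -> forward prediction
    return torch.cat([past, future], dim=1)  # conditioning stack fed to the score network
```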
Adobe is entering the image-to-3D game. LRM can create high-fidelity 3D object meshes from a single image in just 5 seconds. The model is trained on massive multi-view data containing around 1 million objects. The results are pretty impressive, and the method generalizes well to real-world pictures and images from generative models.
Even though Gaussian Splats have seen a lot of love, NeRFs haven't been abandoned. This week we got three different NeRF editing papers. The first two are about inpainting: InseRF and GO-NeRF are both methods for inserting generated 3D objects into existing NeRF scenes.
Temporal Residual Jacobians can transfer motion from one 3D mesh to another without needing rigging or shape keyframes. It uses two neural networks to predict changes, allowing for realistic motion transfer across different body shapes.
UnZipLoRA can break down an image into its subject and style. This makes it possible to create variations and apply styles to new subjects.
SDEdit can generate and edit photo-realistic images from user-guided inputs like hand-drawn strokes or coarse composites. It outperforms GAN-based methods, achieving higher scores for realism and overall satisfaction without any task-specific training.
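The idea is simple: partially noise the rough input (a stroke painting or a crudely edited photo) and let a pretrained diffusion model denoise it back onto the image manifold. That same noise-then-denoise loop is what the diffusers img2img pipeline exposes, so a hedged sketch looks like this; the checkpoint name and `strength` value are only examples.

```python
# SDEdit-style stroke-to-image: add partial noise to the guide, then denoise.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

guide = Image.open("stroke_painting.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a photo of a mountain lake at sunset",
    image=guide,
    strength=0.6,          # how far to noise the guide: higher = more freedom, less faithfulness
    guidance_scale=7.5,
).images[0]
result.save("mountain_lake.png")
```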
Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries can retrieve high-quality sound effects from a single video frame without needing text metadata. It uses a combination of large language models and contrastive learning to match sound effects to video better than existing methods.
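The retrieval part boils down to a CLIP-style contrastive objective between frame embeddings and sound-effect embeddings. A bare-bones version of that loss; the encoders and dimensions are placeholders, not the paper's architecture.

```python
# Symmetric InfoNCE loss for matching video frames to sound effects in a shared space.
import torch
import torch.nn.functional as F

def frame_audio_contrastive_loss(frame_emb, audio_emb, temperature=0.07):
    """frame_emb, audio_emb: (batch, dim) embeddings of paired frames and sound effects."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = frame_emb @ audio_emb.t() / temperature                   # (batch, batch) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)      # true pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

At query time, the sound-effect library is ranked by cosine similarity to the embedding of the query frame.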
GFPGAN can restore realistic facial details from low-quality images using a pretrained face GAN. It works well on both synthetic and real-world images, restoring faces in a single forward pass instead of the expensive per-image optimization older methods need.
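The open-source `gfpgan` package makes this a few lines. A minimal sketch following the repo's inference script; the checkpoint path and arguments are illustrative.

```python
# One-pass face restoration with the gfpgan package.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(
    model_path="GFPGANv1.4.pth",   # pretrained face-restoration weights
    upscale=2,                     # also upscale the whole image 2x
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,             # optionally plug in Real-ESRGAN for the background
)

img = cv2.imread("old_photo.jpg", cv2.IMREAD_COLOR)   # low-quality input (BGR)
_, _, restored = restorer.enhance(
    img, has_aligned=False, only_center_face=False, paste_back=True
)
cv2.imwrite("old_photo_restored.jpg", restored)
```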