AI Toolbox
A curated collection of 965 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
DisEnvisioner can generate customized images from a single visual prompt and extra text instructions. It filters out irrelevant details and provides better image quality and speed without needing extra tuning.
Vid2Avatar-Pro can create photorealistic animatable 3D human avatars from single videos.
GenAu is a new scalable transformer-based audio generation architecture that is able to generate high-quality ambient sounds and effects.
HeadStudio is another text-to-3D avatar model that can generate animatable head avatars. The method is able to produce high-fidelity avatars with smooth expression deformation and real-time rendering.
ReWaS can generate sound effects from text and video. The method is able to estimate the structural information of audio from the video while receiving key content cues from a user prompt.
DepthSplat can reconstruct 3D scenes from only a few images by connecting Gaussian splatting and depth estimation.
MonST3R can estimate 3D shapes from videos over time, creating a dynamic point cloud and tracking camera positions. This method improves video depth estimation and separates moving from still objects more effectively than previous techniques.
UniPortrait can customize images of one or more people with high quality. It allows for detailed face editing and uses free-form text descriptions to guide changes.
F5-TTS can generate natural-sounding speech using a fast text-to-speech system. It supports multiple languages, can switch between languages smoothly, and is trained on a 100,000-hour dataset.
MimicTalk can generate personalized 3D talking faces in under 15 minutes. It mimics a person’s talking style using a special audio-to-motion model, resulting in high-quality videos.
GS^3 can relight scenes in real-time using a triple Gaussian splatting process. It achieves high-quality lighting and view synthesis from multiple images, running at 90 fps on a single GPU.
DreamWaltz-G can generate high-quality 3D avatars from text and animate them using SMPL-X motion sequences. It improves avatar consistency with Skeleton-guided Score Distillation and is useful for human video reenactment and creating scenes with multiple subjects.
Tex4D can generate 4D textures for untextured mesh sequences from a text prompt. It combines 3D geometry with video diffusion models to ensure the textures are consistent across different views and frames.
HART is an autoregressive transformer model that can generate high-quality 1024x1024 images from text 3x faster than SD3-Medium.
EfficientViT can speed up high-resolution diffusion models by compressing data at a ratio of up to 128 while preserving image quality. It achieves a 19.1x inference speedup and a 17.9x training speedup on ImageNet 512x512 compared to other autoencoders.
Depth Any Video can generate high-resolution depth maps for videos. It uses a large dataset of 40,000 annotated clips to improve accuracy and includes a method for better depth inference across sequences of up to 150 frames.
CtrLoRA can adapt a base ControlNet for image generation with just 1,000 data pairs in under one hour of training on a single GPU. It reduces learnable parameters by 90%, making it much easier to create new guidance conditions.
Feels like we get an image-to-3D method each week now. CRM is yet another one that can generate 3D objects from a single image. This one is able to create high-fidelity textured meshes with interactable surfaces in just 10 seconds. The results are stunning!
SceneCraft can generate detailed indoor 3D scenes from user layouts and text descriptions. It is able to turn 3D layouts into 2D maps, producing complex spaces with diverse textures and realistic visuals.
TweedieMix can generate images and videos that combine multiple personalized concepts.