AI Art Weekly #64
Hello there, my fellow dreamers, and welcome to issue #64 of AI Art Weekly!
The end of the year is drawing near, and with it, research is gradually slowing down. Compared to previous weeks, there were only 142 papers.
Before we get down to business, I wanted to thank you all for your support over the year. It's been one hell of a wild ride. I'll be taking the next week off to rest up for what 2024 holds. I wish you all a wonderful holiday season and a joyful new year!
Letโs dive in:
- Midjourney v6 alpha released
- VideoPoet turns LLMs into video generators
- GAvatar generates animatable 3D Gaussian avatars
- Align Your Gaussians generates dynamic 4D assets
- VidToMe edits videos with a text prompt
- PIA animates images with a text prompt
- MoSAR turns portraits into relightable 3D avatars
- And HAAR gives them hair
- Paint-it generates texture maps for 3D meshes
- Intrinsic Image Diffusion predicts materials from an image
- Splatter Image creates 3D reconstructions from videos in real-time
- RelightableAvatar turns videos into relightable 3D humans
- DreamTalk can animate faces
- and more tutorials, tools and gems!
Want me to keep up with AI for you? Well, that requires a lot of coffee. If you like what I do, please consider buying me a cup so I can stay awake and keep doing what I do.
Cover Challenge 🎨
For the next cover I'm looking for everything, so do whatever you wanna do. The challenge runs two weeks, so the reward is $100 and a rare role in our Discord community which lets you vote in the finals. The rulebook can be found here and images can be submitted here. I'm looking forward to your submissions!
News & Papers
Midjourney v6 alpha released
Midjourney released an early version of their v6 model this week. In short:
- The new model has better prompt understanding (Example)
- Improved coherence and model knowledge (Example)
- Supports drawing some text (Example)
- Two new upscalers, with both subtle and creative modes
First results look very promising; the photorealism in particular is stunning. Prompt understanding isn't on par with DALL·E 3 yet, but it's definitely a step in the right direction.
VideoPoet: A large language model for zero-shot video generation
Google revealed that large language models can generate videos. Their method, VideoPoet, can convert any autoregressive large language model (LLM) into a high-quality video generator, capable of producing both video and audio.
Seeing a lot of the capabilities we've explored throughout the year come together in a single multi-modal model is super cool.
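To make that concrete, here's a conceptual toy of the recipe — tokenize video into discrete codes, then do plain next-token prediction. The model sizes, vocabulary, and the stand-in tokenizer are all my own illustrative assumptions, not VideoPoet's actual architecture:

```python
import torch
import torch.nn as nn

# Toy autoregressive "video LLM": frames are assumed to be encoded into
# discrete tokens by a separate tokenizer (e.g. a VQ-style codebook); the
# transformer then just does next-token prediction, exactly like text.
VOCAB, DIM, CONTEXT = 1024, 256, 512  # made-up sizes for illustration

class TinyVideoLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(CONTEXT, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # tokens: (B, T) discrete video-token ids
        T = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # (B, T, VOCAB) logits

model = TinyVideoLM()
prompt = torch.randint(0, VOCAB, (1, 8))  # stand-in for tokenized frames
for _ in range(8):                        # sample 8 more video tokens
    logits = model(prompt)[:, -1]
    next_tok = torch.multinomial(logits.softmax(-1), 1)
    prompt = torch.cat([prompt, next_tok], dim=1)
print(prompt.shape)  # torch.Size([1, 16])
```

The design point is that once video lives in a discrete token space, the usual LLM machinery (prompting, in-context conditioning, mixing modalities in one sequence) applies to it unchanged.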
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
NVIDIA shared GAvatar this week, a new method that can generate realistic, animatable 3D Gaussian splat avatars from text. The method can not only generate highly detailed textured meshes, but can also render them at 100 fps at 1K resolution.
Align Your Gaussians
Speaking of NVIDIA, they also shared Align Your Gaussians, a new method that can generate dynamic 4D assets from text prompts. It also supports the ability to create looping animations as well as chaining multiple text prompts to create changing animations.
VidToMe: Video Token Merging for Zero-Shot Video Editing
VidToMe can edit videos with a text prompt, custom models, and ControlNet guidance while achieving great temporal consistency. The key idea is to merge similar tokens across multiple frames inside the self-attention modules, so that corresponding regions share the same representation from frame to frame.
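For intuition, here's a toy sketch of cross-frame token merging — not the authors' implementation; the cosine-similarity threshold and the simple averaging merge are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens_a, tokens_b, threshold=0.9):
    """Toy cross-frame token merging (illustrative, not VidToMe's code).

    tokens_a, tokens_b: (N, D) self-attention tokens from two video frames.
    Tokens in frame B that are very similar to a token in frame A are
    replaced by the average of the pair, so both frames attend to shared
    representations and stay temporally consistent.
    """
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    sim = b @ a.T                        # (N, N) cosine similarities
    best_sim, best_idx = sim.max(dim=-1)
    merged_b = tokens_b.clone()
    mask = best_sim > threshold          # only merge near-duplicates
    merged_b[mask] = 0.5 * (tokens_b[mask] + tokens_a[best_idx[mask]])
    return merged_b

frame_a = torch.randn(16, 64)                    # 16 tokens, 64-dim features
frame_b = frame_a + 0.01 * torch.randn(16, 64)   # nearly identical next frame
print(merge_similar_tokens(frame_a, frame_b).shape)  # torch.Size([16, 64])
```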
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
PIA is yet another method that can animate images generated by custom Stable Diffusion checkpoints with realistic motions based on a text prompt.
MoSAR: Monocular Semi-Supervised Model For Avatar Reconstruction Using Differentiable Shading
MoSAR is able to turn a single portrait image into a relightable 3D avatar with detailed geometry and rich reflectance maps at 4K resolution.
HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles
Now that we've got the face, we need some hair. HAAR can generate 3D strand-based human hairstyles from text prompts. The model is able to interpolate between different hairstyles, edit them, and even animate them. Super cool, and I can't wait until tech like this finds its way into the next FromSoftware character creator. You can see it in action here.
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
Paint-it can generate high-fidelity physically-based rendering (PBR) texture maps for 3D meshes from a text description. The method is able to relight the mesh by changing High-Dynamic Range (HDR) environmental lighting and control the material properties at test-time.
Intrinsic Image Diffusion for Single-view Material Estimation
A challenge so far when generating 3D objects has been dealing with "baked" textures, which often contain excessive, static shadowing that leads to inaccuracies in dynamic lighting environments. Intrinsic Image Diffusion addresses this by predicting materials, generating albedo, roughness, and metallic maps from a single image.
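To see why these decomposed maps matter, here's a minimal toy relighting function (my own sketch, not the paper's renderer — real PBR uses a proper microfacet model):

```python
import numpy as np

def relight(albedo, roughness, metallic, light_dir, normal, light_color):
    """Toy per-pixel shading from decomposed material maps.

    albedo: (H, W, 3); roughness, metallic: (H, W, 1); normal: (H, W, 3).
    Crude stand-in for physically-based rendering: a diffuse term from
    albedo, a specular lobe that sharpens as roughness decreases, and a
    specular color tinted by albedo where the surface is metallic.
    """
    n_dot_l = np.clip(np.sum(normal * light_dir, axis=-1, keepdims=True), 0, 1)
    diffuse = albedo * n_dot_l * (1 - metallic)       # metals have no diffuse
    shininess = 2.0 / np.maximum(roughness**2, 1e-4)  # rougher => wider lobe
    spec = n_dot_l ** shininess
    spec_color = (1 - metallic) * 0.04 + metallic * albedo  # F0 approximation
    return (diffuse + spec * spec_color) * light_color
```

Because the predicted maps are free of baked-in shadows, swapping `light_dir` or `light_color` relights the object plausibly, which a baked texture can't do.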
Splatter Image: Ultra-Fast Single-View 3D Reconstruction
Splatter Image is an ultra-fast method that creates 3D reconstructions from a single image, or frame by frame from monocular videos, at 38 fps, and renders them at 588 fps. Quality isn't as high as with multi-view methods, but the fact that you can turn a video into a 4D scene almost instantly is nuts.
Relightable and Animatable Neural Avatars from Videos
RelightableAvatar is another method that can create relightable and animatable neural avatars from monocular video.
DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
DreamTalk is able to generate talking heads conditioned on a given text prompt. The model is able to generate talking heads in multiple languages and can also manipulate the speaking style of the generated video.
Also interesting
- pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
- MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance
- SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing
- Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting
- CrossDiff: Realistic Human Motion Generation with Cross-Diffusion Models
- HCBlur: Deep Hybrid Camera Deblurring
@toyxyz3 is experimenting with using StableZero123 to rotate characters, LCM to instantly process them and concept slider to edit attributes.
Interview
This week weโre talking to AI artist QuantumSpirit aka Jen Panepinto.
Tools & Tutorials
These are some of the most interesting resources I've come across this week.
StreamDiffusion is a pipeline-level solution for real-time interactive generation that can generate up to 106 frames per second on an RTX 4090 with SD-Turbo.
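StreamDiffusion layers pipeline tricks (batched denoising, caching) on top of few-step models; as a baseline for what makes real-time generation feasible at all, here's the standard one-step SD-Turbo call via diffusers — note this is plain diffusers usage, not StreamDiffusion's own API:

```python
import torch
from diffusers import AutoPipelineForText2Image

# One-step SD-Turbo generation: guidance is disabled and a single denoising
# step is used, which is what keeps per-frame latency this low.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    "a cinematic photo of a fox in the snow",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("frame.png")
```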
VolumeDiffusion is a fast and scalable text-to-3D generation method that gives you a 3D object within seconds/minutes.
CLIP-DINOiser improves the notoriously noisy MaskCLIP feature maps and produces much smoother outputs.
Now, the masks from CLIP-DINOiser are still not perfect. This is where SegRefiner comes in to, well, refine the segmentation masks.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it ❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it; putting these issues together takes me 8-12 hours every Friday)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
โ dreamingtulpa