AI Art Weekly #63
Hello there, my fellow dreamers, and welcome to issue #63 of AI Art Weekly! 👋
Another week, another 187 papers skimmed through. I’m a bit short on time today, so I’m going to keep this intro short. I hope you enjoy this issue and I’ll see you next week! 🙏
The highlights of this week are:
- Stable Zero123
- MinD-3D turns brain waves into 3D objects 🧠
- W.A.L.T: a new photorealistic video generation method
- Upscale-A-Video: video upscaling with text prompts
- Peekaboo: bounding box guided video generation
- Customizing Motion can apply motion patterns from videos
- Improved temporal consistency with FreeInit
- DreaMoving: another Animate Anyone approach
- ASH: 3D human rendering in real time
- SO-SMPL: Generates disentangled human body and clothing meshes
- DiffusionLight creates HDR maps for images
- GMTalker can control facial expressions in videos
- SMERF can render large photorealistic NeRF scenes in real time
- and more tutorials, tools and gems!
Want me to keep up with AI for you? Well, that requires a lot of coffee. If you like what I do, please consider buying me a cup so I can stay awake and keep doing what I do 🙏
Cover Challenge 🎨
For next week's cover I'm looking for "nisse", Santa's little secret helpers. The reward is $50 and a rare role in our Discord community which lets you vote in the finals. The rulebook can be found here and images can be submitted here. I'm looking forward to your submissions 🙏
News & Papers
Stable Zero123: Quality 3D Object Generation from Single Images
Stability AI continues their model release streak. This week they shared the weights for Stable Zero123, a new image-to-3D model. The quality looks amazing, but there's no Colab or HuggingFace Space to give this a try yet. Will this model finally be able to bring me into 3D space?
W.A.L.T: Photorealistic Video Generation with Diffusion Models
Holy smokes, this one looks smooth. W.A.L.T is a method for photorealistic video generation with diffusion models. Unfortunately, as with everything from Google Research, this one will probably never be open-sourced 😢
Upscale-A-Video
Image upscalers have been all the rage lately, but what about video upscaling? Upscale-A-Video is able to take low-resolution videos and text prompts as input and then upscale the video to a higher resolution. The method also allows for texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality.
Peekaboo: Interactive Video Generation via Masked-Diffusion
Speaking of video, more research is being conducted on motion control. Peekaboo lets you control the position, size and trajectory of an object very precisely through bounding boxes.
Customizing Motion in Text-to-Video Diffusion Models
On the other hand, Customizing Motion can learn and generalize motion patterns from input videos and apply them to new and unseen contexts.
FreeInit: Bridging Initialization Gap in Video Diffusion Models
But what about more temporal consistency? FreeInit has us covered. It improves the temporal consistency of videos generated by diffusion models and methods like AnimateDiff. Best thing? It doesn't require any additional training and has already been open-sourced.
DreaMoving: A Human Video Generation Framework based on Diffusion Models
The Animate Anyone saga continues. DreaMoving is yet another approach to generating high-quality videos of humans given a text prompt and some pose guidance. In this case, a reference image is used to preserve facial identity.
ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
2D animations are cool, but ASH can basically do the same in 3D using Gaussian Splats. Given some photorealistic multi-view images of a human and skeletal pose guidance, the method can render humans in real time while preserving details and wrinkles in clothes. Wild.
SO-SMPL: Disentangled Clothed Avatar Generation from Text Descriptions
SO-SMPL is also about 3D avatars, but it's a bit different. This one focuses on generating high-quality, separate human body and clothing meshes from text prompts. These disentangled avatar representations achieve much more photorealistic animations compared to other methods.
MinD-3D: Reconstruct High-quality 3D objects in Human Brain
All of the above is cool, but get this: MinD-3D can reconstruct 3D objects from fMRI brain signals. It's not super high-fidelity yet, but if this isn't the future, I don't know what is.
DiffusionLight: Light Probes for Free by Painting a Chrome Ball
DiffusionLight can estimate the lighting in a single input image and convert it into an HDR environment map. The technique is able to generate multiple chrome balls with varying exposures for HDR merging and can be used to seamlessly insert 3D objects into an existing photograph. Pretty cool.
GMTalker: Gaussian Mixture based Emotional talking video Portraits
GMTalker can generate high-fidelity talking video portraits with audio-lip sync and control over facial expressions as well as gaze and eye blinks.
SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration
And last but not least, SMERF is able to render near-photorealistic NeRF scenes at interactive frame rates. The method handles large scenes with footprints of up to 300 m² at a volumetric resolution of 3.5 mm³, enables full six-degrees-of-freedom (6DOF) navigation within a web browser, and renders in real time on commodity smartphones and laptops. This will make previewing your next flat way more convenient, so check out one of the demos!
Also interesting
- SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds
- 3D Copy-Paste: Physically Plausible Object Insertion for Monocular 3D Detection
- Gaussian Splatting SLAM
- Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
- HeadArtist: Text-conditioned 3D Head Generation with Self Score Distillation
- Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras
What comes after Large Language Models? According to RunwayML, it’s General World Models: systems that understand the visual world and its dynamics.
AI Tube is an experimental platform by @flngr where all videos are generated from a single text prompt: open-source AI models generate the story, video snippets and audio waveforms, and everything is merged into a single video.
NUCA is a new AI-powered camera that redefines image creation. NUCA captures individuals in their purest form – no clothing, literally stripped down to their authentic selves in their natural state.
Tools & Tutorials
These are some of the most interesting resources I’ve come across this week.
PatchFusion is yet another image depth estimation model. This one works for high-resolution 4K images. HuggingFace demo.
InfEdit is a fast prompt-based image editing method that can change specific semantics of images without changing everything else.
@the_marconi shared the ComfyUI workflow behind his seasonal brand animations created with AnimateDiff.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa