Hello there, my fellow dreamers, and welcome to issue #53 of AI Art Weekly! 👋
I have another chock-full issue for you this week and an exciting surprise for this week’s cover challenge (more below). Let’s dive right into this week’s highlights:
- DALL·E 3 and GPT-4V available for free on Bing
- DREAM preprocesses your brainwaves into depth maps
- Ground-A-Video enables zero-shot video editing
- LLM-grounded Video Diffusion Models
- HumanNorm generates realistic 3D humans
- PIXART-α: Training a foundation Text-to-Image model for a fraction of the cost
- DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animations
- Image restoration with DA-CLIP
- Text-To-GIF model Hotshot XL
- Interview with “Strange History” curator Historic Crypto
- and more tutorials, tools and gems!
Cover Challenge 🎨
Looking Glass is sponsoring this week’s challenge, and the winner will receive a Looking Glass Portrait (in addition to the usual $50). I bought one for myself last year and they’re super cool. If you want to get one for yourself regardless of the challenge, use my affiliate link to get 10% off.
News & Papers
DALL·E 3 and GPT-4V available for free on Bing
Microsoft quietly rolled out DALL·E 3 and GPT-4V into Bing last week, and it’s available for free (for now). DALL·E 3 can also be used separately through Bing Creator, so naturally I had to give it a try, and I’m positively surprised by its ability to understand natural language and generate readable text. While image quality isn’t comparable with Midjourney, its output is rawer and less refined, which I appreciate and prefer to Midjourney’s nowadays extremely clean results.
DREAM: Visual Decoding from REversing HumAn Visual SysteM
We’re getting closer to visualizing our dreams. DREAM is an fMRI-to-image method for reconstructing viewed images from brain activity. It’s basically a preprocessor that converts your brainwaves into semantics, color, and depth maps for ControlNet.
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
One key feature that I feel is missing from open-source AI video is a good video-to-video option that enables video editing similar to Gen-1. Ground-A-Video is the latest addition to that family. The method allows you to edit multiple attributes of a video via Stable Diffusion and spatially-continuous & -discrete conditions, without any training. Unfortunately, as with most methods in this category, there is no actual source code to use it 😒
LLM-grounded Video Diffusion Models
LLM-grounded Video Diffusion Models (LVD) is a new method that improves text-to-video generation by using a large language model to generate dynamic scene layouts from text and then guiding video diffusion models with these layouts, achieving realistic video generation that aligns with complex input prompts. Unfortunately, there is no actual video demo yet, so we’ll have to wait to see what final results actually look like.
HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
HumanNorm is a novel approach for high-quality and realistic 3D human generation that leverages normal maps, which enhance the 2D perception of 3D geometry. The results are quite impressive and comparable with PS3 games.
PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
PIXART-α is a new text-to-image model that can generate images at resolutions of up to 1024px and required only roughly 10% of the training time of Stable Diffusion 1.5 (~675 vs ~6,250 A100 GPU days). This is obviously much cheaper as well ($26k compared to $320k). The model also offers a high level of control and can be combined with DreamBooth to generate images of concepts that weren’t included in the original training.
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models
DiffPoseTalk is a new method for generating stylistic 3D facial animations driven by speech and head pose. The method is based on diffusion models and a style encoder that extracts style embeddings from short reference videos. The results look pretty good and the method outperforms existing ones like SadTalker. No code yet unfortunately.
DA-CLIP: Controlling Vision-Language Models for Universal Image Restoration
DA-CLIP is a new method that can be used to restore images. Apart from inpainting, the method is able to restore images by dehazing, deblurring, denoising, deraining and desnowing them, as well as removing unwanted shadows and raindrops or enhancing lighting in low-light images.
More papers & gems
- TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields
- HGHOI: Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models
- LEAP: Liberate Sparse-view 3D Modeling from Camera Poses
- Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models
- SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D
- GETAvatar: Generative Textured Meshes for Animatable Human Avatars
In this latest AI Art Weekly interview I’m talking to @Historic_Crypto, the founder behind “Strange History”, a phenomenon that started as a curated collection, turned into a collective that couldn’t be less bothered about the current market, and is constantly churning out new creative ways to re-imagine the past with the help of AI. Enjoy!
Tools & Tutorials
These are some of the most interesting resources I’ve come across this week.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!