AI Art Weekly #56

Hello there, my fellow dreamers, and welcome to issue #56 of AI Art Weekly! 👋

Two major developments from the generative AI art front this week. Apple has released research on Matryoshka Diffusion Models, their take on text-to-image models. And Latent Consistency Models might be the next evolution of diffusion models, generating images much faster in just 1–4 steps. Let’s jump in:

  • Latent Consistency Models – faster text-to-image!
  • Apple’s Matryoshka Diffusion Models
  • FreeNoise can create text-to-video with up to 512 frames
  • DreamCraft3D: high-quality 3D generation
  • Wonder3D is another image-to-3D method
  • Zero123++ can generate multi-view images from a single input image
  • E4S is a new method for fine-grained face swapping
  • and more tutorials, tools and gems!

Cover Challenge 🎨

Theme: black and white
171 submissions by 112 artists
🏆 1st: @cubes811_nft
🥈 2nd: @TheVaticinator
🥉 3rd: @wolftribe8
🧡 4th: @RachelSTWood

News & Papers

Latent Consistency Models: Synthesizing High-Resolution Images with Few-step Inference

There is a new category of generative models emerging, called Latent Consistency Models (LCMs). These models can be distilled from pre-trained Stable Diffusion models and are able to generate high-quality 768x768 images in only one to four steps, significantly accelerating text-to-image generation. For comparison, traditional diffusion models require 20–50 steps. Early signs suggest that, with some further optimizations, this could bring image generation down to around 100ms on powerful GPUs.

4-step LCM examples
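If you want to try few-step inference yourself, here’s a minimal sketch using the Hugging Face diffusers library and the community LCM checkpoint SimianLuo/LCM_Dreamshaper_v7. The checkpoint name, pipeline behavior and exact arguments are assumptions based on the current diffusers release, so double-check the docs before running:

```python
# Minimal LCM text-to-image sketch.
# Assumes diffusers >= 0.22 (native LCM pipeline support) and the community
# checkpoint "SimianLuo/LCM_Dreamshaper_v7" -- verify both before running.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# LCMs only need a handful of denoising steps instead of the usual 20-50.
image = pipe(
    prompt="a dreamy black and white portrait, film grain, soft light",
    num_inference_steps=4,   # 1-4 steps is the sweet spot for LCMs
    guidance_scale=8.0,      # LCMs use a guidance embedding rather than classic CFG
    height=768,
    width=768,
).images[0]

image.save("lcm_sample.png")
```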

Matryoshka Diffusion Models

Apple is getting into the generative AI game. Matryoshka Diffusion Models (MDM) are their latest research on generating high-quality text-to-image and text-to-video results with a multi-resolution diffusion model that can produce outputs at resolutions of up to 1024x1024 pixels. Compared to Stable Diffusion or Google’s Imagen, MDM doesn’t require a pre-trained VAE or any additional upscaling modules and can be trained much more efficiently. The code isn’t available yet, but will apparently be released soon.

Matryoshka Diffusion Model Video examples

FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

FreeNoise is a new method that can generate longer videos with up to 512 frames from multiple text prompts. That’s about 21 seconds for a 24fps video. The method doesn’t require any additional fine-tuning on the video diffusion model and only takes about 20% more time compared to the original diffusion process.

Input: A bigfoot walking in the snowstorm; Resolution: 1024 x 576; Frames: 64.

DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior

3D generations are getting more sophisticated by the week. DreamCraft3D can create high-quality 3D objects from a single prompt. It uses a 2D reference image to guide the sculpting of the 3D object and then improves texture fidelity by running it through a fine-tuned Dreambooth model.

Humoristic san goku body mixed with wild boar head running, amazing high tech fitness room digital illustration generated by DreamCraft3D

Wonder3D: Single Image to 3D using Cross-Domain Diffusion

Wonder3D is yet another image-to-3D method. This one is able to convert a single image into a high-fidelity 3D model, complete with textured meshes and color. The entire process takes only 2 to 3 minutes.

Wonder3D examples

Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

Zero123++ is a new model that can generate multi-view images from a single input image. This gave me another opportunity to test how my avatar might look from another angle. Still not impressed…

Zero123++ example

E4S: Fine-Grained Face Swapping via Regional GAN Inversion

E4S is a new method for fine-grained face swapping. It’s able to swap faces in images and videos, while preserving the source identity, texture, shape, and lighting of the original footage.

E4S example

More papers & gems

  • MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion
  • Relit-NeuLF: Efficient Novel View Synthesis with Neural 4D Light Field
  • PERF: Panoramic Neural Radiance Field from a Single Panorama
  • DreamSpace: Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation

Tools & Tutorials

These are some of the most interesting resources I’ve come across this week.

Cosmic Reflections by me

And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:

  • Sharing it 🙏❤️
  • Following me on Twitter: @dreamingtulpa
  • Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
  • Buying a physical art print to hang on your wall

Reply to this email if you have any feedback or ideas for this newsletter.

Thanks for reading and talk to you next week!

– dreamingtulpa
