AI Art Weekly #75

Hello there, my fellow dreamers, and welcome to issue #75 of AI Art Weekly! 👋

I got a heads-up from Devin this week that he’ll take my job soon. But before I start my career as a podcaster, I’m gonna finish up Shortie so I can conquer the next generation’s short attention span. In case that doesn’t work out, I’ll just keep doing this newsletter. The 3,500 of you might be an indicator that that’s a good idea. So thank you for that 🙏. And if I manage to get 1,000 of you to support me each month, I might not even have to worry about Devin 😲.

But enough of my sleazy self-promotion, let’s get to the good stuff:

  • Midjourney’s character consistency update
  • ELLA improves image prompt alignment using LLMs
  • DEADiff can do efficient style transfer
  • Follow-Your-Click and DragAnything can animate images with user input
  • VideoElevator enhances text-to-video models
  • ASVA animates images from audio
  • CRM generates 3D objects from a single image in 10 seconds
  • SplattingAvatar generates photorealistic real-time human avatars
  • StyleGaussian enables instant style transfer of any image’s style to a 3D scene
  • VLOGGER generates talking human videos from a single image and audio
  • and more!

Cover Challenge 🎨

Theme: sound of silence
81 submissions by 49 artists
🏆 1st: @Saudade_nft
🥈 2nd: @SandyDamb
🥉 3rd: @moon__theater
🧡 4th: @JonaiGallery

News & Papers

Midjourney Character Consistency + Style Reference v2

Midjourney released a character consistency feature this week, one of its most requested features. It allows you to add a character reference image with --cref <image url>, and the model will try to keep the character’s features consistent when generating an image. The --cw 0-100 parameter lets you define how much of the character should be preserved: low values only preserve facial features, while high values also preserve clothing and other details. All in all, a pretty cool update.

They also updated their style reference feature --sref <image url> to be more precise about understanding style and to avoid ‘leaking’ non-style elements into generated images.

Left character reference. Right generated image with --cw 14.
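
To give you an idea of how the two references combine, here’s a hypothetical prompt (the image URLs and prompt text are placeholders, not from Midjourney’s docs):

/imagine prompt: a knight wandering through a neon-lit night market --cref https://example.com/my-character.png --cw 30 --sref https://example.com/my-style.png

With a low --cw like 30, the face stays recognizable but the outfit can change to fit the scene; pushing it toward 100 should lock in clothing and other character details as well.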

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

ELLA is a lightweight approach that equips existing CLIP-based diffusion models with LLMs to improve prompt understanding and enable comprehension of long, dense prompts for text-to-image models.

ELLA comparison to SDXL and DALL-E 3

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

DEADiff is another style transfer method. It can control the level of stylization, supports style mixing and stylized reference object generation, and can be combined with ControlNet and LoRAs.

DEADiff comparison with IPAdapter

Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Follow-Your-Click can animate specific regions of an image from a simple user click and a short motion prompt, and it lets you control the speed of the animation.

Follow-Your-Click example

DragAnything: Motion Control for Anything using Entity Representation

DragAnything is another method that animates images from user input. It can control the motion of multiple objects simultaneously and independently by simply drawing a trajectory line for each.

DragAnything example

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

VideoElevator is a training-free, plug-and-play method that enhances the temporal consistency of text-to-video models and adds more photorealistic detail by leveraging text-to-image models.

ZeroScope comparison with and without SD 1.5. Prompt: Time lapse at the snow land with aurora in the sky.

ASVA: Audio-Synchronized Visual Animation

While image-to-video is cool, how about image+audio-to-video? ASVA can animate a static image from an audio clip while keeping frames and sound cues in sync. Keep in mind that this is only trained on SD 1.5. Can’t wait to see where we are with this in a year.

If you want to hear this lion shoot a gun, click here

CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Feels like we get an image-to-3D method each week now. CRM is yet another one that can generate 3D objects from a single image. This one is able to create high-fidelity textured meshes with interactable surfaces in just 10 seconds. Results are stunning!

CRM examples

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

Let’s talk Splats! SplattingAvatar can generate photorealistic real-time human avatars with Gaussian Splatting embedded on a triangle mesh. The technique is able to render avatars at over 300fps on modern GPUs and 30fps on mobile devices.

A grooving SplattingAvatar

StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting

StyleGaussian on the other hand enables instant style transfer of any image’s style to a 3D scene at 10fps while preserving strict multi-view consistency.

StyleGaussian example

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

With a single image of a person and some audio, VLOGGER can generate a talking human video of variable length. Like HeyGen2, the method also supports video translation, where an existing video is translated into a different language while the lip and face areas are automatically edited to be consistent with the new audio.

VLOGGER example. Left input image. Right animation based on additional audio source. Check out the project page for examples with audio.

Also interesting

  • SM⁴Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model
  • S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes
  • V3D: Video Diffusion Models are Effective 3D Generators
  • AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production
  • FontCLIP: A Semantic Typography Visual-Language Model for Multilingual Font Applications

“Visions of Lotus 🪷” by me

And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:

  • Sharing it 🙏❤️
  • Following me on Twitter: @dreamingtulpa
  • Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
  • Buying a physical art print to hang on your wall

Reply to this email if you have any feedback or ideas for this newsletter.

Thanks for reading and talk to you next week!

– dreamingtulpa