AI Art Weekly #73
Hello there, my fellow dreamers, and welcome to issue #73 of AI Art Weekly! 👋
AI research seems to be slowly gaining momentum again. It’s the first week of the year in which I’ve skimmed through 100+ papers again. Aside from skimming papers, I’ve made good progress on Shortie this week and shared a little sneak peek on X. I learned a lot while building it and have tons of new ideas in the back of my head, just waiting to get out. But for now, let’s take a look at this week’s highlights:
- EMO turns single images into expressive lip-synced videos
- Google’s Genie can turn images into platformer games
- Multi-LoRA optimizes image generation with, well, multiple LoRAs
- While DiffuseKronA tries to avoid using LoRAs
- TCD is a better LCM
- VastGaussian reconstructs large scenes as 3D Gaussians
- GEA reconstructs expressive 3D avatars from monocular video
- GEM3D is another text-to-3D model with a different approach
- LayoutLearning generates 3D scenes composed of multiple objects
- OHTA generates hand avatars from single images
- SongComposer and ChatMusician are LLMs that generate music
- and more!
Want me to keep up with AI for you? Well, that requires a lot of coffee. If you like what I do, please consider buying me a cup so I can stay awake and keep doing what I do 🙏
Cover Challenge 🎨
For next week’s cover I’m looking for invisibility-inspired submissions! The reward is again $50 and a rare role in our Discord community which lets you vote in the finals. The rulebook can be found here and images can be submitted here.
News & Papers
Emote Portrait Alive
Lip-syncing just got the Sora treatment. Alibaba’s EMO can turn any image of a face into a video expressing the words and emotions from any audio file (singing or talking).
Genie: Generative Interactive Environments
Another mind-blowing paper comes from Google this week. They presented Genie, a foundation world model trained on internet videos that can generate an endless variety of playable worlds from synthetic images, photographs, and even sketches. This opens the door to a variety of new ways to generate and step into virtual worlds. Quite wild if you think about it!
Multi-LoRA Composition for Image Generation
Multi-LoRA Composition focuses on the integration of multiple Low-Rank Adaptations (LoRAs) to create highly customized and detailed images. The approach is able to generate images with multiple elements without fine-tuning and without losing detail or image quality.
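To build intuition for what the paper improves on: the naive baseline for combining LoRAs is to merge their low-rank weight updates directly into the base weights. A minimal sketch of that baseline (not the paper’s decoding-time method; the shapes, names, and NumPy stand-in for real model weights are my own illustration):

```python
import numpy as np

def merge_loras(W, loras, scales):
    """Naive multi-LoRA merge: W' = W + sum_i s_i * (B_i @ A_i).
    Each LoRA is a low-rank pair (A, B) with A: (r, d), B: (d, r)."""
    W_merged = W.copy()
    for (A, B), s in zip(loras, scales):
        W_merged += s * (B @ A)  # each update has rank at most r
    return W_merged

rng = np.random.default_rng(0)
d, r = 8, 2  # toy dimensions; real diffusion layers are much larger
W = rng.normal(size=(d, d))
lora1 = (rng.normal(size=(r, d)), rng.normal(size=(d, r)))
lora2 = (rng.normal(size=(r, d)), rng.normal(size=(d, r)))

W_merged = merge_loras(W, [lora1, lora2], scales=[0.7, 0.5])
```

Merging like this is exactly where detail loss can creep in when many LoRAs pile onto the same weights, which is why composing them at generation time without touching the weights is attractive.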
DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Model
On the other hand, DiffuseKronA is another method that tries to avoid LoRAs and personalizes diffusion models directly from input images. It generates high-quality images with accurate text-image correspondence and improved color distribution from diverse and complex input images and prompts.
TCD: Trajectory Consistency Distillation
While LCM and Turbo have unlocked near real-time image diffusion, the quality is still a bit lacking. TCD on the other hand manages to generate images with both clarity and detailed intricacy without compromising on speed.
VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction
VastGaussian is a new 3D Gaussian Splatting method for high-quality reconstruction and real-time rendering of large scenes.
GEA: Reconstructing Expressive 3D Gaussian Avatar from Monocular Video
And speaking about Gaussian Splats, GEA is a method that can create expressive 3D avatars with high-fidelity reconstructions of body and hands from a single video based on 3D Gaussians.
GEM3D: Generative Medial Abstractions for 3D Shape Synthesis
GEM3D is a new deep, topology-aware generative model of 3D shapes. The method is able to generate diverse and plausible 3D shapes from user-modeled skeletons, making it possible to draw the rough structure of an object and have the model fill in the rest.
Layout Learning
LayoutLearning generates 3D scenes from text that are automatically decomposed into objects. This means that given a prompt like “a chef rat on a tiny stool cooking a stew”, the model will generate a 3D scene with a chef rat, a tiny stool, and a stew as separate objects.
OHTA: One-shot Hand Avatar via Data-driven Implicit Priors
While diffusion models still struggle with generating accurate human hands, OHTA is able to create high-fidelity and drivable hand avatars from just a single image. The hands can also be generated using only text and allow for hand texture and geometry editing.
SongComposer and ChatMusician
We all know by now that LLMs are great at solving all sorts of different tasks. Music wasn’t one of them, until now. SongComposer and ChatMusician are two LLMs that are trained on composing music through symbolic or ABC notations. While SongComposer focuses on generating vocals, ChatMusician generates ABC notations that can be used with existing music tools.
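For readers unfamiliar with ABC notation, the symbolic text format ChatMusician emits, here’s a minimal hand-written example (a simple C-major phrase of my own, not output from either model): `X` is the tune index, `T` the title, `M` the meter, `L` the default note length, and `K` the key, followed by the notes themselves.

```
X:1
T:Example tune
M:4/4
L:1/8
K:C
CDEF GABc | c2 G2 E2 C2 |
```

Because it’s plain text, an LLM can generate it token by token like any other output, and existing tools such as abcjs or abc2midi can then render it as a score or play it back.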
Also interesting
- ViewFusion: Towards Multi-View Consistency via Interpolated Denoising
- FlowMDM: Seamless Human Motion Composition with Blended Positional Encodings
- OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
Remember the guy who fakes his life on Instagram? He did a TEDx talk, worth a watch.
It’s probably still a long way until OpenAI is going to release Sora, but they surely aren’t gonna stop teasing us about it.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa