AI Art Weekly #78
Hello there, my fellow dreamers, and welcome to issue #78 of AI Art Weekly! 👋
This week we had AR contact lenses, flirting with AI, and Microsoft building a Stargate. AI isn’t slowing down, my friends, and neither am I. I’ve gone through another round of 160+ papers for you so we can stay ahead of the curve.
In this issue:
- Audio: Stable Audio 2.0, SunoAI V3
- 3D: FlexiDreamer, StructLDM, Design2Cloth, MaGRITTe, CityGaussian, Feature Splatting, Freditor, GeneAvatar, GenN2N, ProbTalk
- Image: CosmicMan, ID2Reflectance, EdgeDepth, HairFastGAN, SPRIGHT-T2I, LCM-Lookahead, InstantStyle, DreamWalk
- Video: CameraCtrl, VIDIM, Motion Inversion, DSTA, EDTalk
- and more!
Want me to keep up with AI for you? Well, that requires a lot of coffee. If you like what I do, please consider buying me a cup so I can stay awake and keep doing what I do 🙏
Cover Challenge 🎨
For next week’s cover I’m looking for eclipse submissions! The reward is again $50 and a rare role in our Discord community which lets you vote in the finals. The rulebook can be found here and images can be submitted here.
News & Papers
Audio
Stable Audio 2.0
Stability AI released Stable Audio 2.0 this week. It can generate high-quality full tracks with coherent musical structure, up to three minutes long at 44.1kHz stereo, from a single natural language prompt. The new model also introduces audio-to-audio generation, letting you transform audio samples using text prompts. Pretty cool stuff.
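If you want to poke at this from code: the 2.0 model itself sits behind Stability’s platform, but their open-source stable-audio-tools library drives their diffusion checkpoints with the same prompt-plus-timing conditioning. Here’s a rough sketch based on the repo’s README — the checkpoint ID is an assumption, so point it at whichever Stable Audio checkpoint you actually have access to:

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint ID is an assumption -- swap in whichever Stable Audio
# checkpoint you have access to.
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to(device)

# Conditioning is a text prompt plus timing info for the requested clip.
conditioning = [{
    "prompt": "uplifting synthwave with a driving bassline",
    "seconds_start": 0,
    "seconds_total": 30,
}]

output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=model_config["sample_size"],
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# (batch, channels, samples) -> (channels, samples), normalize, save as WAV.
output = rearrange(output, "b d n -> d (b n)")
output = output / output.abs().max()
torchaudio.save("track.wav", (output * 32767).to(torch.int16).cpu(),
                model_config["sample_rate"])
```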
SunoAI V3
Similar to Stable Audio, Suno V3 lets you create two-minute tracks from a single text prompt, but it also supports vocals. I tried it this week and was blown away. Everybody can create their own theme song now. TÚLPA TÚLPA OOOOH OOOOH 🙌.
3D
FlexiDreamer: Single Image-to-3D Generation with FlexiCubes
FlexiDreamer is yet another single image-to-3D generation framework. Generation takes approximately one minute on a single NVIDIA A100 GPU.
StructLDM: Structured Latent Diffusion for 3D Human Generation
StructLDM can generate animatable, compositional 3D humans and supports blending different body parts, identity swapping, local clothing editing, 3D virtual try-on, and more. AI girlfriends/boyfriends are definitely gonna be a thing.
Design2Cloth: 3D Cloth Generation from 2D Masks
Design2Cloth, on the other hand, is a high-fidelity 3D generative model that can produce diverse and highly detailed clothes from a simple 2D cloth mask. It even supports interpolation between designs.
MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text
MaGRITTe can generate 3D scenes from a combination of an image, a top view (floor plans or terrain maps), and a text prompt. Would be super cool to create one of those low-poly game levels with this.
CityGaussian: Real-time High-quality Large-Scale Scene Rendering with Gaussians
Speaking of levels, CityGaussian can reconstruct and render large-scale 3D scenes in high quality and in real time using Gaussian splatting.
Feature Splatting: Language-Driven Physics-Based Scene Synthesis and Editing
And speaking of splats, Feature Splatting can manipulate both the appearance and the physical properties of objects in a 3D scene using text prompts.
Freditor: High-Fidelity and Transferable NeRF Editing by Frequency Decomposition
NeRFs aren’t dead yet. Freditor is a method that enables high-fidelity and transferable editing of NeRF scenes.
GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
GeneAvatar is a semantics-driven NeRF editing approach that can be used to edit the geometry and texture of 3D avatars using drag-style, text-prompt, and pattern-painting methods.
GenN2N: Generative NeRF2NeRF Translation
And because methods always come in pairs, GenN2N is another NeRF editing method. This one can edit scenes with text prompts as well as colorize, upscale, and inpaint them.
ProbTalk: Towards Variable and Coordinated Holistic Co-Speech Motion Generation
ProbTalk is a method for generating lifelike holistic co-speech motions for 3D avatars. It generates a wide range of motions and ensures a harmonious alignment among facial expressions, hand gestures, and body poses.
Image
CosmicMan: A Text-to-Image Foundation Model for Humans
CosmicMan is a new text-to-image foundation model specialized for generating high-fidelity human images.
ID2Reflectance: Monocular Identity-Conditioned Facial Reflectance Reconstruction
ID2Reflectance can generate high-quality facial reflectance maps from a single image.
EdgeDepth: Monocular Depth Estimation with Edge-aware Consistency Fusion
EdgeDepth is a new method for monocular depth estimation that relies solely on edge maps as input, which results in sharper, more detail-rich depth maps.
HairFastGAN: Realistic and Robust Hair Transfer with a Fast Encoder-Based Approach
Want to see what you’d look like with a new hairstyle? HairFastGAN can transfer hairstyles from a reference image to an input photo for virtual hair try-on.
SPRIGHT-T2I: Getting it Right
Following spatial instructions in text-to-image prompts is hard! SPRIGHT-T2I can finally do it though, resulting in more coherent and spatially accurate compositions.
LCM-Lookahead for Encoder-based Text-to-Image Personalization
LCM-Lookahead is another attempted LoRA killer with an LCM-based approach for identity transfer in text-to-image generations.
InstantStyle
InstantStyle is yet another text-to-image method that preserves the style of reference images without any additional fine-tuning. HOW MANY MORE?
DreamWalk: Style Space Exploration using Diffusion Guidance
DreamWalk is a new method that can apply different styles during image generation and interpolate between them. Pretty cool.
Video
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Camera control for text-to-video is here! CameraCtrl enables accurate camera pose conditioning, allowing precise control of camera angles and movements when generating videos.
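For the curious: per the paper, CameraCtrl represents each frame’s camera as per-pixel Plücker embeddings of its rays, which a small pose encoder turns into features for the video model. Here’s a minimal sketch of computing that embedding for a single pose — my own PyTorch illustration, not the authors’ code:

```python
import torch

def plucker_embedding(K, c2w, H, W):
    """Per-pixel Plücker ray embedding (o x d, d) for one camera pose.

    K:   (3, 3) intrinsics, c2w: (4, 4) camera-to-world matrix.
    Returns a (6, H, W) conditioning tensor for one frame.
    """
    dev = K.device
    v, u = torch.meshgrid(
        torch.arange(H, device=dev).float() + 0.5,
        torch.arange(W, device=dev).float() + 0.5,
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)      # (H, W, 3) pixel coords
    dirs = pix @ torch.linalg.inv(K).T                         # unproject to camera-space rays
    dirs = dirs @ c2w[:3, :3].T                                # rotate rays into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)                        # camera center in world space
    moment = torch.cross(origin, dirs, dim=-1)                 # Plücker moment o x d
    return torch.cat([moment, dirs], dim=-1).permute(2, 0, 1)  # (6, H, W)
```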
VIDIM: Video Interpolation With Diffusion Models
Good news: VIDIM is a generative model for video interpolation, which creates short videos given a start and end frame. Bad news: It’s from Google :(
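Since no code has been released, here’s a purely conceptual sketch of how endpoint-conditioned diffusion interpolation works: the in-between frames start as pure noise and get denoised step by step while the model is conditioned on the two clean endpoint frames. This is my own illustration, not VIDIM itself, and the denoiser signature is hypothetical:

```python
import torch

@torch.no_grad()
def interpolate(eps_model, start, end, n_mid=7, steps=50):
    """Denoise n_mid in-between frames conditioned on clean start/end frames.

    `eps_model(x, t, cond)` is a hypothetical noise-prediction network;
    start/end are clean (C, H, W) frames. Deterministic DDIM-style sampling.
    """
    x = torch.randn(n_mid, *start.shape)            # in-between frames: pure noise
    cond = torch.stack([start, end])                # endpoint conditioning

    # Cosine noise schedule; alpha_bar(t) for continuous t in (0, 1].
    alpha_bar = lambda t: torch.cos((t + 0.008) / 1.008 * torch.pi / 2) ** 2

    ts = torch.linspace(0.999, 0.0, steps + 1)      # stay below 1 so alpha_bar > 0
    for t, t_next in zip(ts[:-1], ts[1:]):
        a, a_next = alpha_bar(t), alpha_bar(t_next)
        eps = eps_model(x, t, cond)                 # predict the noise in x
        x0 = (x - (1 - a).sqrt() * eps) / a.sqrt()  # estimate of the clean frames
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # DDIM update
    return torch.cat([start[None], x, end[None]])   # full clip incl. endpoints
```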
Motion Inversion for Video Customization
Motion Inversion can be used to customize the motion of a video by transferring the motion from a different video.
DSTA: Video-Based Human Pose Regression via Decoupled Space-Time Aggregation
DSTA is a new method for video-based human pose estimation that directly maps the input to output joint coordinates.
EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
And last but not least, EDTalk can generate talking face videos with different mouth shapes, head poses, and expressions from a single image, and can also animate the face directly from audio.
Also interesting
- Sketch-to-Architecture: Generative AI-aided Architectural Design
- SOLE 🐾: Segment Any 3D Object with Language
The first-ever Sora-created music video has been released. Made by @guskamp.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa