AI Art Weekly #79

Hello there, my fellow dreamers, and welcome to issue #79 of AI Art Weekly! 👋

It’s 10:15pm here as I write these lines. I went through another 140+ papers for you this week and found some really cool stuff, including two adorable little robot soccer players. It’s late, so I’ll keep this intro short.

In this issue:

  • 3D: InstantMesh, InstructHumans, ZeST, MCC-HO, Key2Mesh, SphereHead, TeFF
  • Physics: PhysAvatar, NeRF2Physics
  • Image: BeyondScene, MuDI, Imagine Colorization, GoodDrag, ControlNet++, PanFusion, MindBridge
  • Video: SpaTracker, SGM-VFI
  • and more!

Cover Challenge 🎨

Theme: eclipse
132 submissions by 88 artists
AI Art Weekly Cover Art Challenge eclipse submission by koldo2k
🏆 1st: @koldo2k
AI Art Weekly Cover Art Challenge eclipse submission by Artificial_KoS
🥈 2nd: @Artificial_KoS
AI Art Weekly Cover Art Challenge eclipse submission by HappyDoji
🥉 3rd: @HappyDoji
AI Art Weekly Cover Art Challenge eclipse submission by junkpile10
🧡 4th: @junkpile10

News & Papers


InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

Let’s start again with 3D! InstantMesh can create diverse 3D assets within 10 seconds from a single image.

InstantMesh examples


InstructHumans: Editing Animated 3D Human Textures with Instructions

InstructHumans can edit existing 3D human textures using text prompts. It maintains avatar consistency pretty well and enables easy animation.

Dancing 3D avatars edited with InstructHumans

ZeST: Zero-Shot Material Transfer from a Single Image

ZeST can change the material of an object in an image to match a material example image. It can also perform multiple material edits in a single image and perform implicit lighting-aware edits on the rendering of a textured mesh.

ZeST examples

MCC-HO: Reconstructing Hand-Held Objects in 3D

MCC-HO can reconstruct 3D objects from a single RGB image and an estimated 3D hand. Why might this be useful? Think VR/AR. Tech like this will make it possible to create a digital twin of objects you are holding in your hands so you and others can interact with them in a virtual environment.

MCC-HO reconstructing hand-held objects from a single image

Key2Mesh: MoCap-to-Visual Domain Adaptation for Efficient Human Mesh Estimation from 2D Keypoints

Speaking of reconstruction: Key2Mesh is yet another model that takes on 3D human mesh reconstruction, this time using 2D human pose keypoints as input instead of visual data, sidestepping the scarcity of image datasets with 3D labels.

Rocky Balboa Key2Mesh example

SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation

GANs aren’t dead yet. SphereHead generates stable, high-quality 3D human heads viewable from all angles, with significantly fewer artifacts than previous methods. Best one I’ve seen so far.

SphereHead examples

TeFF: Learning 3D-Aware GANs from Unposed Images with Template Feature Field

TeFF is a similar method to SphereHead, but it supports more than just human faces and can reconstruct a 3D object, viewable from 360 degrees, from a single image.

3D elephants reconstructed with TeFF


PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations

PhysAvatar can turn multi-view videos into high-quality 3D avatars with loose-fitting clothes. The whole thing can be animated and generalizes well to unseen motions and lighting conditions.

PhysAvatars dancing to novel motions

NeRF2Physics: Physical Property Understanding from Language-Embedded Feature Fields

NeRF2Physics can predict the physical properties (mass, friction, hardness, thermal conductivity and Young’s modulus) of objects from a collection of images. This makes it possible to simulate the physical behavior of digital twins in a 3D scene.

Predicting physical properties of objects with NeRF2Physics


BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

BeyondScene can generate human-centric scenes with a resolution of up to 8K with exceptional text-image correspondence and naturalness using existing pretrained diffusion models.

8K BeyondScene example

MuDI: Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models

We’ve seen a gazillion text-to-image personalization methods already. MuDI is yet another one, but it supports multi-subject personalization. This means you can generate images of multiple subjects without mixing up their identities.

The otter and that other thing from the Sora videos having a good time together

Imagine Colorization

Imagine Colorization leverages pre-trained diffusion models to colorize images, with support for controllable, user-interactive editing.

Imagine Colorization examples

GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models

We’ve seen image editing by dragging before. GoodDrag improves the stability and image quality of drag editing with diffusion models.

GoodDrag example

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

ByteDance is working on ControlNet++. It claims to improve controllable image generation by explicitly optimizing pixel-level cycle consistency between generated images and their conditional controls, bringing improvements across condition types such as segmentation masks, line-art edges, depth maps, HED edges, and Canny edges.

ControlNet++ Canny edge examples

PanFusion: Taming Stable Diffusion for Text to 360° Panorama Image Generation

PanFusion can generate 360-degree panorama images from a text prompt. The model is able to integrate additional constraints like room layout for customized panorama outputs.

PanFusion examples

MindBridge: A Cross-Subject Brain Decoding Framework

In the Minority Report department, we have MindBridge this week. It’s another method that reconstructs images from fMRI signals, and it can generalize to multiple subjects with a single model.

MindBridge examples from different subjects


SpaTracker: Tracking Any 2D Pixels in 3D Space

Until now I’ve only seen pixel trackers working on the 2D plane. SpaTracker can track any 2D pixel in 3D space, which allows for better handling of occlusions and out-of-plane rotations.

Tracking 2D pixels in 3D space with SpaTracker

SGM-VFI: Sparse Global Matching for Video Frame Interpolation with Large Motion

And last but not least, SGM-VFI is a new video frame interpolation method that is able to handle large motion in videos. The method uses sparse global matching to introduce global information into the estimated intermediate frames, resulting in more accurate and detailed output.

SGM-VFI comparison with other methods

Also interesting

  • Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior
  • UMBRAE: Unified Multimodal Decoding of Brain Signals
  • UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion
  • GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh

“Fragmentations I” by me.

And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:

  • Sharing it 🙏❤️
  • Following me on Twitter: @dreamingtulpa
  • Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
  • Buying a physical art print to hang on your wall

Reply to this email if you have any feedback or ideas for this newsletter.

Thanks for reading and talk to you next week!

– dreamingtulpa

by @dreamingtulpa