AI Art Weekly #99

Hello there, my fellow dreamers, and welcome to issue #99 of AI Art Weekly! 👋

Went through another 452 papers for you in the last two weeks and there are some pretty cool innovations, but the no-code issue keeps getting worse! Wanted to ship a feature for this today, but I’m attending a wedding 🤵‍♂️👰 in an hour so I’ll have to postpone that until next week 🤞

Also added a bunch of new high-quality Midjourney styles to Promptcache (now at 130+ styles) as well as a Prompt Generator (although it still needs some instructions on how to use it 😅, hint: type [).

Anyway, enjoy your weekend and talk to you next week!


Cover Challenge 🎨

Theme: first memory
9 submissions by 8 artists
🏆 1st: @beholdthe84
🥈 2nd: @PriestessOfDada
🥉 3rd: @onchainsherpa
🧡 4th: @ranetas

News & Papers

Highlights

We, Robot: Optimus, Robotaxi and Robovan

Today I woke up and thought I'd just been dropped into the I, Robot timeline. Yesterday, Tesla unveiled their vision for humanity's autonomous future:

  • Tesla Bot (Optimus): A humanoid robot for household chores and errands
  • Robotaxi: An autonomous vehicle for personal errands and commuting
  • Robovan: Autonomous transport for groups and goods

Granted, none of the above is available yet, and the robots are most likely teleoperated (for now), but it’s still a fascinating glimpse into the future, even though this could all go horribly wrong 🤖🔥

An homage to Westworld – Optimus Bartender serving drinks and wearing a cowboy hat

3D

Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis

Trans4D can generate realistic 4D scene transitions with expressive object deformation.

Trans4D example

AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

AvatarGO can generate 4D human-object interaction scenes from text. It uses LLM-guided contact retargeting for accurate spatial relations and ensures smooth animations with correspondence-aware motion optimization.

AvatarGO examples

UniMuMo: Unified Text, Music and Motion Generation

UniMuMo can generate outputs across text, music, and motion. It achieves this by aligning unpaired music and motion data based on rhythmic patterns.

UniMuMo example

EgoAllo: Estimating Body and Hand Motion in an Ego-sensed World

EgoAllo can estimate 3D human body pose, height, and hand parameters using images from a head-mounted device.

EgoAllo example

SynTalker: Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation

SynTalker can generate realistic full-body motions that match speech and text prompts. It allows precise control of movements, like talking while walking.

SynTalker example

DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control

DART can generate high-quality human motions in real time, achieving over 300 frames per second on a single RTX 4090 GPU. It combines text inputs with spatial constraints, allowing for tasks like reaching waypoints and interacting with scenes.

DART example

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

CLoSD can control characters in physics-based simulations using text prompts. It can navigate to goals, strike objects, and switch between sitting and standing, all guided by simple instructions.

CLoSD example

Dessie: Disentanglement for Articulated 3D Horse Shape and Pose Estimation from Images

Dessie can estimate the 3D shape and pose of horses from single images. It also works with other large animals like zebras and cows.

Dessie example

FabricDiffusion: High-Fidelity Texture Transfer for 3D Garments Generation from In-The-Wild Clothing Images

FabricDiffusion can transfer high-quality fabric textures from a 2D clothing image to 3D garments of any shape.

FabricDiffusion example

AniSDF: Fused-Granularity Neural Surfaces with Anisotropic Encoding for High-Fidelity 3D Reconstruction

AniSDF can reconstruct high-quality 3D shapes with improved surface geometry. It can handle complex, luminous, reflective, and fuzzy objects.

AniSDF examples

Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

Flex3D can generate high-quality 3D assets from single images or text prompts.

Flex3D examples

DressRecon: Freeform 4D Human Reconstruction from Monocular Video

DressRecon can reconstruct time-consistent 4D human body models from monocular videos. It handles loose clothing and handheld objects well, achieving high-quality results by combining general human shapes with video-specific movements.

DressRecon examples

EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

EdgeRunner can generate high-quality 3D meshes with up to 4,000 faces at a spatial resolution of 512 from images and point clouds.

EdgeRunner examples

Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

Disco4D can generate and animate 4D human models from a single image by separating clothing from the body. It uses diffusion models for detailed 3D representations and can model parts that are not visible in the input image.

Disco4D examples

Image

SEMat: Towards Natural Image Matting in the Wild via Real-Scenario Prior

SEMat can improve interactive image matting! It enhances network design and training to achieve better transparency, detail, and accuracy than methods like MAM and SmartMat.

SEMat examples

OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

OmniBooth can generate images with precise control over their layout and style. It allows users to customize images using masks and text or image guidance, making the process flexible and personal.

OmniBooth examples

FLUX Image Restoration: Learning Efficient and Effective Trajectories for Differential Equation-based Image Restoration

FLUX-IR can restore low-quality images to high-quality ones by optimizing paths through reinforcement learning.

FLUX Image Restoration example

ControlAR: Controllable Image Generation with Autoregressive Models

ControlAR adds controls like edges, depth maps, and segmentation masks to autoregressive models like LlamaGen.

ControlAR examples

DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation

DisEnvisioner can generate customized images from a single visual prompt and extra text instructions. It filters out irrelevant details and provides better image quality and speed without needing extra tuning.

DisEnvisioner examples

FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction

FreeEdit can edit images by adding, replacing, or removing objects without needing manual masks. It uses a special method called Decoupled Residual ReferAttention to improve detail from reference images.

FreeEdit example

Video

Pyramid Flow: Pyramidal Flow Matching for Efficient Video Generative Modeling

Pyramid Flow can generate high-quality 5- to 10-second videos at 768p resolution and 24 FPS. It uses a unified pyramidal flow matching algorithm to link flows across different stages, making video creation more efficient.

A side profile shot of a woman with fireworks exploding in the distance beyond her

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

PhysGen can generate realistic videos from a single image and user-defined conditions, like forces and torques. It combines physical simulation with video generation, allowing for precise control over dynamics.

PhysGen example

MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes

MimicTalk can generate personalized 3D talking faces in under 15 minutes. It mimics a person’s talking style using a special audio-to-motion model, resulting in high-quality videos.

MimicTalk examples

ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler

ViBiDSampler can generate high-quality frames between two keyframes using a bidirectional sampling strategy. It can create 25 frames at 1024x576 resolution in just 195 seconds on a single RTX 3090 GPU, making it a top choice for keyframe interpolation.

ViBiDSampler example

TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

TweedieMix can generate images and videos that combine multiple personalized concepts.

TweedieMix example

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher’s Guide

VideoGuide can improve the quality of videos made by text-to-video models without needing extra training. It enhances the smoothness of motion and clarity of images, making the videos more coherent and visually appealing.

VideoGuide example

TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation

TANGO can generate high-quality body-gesture videos that match speech audio from a single video. It improves realism and synchronization by fixing audio-motion misalignment and using a diffusion model for smooth transitions.

TANGO example

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

MonST3R can estimate 3D shapes from videos over time, creating a dynamic point cloud and tracking camera positions. This method improves video depth estimation and separates moving from still objects more effectively than previous techniques.

MonST3R example

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Loong can generate minute-long videos by treating text and video tokens as a single sequence.

Loong example

Inverse Painting: Reconstructing The Painting Process

Inverse Painting can generate time-lapse videos of the painting process from a target artwork. It uses a diffusion-based renderer to learn from real artists’ techniques, producing realistic results across different artistic styles.

Inverse Painting example

Stable Video Portraits

Stable Video Portraits can generate photorealistic videos of talking faces by using a text-to-image model and 3D Morphable Models (3DMM). It creates person-specific avatars that can be transformed into text-defined celebrities, producing smooth and high-quality videos without extra fine-tuning.

Stable Video Portraits examples

Audio

Presto!: Distilling Steps and Layers for Accelerating Music Generation

Presto! can generate 32 seconds of high-quality music in 230 ms, making it the fastest option for text-to-music generation.

Presto! example

Also interesting

“CLAIR OBSCUR” by me.

And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:

  • Sharing it 🙏❤️
  • Following me on Twitter: @dreamingtulpa
  • Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
  • Buying my Midjourney prompt collection on PROMPTCACHE 🚀
  • Buying a print of my art from my art shop. You can request any of my artworks to be printed; just reply to this email.

Reply to this email if you have any feedback or ideas for this newsletter.

Thanks for reading and talk to you next week!

– dreamingtulpa
