AI Art Weekly #96

Hello there, my fellow dreamers, and welcome to issue #96 of AI Art Weekly! 👋

Mini-Tulpa has arrived and is doing perfectly well, so I’m back to writing this newsletter for you 🫡

Unfortunately for me, AI research never stops, so taking a break from the weekly issues means I’ve missed a lot of exciting papers! Luckily for you, I’m good at catching up: I’ve sifted through the pile of 712 papers published over the last four weeks and compiled the 31 most interesting ones into one banger issue.

Enjoy!

In this issue:

  • Highlights: OpenAI’s new o1 model, Hailuo video model, Google’s GameNGen
  • 3D: GVHMR, MeshFormer, SpaRP, TransGS, Human-VDM, MagicMan, LayerPano3D, Subsurface Scattering
  • Image: One-DM, LinFusion, CSGO, Iterative Object Count Optimization, MagicFace, CrossViewDiff, SwiftBrush v2, MegaFusion
  • Video: ViewCrafter, Follow-Your-Canvas, tps-inbetween, TVG, PersonaTalk, PoseTalk, Loopy, DepthCrafter, Generative Inbetweening, CustomCrafter, TrackGo
  • Audio: Draw an Audio, Audio Match Cutting
  • and more!

Cover Challenge 🎨

Theme: picture-in-picture
37 submissions by 24 artists
AI Art Weekly Cover Art Challenge picture-in-picture submission by YedaiArt
🏆 1st: @YedaiArt
AI Art Weekly Cover Art Challenge picture-in-picture submission by NomadsVagabonds
🥈 2nd: @NomadsVagabonds
AI Art Weekly Cover Art Challenge picture-in-picture submission by SandyDamb
🥉 3rd: @SandyDamb
AI Art Weekly Cover Art Challenge picture-in-picture submission by VirginiaLori
🧡 4th: @VirginiaLori

News & Papers

Highlights

OpenAI’s new o1 model

OpenAI released their new o1 models, o1-preview and o1-mini, yesterday: two new large language models that significantly advance AI reasoning capabilities. These models use reinforcement learning to develop complex chains of thought before responding to queries.

According to their own benchmarks, they:

  • Rank in the 89th percentile on competitive programming questions
  • Place among the top 500 US students in the USA Math Olympiad qualifier
  • Exceed PhD-level accuracy on physics, biology, and chemistry problems
  • Improve performance with increased training and thinking time

Both o1-preview and o1-mini are available in ChatGPT for Plus users and to API customers on usage tier 5 (which you reach once you’ve spent $1k on their API).
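If you want to poke at them over the API, here’s a minimal sketch using the official openai Python SDK (the model names and launch-time restrictions are per OpenAI’s docs; treat the prompt and everything else as placeholder assumptions):

    # Minimal sketch: calling o1-preview via the official openai Python SDK.
    # Assumes your account is on usage tier 5 and OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="o1-preview",  # or "o1-mini" for the faster, cheaper variant
        messages=[
            # At launch, the o1 models accept only user/assistant messages
            # (no system prompt) and don't support sampling knobs like temperature.
            {"role": "user", "content": "How many r's are in 'strawberry'?"}
        ],
    )

    print(response.choices[0].message.content)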

o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.
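For those curious about the two metrics in that chart: pass@1 scores a single sampled answer per problem, while consensus takes the majority vote over many samples (64 here). A toy illustration of the difference (my own sketch, not OpenAI’s evaluation code):

    # Toy sketch of pass@1 vs. majority-vote consensus (not OpenAI's eval code).
    from collections import Counter

    def pass_at_1(sample: str, correct: str) -> bool:
        # Score a single sampled answer.
        return sample == correct

    def consensus(samples: list[str], correct: str) -> bool:
        # Majority vote: the answer that appears most often among the samples.
        most_common_answer, _ = Counter(samples).most_common(1)[0]
        return most_common_answer == correct

    # 64 sampled answers for one hypothetical problem:
    answers = ["42"] * 40 + ["41"] * 24
    print(pass_at_1(answers[0], "42"))  # True: this one sample was right
    print(consensus(answers, "42"))     # True: 40 of 64 samples agree on "42"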

Hailuo video model

MiniMax, a Chinese AI startup backed by Alibaba and Tencent, has released Hailuo AI, its text-to-video model competing with OpenAI’s Sora, Runway’s Gen-3 and LumaLabs’ DreamMachine.

Their model supports:

  • 6-second clips at 1280x720 resolution, 25 fps
  • Realistic human movements
  • English and Chinese prompt support

The clips I’ve seen so far show the highest coherence and dynamic range of any text-to-video model to date. But before you check them out, be aware that you need a phone number to sign up for their service.

Hailuo AI example

Google’s GameNGen

Google developed a neural model called GameNGen that can simulate the classic game DOOM in real time at over 20 frames per second. The model predicts each next frame with high quality, making it hard for people to tell the difference between real and simulated gameplay. Just crazy.

GameNGen demo

3D

World-Grounded Human Motion Recovery via Gravity-View Coordinates

GVHMR can recover human motion from monocular videos by estimating poses in a Gravity-View coordinate system aligned with gravity and the camera.

GVHMR examples

MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

MeshFormer can generate high-quality 3D textured meshes from just a few 2D images in seconds.

MeshFormer examples

SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

SpaRP can create 3D textured meshes and estimate camera poses from one or a few 2D images. It uses 2D diffusion models to quickly understand 3D space, achieving high-quality results in about 20 seconds.

SpaRP example

Instant Facial Gaussians Translator for Relightable and Interactable Facial Rendering

TransGS can instantly translate physically-based facial assets into a structured Gaussian representation for real-time rendering at 30fps@1440p on mobile devices.

TransGS example

Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models

Human-VDM can generate high-quality 3D human models from a single RGB image.

Human-VDM examples

MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

MagicMan can generate high-quality 3D images and normal maps of humans from a single photo.

MagicMan examples

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

LayerPano3D can generate immersive 3D scenes from a single text prompt by breaking a 2D panorama into depth layers.

LayerPano3D examples

Subsurface Scattering for 3D Gaussian Splatting

Subsurface Scattering for Gaussian Splatting can render and relight translucent objects in real time. It allows for detailed material editing and achieves high visual quality at around 150 FPS.

SSS examples

TEDRA: Text-based Editing of Dynamic and Photoreal Actors

TEDRA can edit dynamic 3D avatars based on text prompts. It allows detailed changes to clothing styles while ensuring high quality and smooth movement using a personalized diffusion model.

TEDRA examples

Image

One-Shot Diffusion Mimicker for Handwritten Text Generation

One-DM can generate handwritten text from a single reference sample, mimicking the style of the input. It captures unique writing patterns and works well across multiple languages.

One-DM examples

LinFusion: 1 GPU, 1 Minute, 16K Image

LinFusion can generate high-resolution images up to 16K in just one minute using a single GPU. It improves performance on various Stable Diffusion versions and works with pre-trained components like ControlNet and IP-Adapter.

A 16384x8192-resolution example in the theme of Black Myth: Wukong generated by LinFusion.

CSGO: Content-Style Composition in Text-to-Image Generation

CSGO can perform image-driven style transfer and text-driven stylized synthesis. It uses a large dataset with 210k image triplets to improve style control in image generation.

CSGO examples

Iterative Object Count Optimization for Text-to-image Diffusion Models

Iterative Object Count Optimization can improve object counting accuracy in text-to-image diffusion models.

Iterative Object Count Optimization example

MagicFace: Training-free Universal-Style Human Image Customized Synthesis

MagicFace can generate high-quality images of people in any style without needing extra training.

MagicFace examples

CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

CrossViewDiff can generate high-quality street-view images from satellite-view images using a cross-view diffusion model.

CrossViewDiff architecture

SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher

SwiftBrush v2 can improve the quality of images generated by one-step text-to-image diffusion models. Results look great, and it reportedly outperforms all GAN-based and multi-step Stable Diffusion models in benchmarks. No code though 🤷‍♂️

SwiftBrush v2 examples

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

MegaFusion can extend existing diffusion models for high-resolution image generation. It achieves images up to 2048x2048 with only 40% of the original computational cost by enhancing denoising processes across different resolutions.

MegaFusion examples

Video

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

ViewCrafter can generate high-quality 3D views from single or few images using a video diffusion model. It allows for precise camera control and is useful for real-time rendering and turning text into 3D scenes.

ViewCrafter examples

Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation

Follow-Your-Canvas can outpaint videos at higher resolutions, from 512x512 to 1152x2048.

Follow-Your-Canvas examples

Thin-Plate Spline-based Interpolation for Animation Line Inbetweening

tps-inbetween can generate high-quality intermediate frames for animation line art. It effectively connects lines and fills in missing details, even during fast movements, using a method that models keypoint relationships between frames.

tps-inbetween example

TVG: A Training-free Transition Video Generation Method with Diffusion Models

TVG can create smooth transition videos between two images without needing training. It uses diffusion models and Gaussian Process Regression for high-quality results and adds controls for better timing.

TVG examples
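The interesting twist here is using Gaussian Process Regression over the two endpoint latents to get smooth in-betweens. A heavily simplified toy sketch of that idea using scikit-learn (the stand-in latents, dimensions, and kernel choice are my assumptions, not TVG’s actual implementation):

    # Toy sketch of GPR-based transition latents (not TVG's actual code).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    latent_a = rng.standard_normal(16)  # stand-in for the first image's latent
    latent_b = rng.standard_normal(16)  # stand-in for the second image's latent

    X = np.array([[0.0], [1.0]])        # endpoint timestamps
    Y = np.stack([latent_a, latent_b])  # one latent row per timestamp

    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X, Y)

    # Smoothly interpolated latents for 8 frames; in the real method each
    # row would be refined by the diffusion model and decoded to a frame.
    t = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
    inbetween = gpr.predict(t)
    print(inbetween.shape)  # (8, 16)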

PersonaTalk: Bring Attention to Your Persona in Visual Dubbing

PersonaTalk can achieve high-quality visual dubbing while preserving the speaker’s unique style and facial details. It works with audio, a guidance video, or dubbing into another language.

PersonaTalk example

PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

PoseTalk can generate lip-synchronized talking head videos from a single image, audio, and text prompts. It allows for free head poses and uses a Pose Latent Diffusion model to create diverse poses.

PoseTalk example and comparison with other methods. Check the project page for examples with audio.

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Loopy can generate lifelike video portraits from audio input. It captures non-speech movements and emotions without needing motion templates, resulting in high-quality outputs.

Loopy example. Check the project page for examples with audio.

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

DepthCrafter can generate long, high-quality depth map sequences for videos. It uses a three-stage training method with a pre-trained image-to-video diffusion model, achieving top performance in depth estimation for visual effects and video generation.

DepthCrafter example

Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

Generative Inbetweening can create smooth video sequences between two keyframes.

Generative Inbetweening example

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

CustomCrafter can generate high-quality videos from text prompts and reference images. It improves motion generation with a Dynamic Weighted Video Sampling Strategy and allows for better concept combinations without needing extra video or fine-tuning.

CustomCrafter comparison

TrackGo: A Flexible and Efficient Method for Controllable Video Generation

TrackGo can generate controllable videos by letting users move objects with free-form masks and arrows.

TrackGo example

Audio

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Draw an Audio can generate high-quality audio that matches video by using drawn masks and loudness signals.

Draw an Audio example. Check the project page for this clip. Quite funny 😄

Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos

Audio Match Cutting can automatically find and create smooth audio transitions between video shots.

Audio Match Cutting example 😂

Also interesting

“Moon Girl” by me.

And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:

  • Sharing it 🙏❤️
  • Following me on Twitter: @dreamingtulpa
  • Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
  • Buying my Midjourney prompt collection on PROMPTCACHE 🚀

Reply to this email if you have any feedback or ideas for this newsletter.

Thanks for reading and talk to you next week!

– dreamingtulpa