AI Art Weekly #96
Hello there, my fellow dreamers, and welcome to issue #96 of AI Art Weekly! 👋
Mini-Tulpa has arrived and is doing perfectly well, so I’m back to writing this newsletter for you 🫡
Unfortunately for me, AI research never stops, so taking a break from the weekly issues means I’ve missed a lot of exciting papers! Luckily for you, I’m good at catching up: I’ve sifted through the 712 papers published over the last four weeks and compiled the 31 most interesting ones into one banger issue.
Enjoy!
In this issue:
- Highlights: OpenAI’s new o1 model, Hailuo video model, Google’s GameNGen
- 3D: GVHMR, MeshFormer, SpaRP, TransGS, Human-VDM, MagicMan, LayerPano3D, Subsurface Scattering
- Image: One-DM, LinFusion, CSGO, Iterative Object Count Optimization, MagicFace, CrossViewDiff, SwiftBrush v2, MegaFusion
- Video: ViewCrafter, Follow-Your-Canvas, tps-inbetween, TVG, PersonaTalk, PoseTalk, Loopy, DepthCrafter, Generative Inbetweening, CustomCrafter, TrackGo
- Audio: Draw an Audio, Audio Match Cutting
- and more!
Unlock the full potential of AI-generated art with my curated collection of Midjourney SREF codes and prompts. Use code AIARTWEEKLY at checkout to get $10 off!
Cover Challenge 🎨
For the next cover I’m looking for submissions within submissions! The reward is again fame & glory, plus a rare role in our Discord community that lets you vote in the finals. The rulebook can be found here and images can be submitted here.
News & Papers
Highlights
OpenAI’s new o1 model
OpenAI released their new o1 models, o1-preview and o1-mini, yesterday: two new large language models that significantly advance AI reasoning capabilities. These models use reinforcement learning to develop complex chains of thought before responding to queries.
According to their own benchmarks, they:
- Rank in the 89th percentile on competitive programming questions
- Place among the top 500 US students in a qualifier for the USA Math Olympiad
- Exceed PhD-level accuracy on physics, biology, and chemistry problems
- Improve performance with increased training and thinking time
Both the early o1-preview and the smaller o1-mini are available in ChatGPT for Plus users and to API customers on Usage Tier 5 (which you reach once you’ve spent $1k on their API).
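If you have API access, a minimal sketch of a call with the official Python SDK could look like this (the prompt and environment setup are my own illustrative assumptions, not from OpenAI’s announcement):

```python
# Minimal sketch of calling o1-preview via the official OpenAI Python SDK.
# Assumes you're on Usage Tier 5 and have OPENAI_API_KEY set in your environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# At launch, the o1 models reason internally before answering and don't
# accept system messages or sampling parameters like temperature,
# so a single user message is enough.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "Prove that the sum of two odd integers is even."}
    ],
)

print(response.choices[0].message.content)
```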
Hailuo video model
MiniMax, a Chinese AI startup backed by Alibaba and Tencent, has released Hailuo AI, its text-to-video model competing with OpenAI’s Sora, Runway’s Gen-3 and LumaLabs’ DreamMachine.
Their model supports:
- 6-second clips at 1280x720 resolution, 25 fps
- Realistic human movements
- English and Chinese prompt support
The clips I’ve seen so far show the highest coherence and dynamic range of any text-to-video model to date. But before you check it out, be aware that you need a phone number to sign up for their service.
Google’s GameNGen
Google developed a neural model called GameNGen that can simulate the classic game DOOM in real time at over 20 frames per second. The model predicts the next frame with such high quality that it’s hard for people to tell the difference between real and simulated gameplay. Just crazy.
3D
World-Grounded Human Motion Recovery via Gravity-View Coordinates
GVHMR can recover human motion from monocular videos by estimating poses in a Gravity-View coordinate system aligned with gravity and the camera.
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model
MeshFormer can generate high-quality 3D textured meshes from just a few 2D images in seconds.
SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views
SpaRP can create 3D textured meshes and estimate camera poses from one or a few 2D images. It uses 2D diffusion models to quickly understand 3D space, achieving high-quality results in about 20 seconds.
Instant Facial Gaussians Translator for Relightable and Interactable Facial Rendering
TransGS can instantly translate physically-based facial assets into a structured Gaussian representation for real-time rendering at 1440p and 30 fps on mobile devices.
Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models
Human-VDM can generate high-quality 3D human models from a single RGB image.
MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement
MagicMan can generate high-quality 3D images and normal maps of humans from a single photo.
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation
LayerPano3D can generate immersive 3D scenes from a single text prompt by breaking a 2D panorama into depth layers.
Subsurface Scattering for 3D Gaussian Splatting
Subsurface Scattering for Gaussian Splatting can render and relight translucent objects in real time. It allows for detailed material editing and achieves high visual quality at around 150 FPS.
TEDRA: Text-based Editing of Dynamic and Photoreal Actors
TEDRA can edit dynamic 3D avatars based on text prompts. It allows detailed changes to clothing styles while ensuring high quality and smooth movement using a personalized diffusion model.
Image
One-Shot Diffusion Mimicker for Handwritten Text Generation
One-DM can generate handwritten text from a single reference sample, mimicking the style of the input. It captures unique writing patterns and works well across multiple languages.
LinFusion: 1 GPU, 1 Minute, 16K Image
LinFusion can generate high-resolution images up to 16K in just one minute using a single GPU. It improves performance on various Stable Diffusion versions and works with pre-trained components like ControlNet and IP-Adapter.
CSGO: Content-Style Composition in Text-to-Image Generation
CSGO can perform image-driven style transfer and text-driven stylized synthesis. It uses a large dataset with 210k image triplets to improve style control in image generation.
Iterative Object Count Optimization for Text-to-image Diffusion Models
Iterative Object Count Optimization can improve object counting accuracy in text-to-image diffusion models.
MagicFace: Training-free Universal-Style Human Image Customized Synthesis
MagicFace can generate high-quality images of people in any style without needing extra training.
CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis
CrossViewDiff can generate high-quality street-view images from satellite-view images using a cross-view diffusion model.
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
SwiftBrush v2 can improve the quality of images generated by one-step text-to-image diffusion models. Results look great, and apparently it ranks better than all GAN-based and multi-step Stable Diffusion models in benchmarks. No code though 🤷‍♂️
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning
MegaFusion can extend existing diffusion models for high-resolution image generation. It generates images up to 2048x2048 at only 40% of the original computational cost by enhancing the denoising process across different resolutions.
Video
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
ViewCrafter can generate high-quality 3D views from single or few images using a video diffusion model. It allows for precise camera control and is useful for real-time rendering and turning text into 3D scenes.
Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation
Follow-Your-Canvas can outpaint videos at higher resolutions, from 512x512 to 1152x2048.
Thin-Plate Spline-based Interpolation for Animation Line Inbetweening
tps-inbetween can generate high-quality intermediate frames for animation line art. It effectively connects lines and fills in missing details, even during fast movements, using a method that models keypoint relationships between frames.
TVG: A Training-free Transition Video Generation Method with Diffusion Models
TVG can create smooth transition videos between two images without needing training. It uses diffusion models and Gaussian Process Regression for high-quality results and adds controls for better timing.
PersonaTalk: Bring Attention to Your Persona in Visual Dubbing
PersonaTalk can achieve high-quality visual dubbing while keeping the speaker’s unique style and facial details. It works with audio alone, a guidance video, or dubbing into another language.
PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation
PoseTalk can generate lip-synchronized talking head videos from a single image, audio, and text prompts. It allows for free head poses and uses a Pose Latent Diffusion model to create diverse poses.
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
Loopy can generate lifelike video portraits from audio input. It captures non-speech movements and emotions without needing motion templates, resulting in high-quality outputs.
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
DepthCrafter can generate long high-quality depth map sequences for videos. It uses a three-stage training method with a pre-trained image-to-video diffusion model, achieving top performance in depth estimation for visual effects and video generation.
Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
Generative Inbetweening can create smooth video sequences between two keyframes.
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
CustomCrafter can generate high-quality videos from text prompts and reference images. It improves motion generation with a Dynamic Weighted Video Sampling Strategy and allows for better concept combinations without needing extra video or fine-tuning.
TrackGo: A Flexible and Efficient Method for Controllable Video Generation
TrackGo can generate controllable videos by letting users move objects with free-form masks and arrows.
Audio
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Draw an Audio can generate high-quality audio that matches video by using drawn masks and loudness signals.
Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
Audio Match Cutting can automatically find and create smooth audio transitions between video shots.
Also interesting
@machine_mythos created this Animal Farm inspired AI short video using the new Hailuo text-to-video model.
@techhalla shared a tutorial on how to create your own South Park 3D characters.
@Visu_AI_Poetry asked neural networks to visualize the intense surge of brain activity which lasts about 30 seconds after the heart stops.
@doopiidoop created this hauntingly wonderful AI music video. Sleep well!
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it; putting these issues together takes me 8-12 hours every Friday 😅)
- Buying my Midjourney prompt collection on PROMPTCACHE 🚀
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa