AI Art Weekly #107
Hello my fellow dreamers, and welcome to issue #107 of AI Art Weekly! 👋
It’s been an eventful week in AI. OpenAI’s 12 days of hype continue, while we’ve seen the release of a new SOTA open-source video model, a glimpse into the future of gaming, and 19 noteworthy papers (carefully selected from 227 publications by yours truly).
Quick reminder: The 40% Cyberweek discount for Premium and the Midjourney Prompt Library ends this week. Don’t miss out on these significant savings!
I’ll return in two weeks with the final issue of 2024. Until then, stay creative! 🙏
Unlock the full potential of AI-generated art with my curated collection of 200+ high-quality Midjourney SREF codes and 1000+ creative prompts.
Cover Challenge 🎨
News & Papers
Highlights
HunyuanVideo
Tencent released a new state-of-the-art 13B open-source text-to-video model this week.
The weights are available on HuggingFace, but the model requires at least 45GB of VRAM to run. Luckily, it’s already available on Fal and Replicate, although it currently takes 8 minutes on a single H100 GPU to generate a 5-second clip.
However, it’s fair to expect that speeds are only going to improve with quantization and more GPUs (e.g. 4xH100)!
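If you want to try it without provisioning a GPU yourself, here’s a minimal sketch using Replicate’s Python client. Treat the model slug and input fields as assumptions and check the model page for the exact schema:

```python
# pip install replicate
# Expects REPLICATE_API_TOKEN to be set in your environment.
import replicate

# Note: the model slug and input fields below are assumptions --
# verify them against the HunyuanVideo page on Replicate.
output = replicate.run(
    "tencent/hunyuan-video",
    input={"prompt": "a red fox running through a snowy forest, cinematic"},
)
print(output)  # usually a URL pointing to the generated clip
```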
Genie 2
Google DeepMind revealed the next generation of their Genie model. Advancing from its 2D predecessor, Genie 2 can generate playable 3D worlds from a single image that can be controlled via keyboard and mouse inputs. Some key features are:
- It can create 3D worlds from text/image
- Simulates physics and character animations
- Supports multiple camera perspectives (FPS, isometric, third-person)
- Enables NPC interactions
- Maintains consistency for up to 60 seconds
Now, this isn’t a public release, but it’s nonetheless extremely interesting. A few weeks ago, game devs dismissed earlier versions of this tech. Now look at it. Future games won’t require engines or development time: you’ll simply imagine a game and play it within seconds.
MV-Adapter
Being able to generate consistent multi-view images is the key to good 3D generation, and MV-Adapter is the newest tool for that task. It can create up to 40 views from text alone, a single image, or various ControlNet inputs, and it works with a range of SDXL models.
3D
Trellis 3D: Structured 3D Latents for Scalable and Versatile 3D Generation
Trellis 3D generates high-quality 3D assets in formats like Radiance Fields, 3D Gaussians, and meshes. It supports text and image conditioning, offering flexible output format selection and local 3D editing capabilities.
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
MIDI can generate 3D scenes from a single image using a multi-instance diffusion model. It processes scenes in about 40 seconds and effectively captures how objects interact in space.
SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation
SceneFactor generates 3D scenes from text using an intermediate 3D semantic map. This map can be edited to add, remove, resize, and replace objects, allowing for easy regeneration of the final 3D scene.
3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting
3DSceneEditor can edit complex 3D scenes in real-time using Gaussian Splatting. It allows users to add, move, change colors, replace, and delete objects based on prompts.
TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting
TexGaussian can generate high-quality PBR materials for 3D meshes in one step. It produces albedo, roughness, and metallic maps quickly and with great visual quality, ensuring better consistency with the input geometry.
Image
Anagram-MTL can generate visual anagrams that change appearance with transformations like flipping or rotating.
Negative Token Merging: Image-based Adversarial Feature Guidance
Negative Token Merging can improve image diversity by pushing apart similar features during the reverse diffusion process. It reduces visual similarity with copyrighted content by 34.57% and works well with Stable Diffusion as well as Flux.
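The core trick is easy to picture: if two samples in a batch develop near-identical features during denoising, nudge them apart. Here’s a toy sketch of that idea (my own simplification, not the paper’s actual code; the names and threshold values are invented):

```python
import torch
import torch.nn.functional as F

def push_apart(feat_a, feat_b, strength=0.1, threshold=0.9):
    """Toy sketch: repel feat_a from feat_b where they are too similar.

    A simplified illustration of the negative-token-merging idea,
    not the paper's implementation; all names/values are invented.
    """
    sim = F.cosine_similarity(feat_a, feat_b, dim=-1)   # [batch]
    mask = (sim > threshold).float().unsqueeze(-1)      # [batch, 1]
    # Move feat_a along the direction pointing away from feat_b.
    return feat_a + strength * mask * (feat_a - feat_b)

# Example: two batches of 4 token features with 8 dims each.
a, b = torch.randn(4, 8), torch.randn(4, 8)
a_new = push_apart(a, b)
```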
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
Generative Photography can generate images from text with an understanding of camera physics. The method can control camera settings like bokeh and color temperature to create scene-consistent images with different effects.
InstantSwap: Fast Customized Concept Swapping across Sharp Shape Differences
InstantSwap can swap concepts in images from a reference image while keeping the foreground and background consistent. It uses automated bounding box extraction and cross-attention to make the process more efficient by reducing unnecessary calculations.
ControlFace: Harnessing Facial Parametric Control for Face Rigging
ControlFace can edit face images with precise control over pose, expression, and lighting. It uses a dual-branch U-Net architecture and is trained on facial videos to ensure high-quality results while keeping the person’s identity intact.
Video
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
MEMO can generate talking videos from images and audio. It keeps the person’s identity consistent and matches lip movements to the audio, producing natural expressions.
Motion Prompting: Controlling Video Generation with Motion Trajectories
Motion Prompting can control video generation using motion paths. It allows for camera control, motion transfer, and drag-based image editing, producing realistic movements and physics.
Imagine360: Immersive 360 Video Generation from Perspective Anchor
Imagine360 can generate high-quality 360° videos from monocular single-view videos.
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
Align3R can estimate depth maps, point clouds, and camera positions from single videos.
MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection
MamKPD is a lightweight pose estimation framework that detects 2D keypoints in real time, achieving 1492 frames per second on an NVIDIA RTX 4090 GPU.
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
One Shot, One Talk can create a fully expressive whole-body talking avatar from a single image. It uses pose-guided image-to-video diffusion models for realistic animation.
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
FLOAT can create talking portrait videos from a single image and audio file.
Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation
Synergizing Motion and Appearance can generate high-quality talking head videos by combining facial identity from a source image with motion from a driving video.
VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models
VISION-XL can deblur and upscale videos using SDXL. It supports different aspect ratios and can produce HD videos in under 2.5 minutes on a single NVIDIA RTX 4090 GPU, using only 13GB of VRAM for 25-frame videos.
Also interesting
--sref 3229281181
Surreal Syntax is a vibrant and whimsical digital art style that combines playful urban themes with contemporary cinematic elements in shades of turquoise, perfect for creating eye-catching and imaginative images in Midjourney.
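To use it, just append the code to any prompt, for example: `/imagine prompt: a lone cyclist crossing a rain-slicked plaza at dusk --sref 3229281181` (the subject here is just a placeholder; the sref code carries the style).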
BRIA released a new Generative Fill ControlNet model that works together with their FAST LORA LCM finetune and can inpaint images in under two seconds.
@Kijaidesign already cracked video-to-video with the HunyuanVideo model, and it passed the vid2vid hippo test. A clip of 101 frames at 768x432 took about 2 minutes to render.
@pablostanley shared an interesting concept for a user interface that lets you change the style of an image by using a style matrix.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying my Midjourney prompt collection on PROMPTCACHE 🚀
- Buying access to AI Art Weekly Premium 👑
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa