AI Art Weekly #107

Hello my fellow dreamers, and welcome to issue #107 of AI Art Weekly! 👋

It’s been an eventful week in AI. OpenAI’s 12 days of hype continue, while we’ve seen the release of a new SOTA open-source video model, a glimpse into the future of gaming, and 19 noteworthy papers (carefully selected from 227 publications by yours truly).

Quick reminder: The 40% Cyberweek discount for Premium and the Midjourney Prompt Library ends this week. Don’t miss out on these significant savings!

I’ll return in two weeks with the final issue of 2024. Until then, stay creative! 🙏


Cover Challenge 🎨

Theme: error
32 submissions by 19 artists
🏆 1st: @ManoelKhan
🥈 2nd: @elfearsfoxsox
🥈 2nd: @skidmarxist1
🥉 3rd: @mamaralic

News & Papers

Highlights

HunyuanVideo

Tencent released a new state-of-the-art 13B open-source text-to-video model this week.

The weights are available on HuggingFace, but the model requires at least 45GB of VRAM to run. Luckily, it’s already available on Fal and Replicate, although it currently takes 8 minutes on a single H100 GPU to generate a 5-second clip.

However, it’s fair to expect that speeds are only going to improve with quantization and more GPUs (e.g. 4xH100)!

A cat walks on the grass, realistic style
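For those who want to try it locally rather than via Fal or Replicate, here’s a minimal sketch of what inference could look like with a diffusers-style pipeline. Treat the repo id, pipeline class, and parameters as assumptions and double-check the HuggingFace model card; even in bfloat16 you’ll still need roughly the 45GB of VRAM mentioned above.

```python
# Minimal local-inference sketch for HunyuanVideo (assumptions: diffusers-format
# weights and a HunyuanVideoPipeline class; verify against the model card).
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

# Repo id is illustrative; the official Tencent weights may require a
# diffusers-format mirror or conversion before from_pretrained works.
pipe = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo", torch_dtype=torch.bfloat16
)
pipe.vae.enable_tiling()  # reduces peak VRAM during frame decoding
pipe.to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic style",
    height=720,
    width=1280,
    num_frames=129,          # roughly a 5-second clip at ~24 fps
    num_inference_steps=30,
).frames[0]

export_to_video(frames, "cat.mp4", fps=24)
```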

Genie 2

Google DeepMind revealed the next generation of their Genie model. Advancing from its 2D predecessor, Genie 2 can generate playable 3D worlds from a single image, controllable via keyboard and mouse inputs. Some key features are:

  • It can create 3D worlds from text/image
  • Simulates physics and character animations
  • Supports multiple camera perspectives (FPS, isometric, third-person)
  • Enables NPC interactions
  • Maintains consistency for up to 60 seconds

Now, this isn’t a public release, but it’s nonetheless extremely interesting. A few weeks ago, game devs dismissed earlier versions of this tech; now look at it. Future games won’t require engines or development time. You’ll simply imagine a game and play it within seconds.

Genie 2 generates new plausible content on the fly and maintains a consistent world for up to a minute

MV-Adapter

Being able to generate consistent multi-view images is the key to good 3D generation, and MV-Adapter is the newest tool for the task. It can create up to 40 views from text alone, an image, or different ControlNet conditions, and works with a range of SDXL models.

MV-Adapter examples

3D

Trellis 3D: Structured 3D Latents for Scalable and Versatile 3D Generation

Trellis 3D generates high-quality 3D assets in formats like Radiance Fields, 3D Gaussians, and meshes. It supports text and image conditioning, offering flexible output format selection and local 3D editing capabilities.

Structured 3D Latents for Scalable and Versatile 3D Generation example

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

MIDI can generate 3D scenes from a single image using a multi-instance diffusion model. It processes scenes in about 40 seconds and effectively captures how objects interact in space.

MIDI example

SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

SceneFactor generates 3D scenes from text using an intermediate 3D semantic map. This map can be edited to add, remove, resize, and replace objects, allowing for easy regeneration of the final 3D scene.

SceneFactor example

3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

3DSceneEditor can edit complex 3D scenes in real-time using Gaussian Splatting. It allows users to add, move, change colors, replace, and delete objects based on prompts.

3DSceneEditor example

TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting

TexGaussian can generate high-quality PBR materials for 3D meshes in one step. It produces albedo, roughness, and metallic maps quickly and with great visual quality, ensuring better consistency with the input geometry.

TexGaussian example

Image

Anagram-MTL: Diffusion-based Visual Anagram as Multi-task Learning

Anagram-MTL can generate visual anagrams that change appearance with transformations like flipping or rotating.

Diffusion-based Visual Anagram as Multi-task Learning example

Negative Token Merging: Image-based Adversarial Feature Guidance

Negative Token Merging can improve image diversity by pushing apart similar features during the reverse diffusion process. It reduces visual similarity with copyrighted content by 34.57% and works well with Stable Diffusion as well as Flux.

Negative Token Merging example

Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

Generative Photography can generate consistent images from text with an understanding of camera physics. The method can control camera settings such as bokeh and color temperature to create consistent images with different effects.

Generative Photography example

InstantSwap: Fast Customized Concept Swapping across Sharp Shape Differences

InstantSwap can swap concepts in images from a reference image while keeping the foreground and background consistent. It uses automated bounding box extraction and cross-attention to make the process more efficient by reducing unnecessary calculations.

InstantSwap examples

ControlFace: Harnessing Facial Parametric Control for Face Rigging

ControlFace can edit face images with precise control over pose, expression, and lighting. It uses a dual-branch U-Net architecture and is trained on facial videos to ensure high-quality results while keeping the person’s identity intact.

ControlFace example

Video

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

MEMO can generate talking videos from images and audio. It keeps the person’s identity consistent and matches lip movements to the audio, producing natural expressions.

MEMO example

Motion Prompting: Controlling Video Generation with Motion Trajectories

Motion Prompting can control video generation using motion paths. It allows for camera control, motion transfer, and drag-based image editing, producing realistic movements and physics.

Motion Prompting example

Imagine360: Immersive 360 Video Generation from Perspective Anchor

Imagine360 can generate high-quality 360° videos from monocular single-view videos.

Imagine360 example

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

Align3R can estimate depth maps, point clouds, and camera positions from single videos.

Align3R example

MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection

MamKPD is a lightweight pose estimation framework that detects 2D keypoints in real time, achieving 1492 frames per second on an NVIDIA RTX 4090 GPU.

MamKPD example

One Shot, One Talk: Whole-body Talking Avatar from a Single Image

One Shot, One Talk can create a fully expressive whole-body talking avatar from a single image. It uses pose-guided image-to-video diffusion models for realistic animation.

One Shot, One Talk example

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

FLOAT can create talking portrait videos from a single image and audio file.

FLOAT example

Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

Synergizing Motion and Appearance can generate high-quality talking head videos by combining facial identity from a source image with motion from a driving video.

Synergizing Motion and Appearance example

VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models

VISION-XL can deblur and upscale videos using SDXL. It supports different aspect ratios and can produce HD videos in under 2.5 minutes on a single NVIDIA RTX 4090 GPU, using only 13GB of VRAM for 25-frame videos.

VISION-XL example

Also interesting

I started a new persona called @theamberman, under which I post art more frequently.

And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:

  • Sharing it 🙏❤️
  • Following me on Twitter: @dreamingtulpa
  • Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
  • Buying my Midjourney prompt collection on PROMPTCACHE 🚀
  • Buying access to AI Art Weekly Premium 👑

Reply to this email if you have any feedback or ideas for this newsletter.

Thanks for reading and talk to you in two weeks!

– dreamingtulpa

by @dreamingtulpa