AI Art Weekly #122

Hello, my fellow dreamers, and welcome to issue #122 of AI Art Weekly! 👋

AI progress isn’t slowing down. In the last two weeks, we saw quite a few smaller and larger methods and models get open-sourced: 16 out of 21, as a matter of fact, which is probably the highest share since I started tracking papers. It’s always great to see code actually getting released!

Meanwhile, I’ve started a new “ChatGPT” section on Promptcache where I’m collecting prompt ideas for creative multimodal image tasks that can be used for websites, designs, games, and more. I’ve added 8 so far, and I have a list of 50 more ideas lying around which I’ll add as I go.

The next issue will again be in two weeks as I’m going to take some family time. Enjoy the weekend, everybody! ✌️


News & Papers

Highlights

MAGI-1

MAGI-1 is a new autoregressive video model that appears to surpass Wan-2.1 in quality. It supports:

  • High-Resolution Video: Generates videos at 720p resolution by default, with a 2x decoder variant supporting 1440p for sharper, cinematic-quality visuals suitable for professional content creation.
  • Up to 16-Second Clips: Produces video clips up to 16 seconds long at 24 FPS, with chunk-wise generation allowing seamless extension for longer narratives or interactive media.
  • Video Continuation (V2V): Extends existing video clips by predicting subsequent frames, maintaining motion continuity and context, ideal for storytelling or game cinematics.
  • Real-Time Streaming: Delivers video chunks in real-time, enabling live applications like interactive broadcasts or virtual environments.
  • Smooth Transitions with Second-by-Second Prompts: Supports fine-grained control via text prompts for each 1-second chunk, allowing precise scene changes (e.g. a man smiling, then juggling) and controllable shot transitions while preserving object identity or scene layout (see the sketch below).

It runs on an RTX 4090 (4.5B model) or 8xH100 (24B model) with optimized memory usage (21.94 GB peak for the 4.5B model). The code and weights for MAGI-1 are available on Hugging Face.
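
To make the second-by-second prompting concrete, here’s a minimal sketch of what a per-chunk prompt schedule could look like. This is purely illustrative Python; the data structure and helper are hypothetical and not MAGI-1’s actual API:

    # Hypothetical per-chunk prompt schedule for a chunk-wise autoregressive
    # video model like MAGI-1 (illustrative only, not the model's actual API).

    chunk_prompts = [
        # (chunk index, prompt for that 1-second chunk)
        (0, "a man smiling at the camera, static shot"),
        (1, "the man raises three juggling balls"),
        (2, "the man juggling the balls, same framing and lighting"),
        (3, "slow zoom out while the man keeps juggling"),
    ]

    def prompt_for_time(t_seconds: float) -> str:
        """Return the prompt covering time t, assuming 1-second chunks."""
        index = min(int(t_seconds), len(chunk_prompts) - 1)
        return chunk_prompts[index][1]

    # A 16-second clip at 24 FPS would be 16 chunks of 24 frames each.
    for second in range(4):
        print(second, prompt_for_time(second))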

MAGI-1 example

Nari Dia-1.6B

Dia-1.6B is a new text-to-speech model that reportedly outperforms ElevenLabs in realistic dialogue generation. It supports:

  • Realistic Dialogue: Generates natural-sounding conversations from text scripts with [S1], [S2] tags for multiple speakers.
  • Non-Verbal Sounds: Produces sounds like laughter, coughs, sighs, and more using inline tags such as (laughs) and (coughs).
  • Voice Cloning: Replicates a speaker’s voice from an audio prompt for consistent tone and emotion.
  • Real-Time Audio: Generates audio in real time on enterprise GPUs (roughly 86 tokens correspond to 1 second of audio; an A4000 manages about 40 tokens/s).
  • English-Only: Currently supports English dialogue generation.

It runs on CUDA-supported GPUs with ~10GB of VRAM, with CPU support planned. The code and weights for Dia-1.6B are available on Hugging Face. Examples can be found here.
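
If you want to give it a spin, the script format is just speaker tags plus inline non-verbal cues. Here’s a minimal sketch of a generation call; the import path and the from_pretrained/generate names follow the repo’s quickstart at the time of writing and may change, so treat them as assumptions:

    # Minimal sketch of generating a two-speaker dialogue with Dia-1.6B.
    # The API names below mirror the repo's quickstart and may differ in
    # newer versions; treat them as assumptions rather than gospel.
    import soundfile as sf
    from dia.model import Dia

    script = (
        "[S1] Have you seen the new autoregressive video models? "
        "[S2] I have, the per-second prompting is wild. (laughs) "
        "[S1] Right? (coughs) Anyway, back to work."
    )

    model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # needs ~10GB VRAM on a CUDA GPU
    audio = model.generate(script)                     # returns the raw waveform

    sf.write("dialogue.wav", audio, 44100)             # 44.1 kHz output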

Dia prompt example

3D

TAPIP3D: Tracking Any Point in Persistent 3D Geometry

TAPIP3D can track any point across a video in persistent 3D geometry, keeping point trajectories consistent over time.

TAPIP3D example

CoMotion: Concurrent Multi-person 3D Motion

CoMotion can detect and track 3D poses of multiple people using just one camera. It works well in crowded places and can keep track of movements over time with high accuracy.

CoMotion example

PartField: Learning 3D Feature Fields for Part Segmentation and Beyond

PartField can segment 3D shapes into parts without relying on templates or text labels.

PartField example

HoloPart: Generative 3D Part Amodal Segmentation

HoloPart can break down 3D shapes into complete and meaningful parts, even when those parts are partially hidden. It also supports numerous downstream applications such as Geometry Editing, Geometry Processing, Material Editing, and Animation.

HoloPart example

HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

HiScene can generate high-quality 3D scenes from 2D images by treating a scene as a hierarchy of layered objects. It allows for interactive editing and effectively handles occlusions and shadows using a video-diffusion technique.

HiScene example

Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Art3D can turn flat-colored 2D illustrations into 3D representations without any training. It uses pre-trained 2D image models and a realism check to improve the 3D effect across different art styles.

Art3D example

Text

Describe Anything: Detailed Localized Image and Video Captioning

Describe Anything can generate detailed descriptions for specific areas in images and videos using points, boxes, scribbles, or masks.

Describe Anything example

Image

Step1X-Edit: A Practical Framework for General Image Editing

Step1X-Edit can perform advanced image editing tasks by processing reference images and user instructions.

Step1X-Edit example

InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

InstantCharacter can generate high-quality images of personalized characters from a single reference image with FLUX. It supports different styles and poses, ensuring identity consistency and allowing for text-based edits.

InstantCharacter example

Shape-Guided Clothing Warping for Virtual Try-On

SCW-VTON can fit in-shop clothing to a person’s image while keeping their pose consistent. It improves the shape of the clothing and reduces distortions in visible limb areas, making virtual try-on results look more realistic.

Shape-Guided Clothing Warping for Virtual Try-On example

IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design

IMAGGarment-1 can generate high-quality garments with control over shape, color, and logo placement.

IMAGGarment-1 example

Cobra: Efficient Line Art COlorization with BRoAder References

Cobra can efficiently colorize line art by utilizing over 200 reference images.

Cobra example

TryOffDiff: Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off

TryOffDiff can generate high-quality images of clothing from photos of people wearing them.

Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off example

Video

SkyReels-V2: Infinite-length Film Generative Model

SkyReels-V2 can generate infinite-length videos by combining a Diffusion Forcing framework with Multi-modal Large Language Models and Reinforcement Learning.

the first 5 seconds of a 30-second SkyReels-V2 video example

Ev-DeblurVSR: Event-Enhanced Blurry Video Super-Resolution

Ev-DeblurVSR can turn blurry, low-resolution videos into sharp, high-resolution ones by leveraging event camera data.

Event-Enhanced Blurry Video Super-Resolution example

FramePack: Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

FramePack aims to make video generation feel like image generation. It can generate single video frames in 1.5 seconds with 13B models on an RTX 4090, and it also supports full 30 fps generation with 13B models on a 6GB laptop GPU, though obviously slower.

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation example

UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

UniAnimate-DiT can generate high-quality animations from human images. It uses the Wan2.1 model and a lightweight pose encoder to create smooth and visually appealing results, while also upscaling animations from 480p to 720p.

UniAnimate-DiT example

NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

NormalCrafter can generate consistent surface normals from video sequences. It uses video diffusion models and Semantic Feature Regularization to ensure accurate normal estimation while keeping details clear across frames.

NormalCrafter example

3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models

3DV-TON can generate high-quality videos for trying on clothes using 3D models. It handles complex clothing patterns and different body poses well, and it has a strong masking method to reduce errors.

3DV-TON example

RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

RealisDance-DiT can generate high-quality character animations from images and pose sequences. It effectively handles challenges like character-object interactions and complex gestures while making only minimal changes to the Wan-2.1 video model, and it is part of the Uni3C method covered next.

RealisDance-DiT example

Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Uni3C is a video generation method that unifies precise camera control and human motion control within a single framework.

Uni3C example

Enjoy the weekend!

And that, my fellow dreamers, concludes yet another AI Art Weekly issue. If you like what I do, you can support me by:

  • Sharing it 🙏❤️
  • Following me on Twitter: @dreamingtulpa
  • Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
  • Buying my Midjourney prompt collection on PROMPTCACHE 🚀
  • Buying access to AI Art Weekly Premium 👑

Reply to this email if you have any feedback or ideas for this newsletter.

Thanks for reading and talk to you in two weeks!

– dreamingtulpa
