Hello there, my fellow dreamers, and welcome to issue #66 of AI Art Weekly! 👋
This week, OpenAI introduced their GPT Store, featuring an upcoming revenue program for US creators, while Rabbit unveiled the r1 pocket companion, a new mobile device that, with the aid of a Large Action Model (LAM), aims to help you achieve more with fewer apps. Both have been met with considerable hype and skepticism. Meanwhile, reality is shifting, and the line between what is real and what is fake is becoming increasingly blurred. Let’s dive in:
- A new text-to-video model by ByteDance (TikTok)
- ReplaceAnything can, well, replace anything (in images)
- PALP is a new text-to-image fine-tuning approach by Google
- Dubbing for Everyone is a new method for visual dubbing
- FMA-Net can turn blurry, low-quality videos into clear, high-quality ones
- Audio2Photoreal can generate gesturing photorealistic avatars from sound clips
- 3 different NeRF scene-editing methods
- SonicVisionLM generates sound effects for silent videos
- and more!
News & Papers
MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
ByteDance (the company behind TikTok) announced a new text-to-video model called MagicVideo-V2. The model can generate videos of up to 94 frames at 1048×1048 resolution, with both high aesthetic quality and temporal smoothness. Definitely interesting to see where ByteDance is going with this, as they sit on one of the biggest video datasets for training such models.
ReplaceAnything as you want: Ultra-high quality content replacement
ReplaceAnything is an “inpainting” framework that can be used for human replacement, clothing replacement, background replacement, and more. The results look crazy good. Code hasn’t been released yet, but there is a demo on HuggingFace.
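Since the demo runs as a Gradio Space, you can also call it from Python. Here's a minimal sketch using the `gradio_client` library; note that the Space id and the inputs below are placeholders I made up, not the demo's actual API:

```python
from gradio_client import Client

# Placeholder Space id -- grab the real one from the demo's URL on
# HuggingFace; client.view_api() lists the actual endpoints and inputs.
client = Client("ReplaceAnything/demo")  # hypothetical Space id
result = client.predict(
    "person.jpg",                     # source image (hypothetical input)
    "standing on a beach at sunset",  # replacement prompt (hypothetical input)
    api_name="/predict",              # default Gradio endpoint name
)
print(result)  # path(s) to the generated output
```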
PALP: Prompt Aligned Personalization of Text-to-Image Models
PALP is a new text-to-image fine-tuning approach by Google that focuses on personalization for a single prompt. The results look great compared to other methods, and it supports art-inspired, single-image, and multi-subject personalization.
Dubbing for Everyone: Data-Efficient Visual Dubbing using Neural Rendering Priors
Dubbing for Everyone is a new method for visual dubbing that generates an actor’s lip motions in a video to synchronize with given audio, using as little as 4 seconds of data. It can dub any video to any audio without further training, while capturing person-specific characteristics and reducing visual artifacts.
FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring
FMA-Net can turn blurry, low-quality videos into clear, high-quality ones. It jointly models the degradation and restoration processes while accounting for motion in the video through flow-guided dynamic filtering, i.e. learned, motion-aware per-pixel filter kernels.
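The paper’s code aside, the core “dynamic filtering” idea is simple to illustrate: instead of one shared convolution kernel, a network predicts a separate small kernel for every pixel, letting the filter adapt to local motion and blur. A minimal PyTorch sketch of the generic operation (the random inputs stand in for what FMA-Net’s networks would actually predict):

```python
import torch
import torch.nn.functional as F

def apply_dynamic_filters(frame: torch.Tensor, kernels: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Apply per-pixel k x k filters to a frame of shape (B, C, H, W).

    `kernels` has shape (B, k*k, H, W): one filter per spatial location,
    as a restoration network might predict. This shows the generic
    operation only, not FMA-Net's actual implementation.
    """
    b, c, h, w = frame.shape
    # Extract the k x k neighbourhood around every pixel: (B, C*k*k, H*W).
    patches = F.unfold(frame, kernel_size=k, padding=k // 2)
    patches = patches.view(b, c, k * k, h, w)
    # Softmax turns each predicted kernel into a weighted average
    # (a choice made here for stability, not taken from the paper).
    weights = kernels.softmax(dim=1).unsqueeze(1)  # (B, 1, k*k, H, W)
    return (patches * weights).sum(dim=2)          # (B, C, H, W)

frame = torch.rand(1, 3, 64, 64)     # stand-in for a video frame
kernels = torch.rand(1, 25, 64, 64)  # stand-in for predicted 5x5 kernels
print(apply_dynamic_filters(frame, kernels).shape)  # torch.Size([1, 3, 64, 64])
```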
Audio2Photoreal: From Audio to Photoreal Embodiment
Audio2Photoreal can generate full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, the model is able to output multiple possibilities of gestural motion for an individual, including face, body, and hands. The results are highly photorealistic avatars that can express crucial nuances in gestures such as sneers and smirks.
InseRF and GO-NeRF: Inserting 3D Objects into Neural Radiance Fields
Even though Gaussian Splats have seen a lot of love lately, NeRFs haven’t been abandoned. This week we got three different NeRF editing papers. The first two are about object insertion: InseRF and GO-NeRF are both methods for inserting 3D objects into existing NeRF scenes.
FPRF: Feed-Forward Photorealistic Style Transfer of Large-Scale 3D Neural Radiance Fields
The third is about style transfer. FPRF can stylize large-scale 3D NeRF scenes with multiple reference images, without additional optimization, while preserving multi-view appearance consistency.
SonicVisionLM: Playing Sound with Vision Language Models
SonicVisionLM can generate sound effects for silent videos. Unlike other methods, it uses vision language models (VLMs) to identify events within a video and then generates sounds that match the visual content.
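The paper’s exact models aside, the caption-then-synthesize idea can be roughly approximated with off-the-shelf components. A sketch assuming BLIP for frame captioning and AudioLDM for text-to-audio (my substitutions, not SonicVisionLM’s actual pipeline; file paths are placeholders):

```python
import cv2
from PIL import Image
from scipy.io import wavfile
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import AudioLDMPipeline

# Grab a frame from the middle of the (silent) clip.
cap = cv2.VideoCapture("silent_clip.mp4")  # placeholder path
cap.set(cv2.CAP_PROP_POS_FRAMES, int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) // 2)
ok, frame = cap.read()
cap.release()
image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# Describe what is happening in the frame with a VLM.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = processor(image, return_tensors="pt")
caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)

# Turn the description into a sound effect with a text-to-audio model.
audio_pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
audio = audio_pipe(f"the sound of {caption}",
                   num_inference_steps=25, audio_length_in_s=5.0).audios[0]
wavfile.write("sfx.wav", rate=16000, data=audio)  # AudioLDM outputs 16 kHz audio
```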
- Jump Cut Smoothing for Talking Heads
- MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer
Tools & Tutorials
These are some of the most interesting resources I’ve come across this week.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!