AI Art Weekly #108
Hello my fellow dreamers, and welcome to issue #108 of AI Art Weekly! 👋
Whew, what an issue to end the year with! After skimming through 562 papers in the last two weeks, I’ve packed this final newsletter of 2024 with 47 groundbreaking highlights for you. From video models to 3D generation techniques, we’ve come a long way this year. The newsletter is quite long, so if it’s not rendering properly, check out the web version.
I’ll be taking a short break until January to recharge and spend some time with family and play with my already four-month-old (🤯) Mini-Tulpa.
Thank you all for being part of this journey - your support means the world to me! Have fantastic holidays, and I’ll catch you in 2025 with more AI art breakthroughs! 🎄✨
Support the newsletter and unlock the full potential of AI-generated art with my hand-curated collection of 200+ high-quality Midjourney SREF codes and 1000+ creative prompts.
News & Papers
Highlights
Veo 2
Google dropped a surprise bomb this week with their new video model Veo 2. From the first previews and user tests, it looks set to be the new state of the art in video generation. The waitlist is open if you want to try it out.
Sadly for us Europeans, the EU AI Act means we can’t play with it yet :(
Sora
After teasing us since February, OpenAI finally launched Sora. The reviews? Pretty mixed. Its basic text-to-video and image-to-video results don't quite live up to the demos they showed earlier. But where it really shines is in the extras - blending videos together, making perfect loops, remixing clips, and upscaling them to look even better.
You can get it with your ChatGPT subscription, but EU folks are out of luck for now.
Pika 2.0
Pika just dropped Pika 2.0, a major new version of their video model. It comes with a new feature called ingredients that lets you mix and match different images - like a person, some fashion, and a background - and combines them all into one video.
Best part? They’re offering unlimited generations for free until the end of the week, and everyone can join in!
Midjourney Moodboards
While Midjourney’s video model is still in the works, they’ve rolled out something pretty neat: Moodboards. It’s a super simple way to create your own style codes in Midjourney. Keep an eye out for some exciting Promptcache updates coming soon 😉
3D
PRM: Photometric Stereo based Large Reconstruction Model
PRM can create high-quality 3D meshes from a single image using photometric stereo techniques. It improves detail and handles changes in lighting and materials, allowing for features like relighting and material editing.
Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation
Tactile DreamFusion can improve 3D asset generation by combining high-resolution tactile sensing with diffusion-based image priors. Supports both text-to-3D and image-to-3D generation.
MCMat: Multiview-Consistent and Physically Accurate PBR Material Generation
MCMat can generate multi-view physically-based rendering (PBR) materials.
Motion-2-to-3: Leveraging 2D Motion Data to Boost 3D Motion Generation
Motion-2-to-3 can generate realistic 3D human motions from text prompts using 2D motion data from videos. It improves motion diversity and efficiency by predicting consistent joint movements and root dynamics with a multi-view diffusion model.
Wonderland: Navigating 3D Scenes from a Single Image
Wonderland can generate high-quality 3D scenes from a single image using a camera-guided video diffusion model. It allows for easy navigation and exploration of 3D spaces, performing better than other methods, especially with images it hasn’t seen before.
MeshArt: Generating Articulated Meshes with Structure-guided Transformers
MeshArt can generate articulated 3D meshes with clean, well-structured geometry.
Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors
Illusion3D can generate 3D multiview illusions from text prompts or images.
Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale
Meshtron can generate detailed 3D meshes with up to 64K faces at a high resolution.
SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
SimAvatar can generate 3D human avatars from text prompts, creating realistic motion and detailed textures for hair and clothing.
BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation
BLADE can recover 3D human meshes from a single image by estimating perspective projection parameters.
PrEditor3D: Fast and Precise 3D Shape Editing
PrEditor3D can edit 3D shapes quickly and accurately using text prompts and rough masks. It keeps the regions you don't want modified untouched and can edit a single shape in minutes without any training.
ShapeCraft: Body-Aware and Semantics-Aware 3D Object Design
ShapeCraft can generate 3D objects that fit the human body from a base mesh using body shapes and guidance from text, images, or sketches. It ensures these objects work well in virtual settings and can be made for real-life use.
Image
ColorFlow: Retrieval-Augmented Image Sequence Colorization
ColorFlow can colorize black and white line-art and manga panels while keeping characters and objects consistent.
InvSR can upscale images in one to five steps. It achieves great results even with just one step, making it efficient for improving images in real-world situations.
Leffa: Learning Flow Fields in Attention for Controllable Person Image Generation
Leffa can generate person images based on reference images, allowing for precise control over appearance and pose.
TryOffAnyone can generate high-quality flat garment images from photos of people wearing them.
FireFlow is a FLUX-dev editing method that can perform fast image inversion and semantic editing in just 8 diffusion steps.
MV-Adapter: Multi-view Consistent Image Generation Made Easy
MV-Adapter can generate images from multiple views while keeping them consistent across views. It enhances text-to-image models like Stable Diffusion XL, supporting both text and image inputs, and achieves high-resolution outputs at 768x768.
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
FlowEdit can edit images using only text prompts with Flux and Stable Diffusion 3.
UnZipLoRA: Separating Content and Style from a Single Image
UnZipLoRA can break down an image into its subject and style. This makes it possible to create variations and apply styles to new subjects.
FashionComposer: Compositional Fashion Image Generation
FashionComposer can generate fashion images using text prompts, garment images, and human models.
Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy
Pattern Analogies can perform programmatic edits on patterns by analogy - show it an example edit and it applies the same kind of change to a new pattern. Not sure what else to say really 😅
InstructMove: Instruction-based Image Manipulation by Watching How Things Move
InstructMove can manipulate images based on instructions from users. It allows for complex edits like changing subject poses and rearranging elements while keeping the content consistent and enabling precise adjustments with masks.
InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention
InstantRestore can restore badly damaged face images in near real-time. It uses a single-step image diffusion model and a small set of reference images to keep the person’s identity.
ReF-LDM: A Latent Diffusion Model for Reference-based Face Image Restoration
ReF-LDM can restore low-quality face images by using multiple high-quality reference images.
PanoDreamer: 3D Panorama Synthesis from a Single Image
PanoDreamer can generate 360° 3D scenes from a single image by creating a panoramic image and estimating its depth. It effectively fills in missing parts and projects them into 3D space.
LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors
LayerFusion can generate images with dynamic interactions between foreground (RGBA) and background (RGB) layers. It enables seamless blending, preserves transparency, and enhances visual coherence, making it great for graphic design and digital art.
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
SwiftEdit can edit images quickly using text prompts in just 0.23 seconds.
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
AnyDressing can customize characters with any combination of clothes and text prompts.
Video
AniDoc: Animation Creation Made Easier
AniDoc can automate the colorization of line art in videos and create smooth animations from simple sketches.
FCVG: Generative Inbetweening through Frame-wise Conditions-Driven Video Generation
FCVG can create smooth video transitions between two key frames. It improves stability by defining clear paths for movement and matching lines from the input frames, ensuring coherent changes even with fast motion.
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
DisPose can generate high-quality human image animations from sparse skeleton pose guidance.
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
SynCamMaster can generate videos from different viewpoints while keeping the look and shape consistent. It improves text-to-video models for multi-camera use and allows re-rendering from new angles.
ObjCtrl-2.5D: Training-free Object Control with Camera Poses
ObjCtrl-2.5D enables object control in image-to-video generation using 3D trajectories from 2D inputs with depth information.
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
3DTrajMaster can control the 3D motions of multiple objects in videos using user-defined 6DoF pose sequences.
VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
VividFace can swap faces in videos while preserving the swapped-in identity and matching the original expressions. It handles challenges like keeping the face consistent over time and working well with different angles and lighting.
SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device
SnapGen-V can generate a 5-second video on an iPhone 16 Pro Max in just 5 seconds. It uses a compact model with 0.6 billion parameters, making it much faster than traditional server-side models that take minutes, while still keeping good quality.
LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
LinGen can generate high-resolution minute-length videos on a single GPU.
CausVid: From Slow Bidirectional to Fast Causal Video Generators
CausVid can generate high-quality videos at 9.4 frames per second on a single GPU. It supports text-to-video, image-to-video, and dynamic prompting while reducing latency with a causal transformer architecture.
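If you're wondering what "causal" means here: in a frame-causal transformer, tokens in a given frame only attend to that frame and earlier ones, so video can be streamed out frame by frame instead of generated all at once. Here's a tiny illustrative sketch of such an attention mask in PyTorch - my own toy example under that assumption, not CausVid's actual code, and all names are made up.

```python
# Toy sketch of a frame-causal attention mask (illustrative only, not CausVid's code).
# Tokens belonging to frame t may attend to any token in frames <= t, which is what
# lets a causal video transformer generate frames one after another.
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask of shape (T*N, T*N); True means attention is allowed."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_ids[None, :] <= frame_ids[:, None]  # keys (cols) vs queries (rows)

# Example: 4 frames with 3 tokens each -> a block lower-triangular mask
print(frame_causal_mask(4, 3).int())
```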
StyleMaster: Stylize Your Video with Artistic Generation and Translation
StyleMaster can stylize videos by transferring artistic styles from images while keeping the original content clear.
Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training
Latent-Reframe can control cameras in video diffusion models without extra training. It adjusts latent codes during sampling to match camera movements, achieving high-quality video generation similar to methods that require training.
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Mind the Time can generate multi-event videos with precise control over the timing of each event.
Diffusion VAS: Using Diffusion Priors for Video Amodal Segmentation
Diffusion VAS can generate masks for hidden parts of objects in videos.
MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos
MegaSaM can estimate camera parameters and depth maps from casual monocular videos.
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
INFP can create interactive agent videos from audio and a single portrait image. It enables lifelike facial expressions and head movements, supporting real-time communication at over 40 fps.
IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation
IF-MDM can generate high-quality talking head videos in real-time from a single image and audio input. It achieves a resolution of 512x512 at up to 45 frames per second and allows control over motion intensity and video quality.
Audio
Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations
Sketch2Sound can generate high-quality sounds using control signals like loudness, brightness, and pitch, along with text prompts. It allows sound artists to create sounds from vocal imitations or reference shapes.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying my Midjourney prompt collection on PROMPTCACHE 🚀
- Buying access to AI Art Weekly Premium 👑
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa