AI Art Weekly #99
Hello there, my fellow dreamers, and welcome to issue #99 of AI Art Weekly! 👋
Went through another 452 papers for you over the last two weeks, and there are some pretty cool innovations, but the no-code issue keeps getting worse! Wanted to ship a feature for this today, but I'm attending a wedding 🤵‍♂️👰 in an hour, so I'll have to postpone that until next week 🤞
Also added a bunch of new high-quality Midjourney styles to Promptcache (now at 130+ styles), as well as a Prompt Generator (although it still needs some instructions on how to use it 😅; hint: type "[").
Anyway, enjoy your weekend and talk to you next week!
Unlock the full potential of AI-generated art with my curated collection of Midjourney SREF codes and prompts.
Cover Challenge 🎨
News & Papers
Highlights
We, Robot: Optimus, Robotaxi and Robovan
Today I woke up and thought I'd been dropped into the I, Robot timeline. Yesterday, Tesla unveiled its vision for humanity's autonomous future:
- Tesla Bot (Optimus): A humanoid robot for household chores and errands
- Robotaxi: An autonomous vehicle for personal errands and commuting
- Robovan: Autonomous transport for groups and goods
Granted, none of the above is available yet, and the robots are most likely teleoperated (for now), but it's still a fascinating glimpse into the future, even if this could all go horribly wrong 🤖🔥
3D
Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis
Trans4D can generate realistic 4D scene transitions with expressive object deformation.
AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation
AvatarGO can generate 4D human-object interaction scenes from text. It uses LLM-guided contact retargeting for accurate spatial relations and ensures smooth animations with correspondence-aware motion optimization.
UniMuMo: Unified Text, Music and Motion Generation
UniMuMo can generate outputs across text, music, and motion. It achieves this by aligning unpaired music and motion data based on rhythmic patterns.
EgoAllo: Estimating Body and Hand Motion in an Ego-sensed World
EgoAllo can estimate 3D human body pose, height, and hand parameters using images from a head-mounted device.
SynTalker: Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation
SynTalker can generate realistic full-body motions that match speech and text prompts. It allows precise control of movements, like talking while walking.
DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control
DART can generate high-quality human motions in real-time, achieving over 300 frames per second on a single RTX 4090 GPU. It combines text inputs with spatial constraints, allowing for tasks like reaching waypoints and interacting with scenes.
CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control
CLoSD can control characters in physics-based simulations using text prompts. It can navigate to goals, strike objects, and switch between sitting and standing, all guided by simple instructions.
Dessie: Disentanglement for Articulated 3D Horse Shape and Pose Estimation from Images
Dessie can estimate the 3D shape and pose of horses from single images. It also works with other large animals like zebras and cows.
FabricDiffusion: High-Fidelity Texture Transfer for 3D Garments Generation from In-The-Wild Clothing Images
FabricDiffusion can transfer high-quality fabric textures from a 2D clothing image to 3D garments of any shape.
AniSDF: Fused-Granularity Neural Surfaces with Anisotropic Encoding for High-Fidelity 3D Reconstruction
AniSDF can reconstruct high-quality 3D shapes with improved surface geometry. It can handle complex objects, including luminous, reflective, and fuzzy ones.
Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation
Flex3D can generate high-quality 3D assets from single images or text prompts.
DressRecon: Freeform 4D Human Reconstruction from Monocular Video
DressRecon can create 3D human body models from single videos. It handles loose clothing and objects well, achieving high-quality results by combining general human shapes with specific video movements.
EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation
EdgeRunner can generate high-quality 3D meshes with up to 4,000 faces at a spatial resolution of 512 from images and point clouds.
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
Disco4D can generate and animate 4D human models from a single image by separating clothing from the body. It uses diffusion models for detailed 3D representations and can model parts that are not visible in the input image.
Image
SEMat: Towards Natural Image Matting in the Wild via Real-Scenario Prior
SEMat can improve interactive image matting! It enhances network design and training to achieve better transparency, detail, and accuracy than methods like MAM and SmartMat.
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
OmniBooth can generate images with precise control over their layout and style. It allows users to customize images using masks and text or image guidance, making the process flexible and personal.
FLUX Image Restoration: Learning Efficient and Effective Trajectories for Differential Equation-based Image Restoration
FLUX-IR can restore low-quality images to high-quality ones by optimizing paths through reinforcement learning.
ControlAR: Controllable Image Generation with Autoregressive Models
ControlAR adds controls like edges, depths, and segmentation masks to autoregressive models like LlamaGen.
DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation
DisEnvisioner can generate customized images from a single visual prompt and extra text instructions. It filters out irrelevant details and provides better image quality and speed without needing extra tuning.
FreeEdit: Mask-free Reference-based Image Editing with Multi-modal Instruction
FreeEdit can edit images by adding, replacing, or removing objects without needing manual masks. It uses a special method called Decoupled Residual ReferAttention to improve detail from reference images.
Video
Pyramid Flow: Pyramidal Flow Matching for Efficient Video Generative Modeling
Pyramidal Flow Matching can generate high-quality 5- to 10-second videos at 768p resolution and 24 FPS. It uses a unified pyramidal flow matching algorithm to link flows across different stages, making video generation more efficient.
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
PhysGen can generate realistic videos from a single image and user-defined conditions, like forces and torques. It combines physical simulation with video generation, allowing for precise control over dynamics.
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes
MimicTalk can generate personalized 3D talking faces in under 15 minutes. It mimics a person’s talking style using a special audio-to-motion model, resulting in high-quality videos.
ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
ViBiDSampler can generate high-quality frames between two keyframes using a bidirectional sampling strategy. It can create 25 frames at 1024x576 resolution in just 195 seconds on a single RTX 3090 GPU, making it a strong choice for keyframe interpolation.
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
TweedieMix can generate images and videos that combine multiple personalized concepts.
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher’s Guide
VideoGuide can improve the quality of videos made by text-to-video models without needing extra training. It enhances the smoothness of motion and clarity of images, making the videos more coherent and visually appealing.
TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation
TANGO can generate high-quality body-gesture videos that match speech audio from a single video. It improves realism and synchronization by fixing audio-motion misalignment and using a diffusion model for smooth transitions.
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
MonST3R can estimate 3D shapes from videos over time, creating a dynamic point cloud and tracking camera positions. This method improves video depth estimation and separates moving from still objects more effectively than previous techniques.
Loong: Generating Minute-level Long Videos with Autoregressive Language Models
Loong can generate minute-long videos by treating text and video tokens as a single sequence.
Inverse Painting: Reconstructing The Painting Process
Inverse Painting can generate time-lapse videos of the painting process from a target artwork. It uses a diffusion-based renderer to learn from real artists’ techniques, producing realistic results across different artistic styles.
Stable Video Portraits
Stable Video Portraits can generate photorealistic videos of talking faces by using a text-to-image model and 3D Morphable Models (3DMM). It creates person-specific avatars that can be transformed into text-defined celebrities, producing smooth and high-quality videos without extra fine-tuning.
Audio
Presto!: Distilling Steps and Layers for Accelerating Music Generation
Presto! can generate 32 seconds of high-quality music in 230ms, making it the fastest option for text-to-music generation.
Also interesting
Interested in building AI apps? @atroyn created an easily digestible resource for getting started with building AI applications using LLMs.
@flngr created a Hugging Face Space that lets you modify faces in images by simply dragging and dropping facial features.
@Martin_Haerlin put together a tutorial on how he replaced the red car in Baby Driver with a green ID.3.
@thedorbrothers made a version of The Office using Hailuo AI, featuring Elon Musk, Obama, Trump, Hillary Clinton, Putin, Zuckerberg and more :)
@kentskooking created a rotating Mario made of 2,500 looping NES videos. The individual video tiles are made with ComfyUI and a hypernetwork trained on NES screenshots.
@mind_wank made this a while ago but never got around to releasing it. Now it's here. And it's great.
@PJaccetturo recreated the Princess Mononoke trailer as a live action movie using AI.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying my Midjourney prompt collection on PROMPTCACHE 🚀
- Buying a print of my art from my art shop. You can request any of my artworks to be printed, just reply to this email.
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa