AI Art Weekly #85
Hello there, my fellow dreamers, and welcome to issue #85 of AI Art Weekly! 👋
I’m gone for one week and it feels like all the AI researchers decided to publish their papers at the same time. I went through 410 papers for you today, so be warned: this issue is huge! Enjoy 🙏
In this issue:
- 3D: WE-GS, NPGA, Text-Mesh-Refinement, Diff3DS, MultiPly, PuzzleFusion++, VividDream, GenWarp, ID-to-3D, PuTT, SuperGaussian, Unique3D, DIRECT-3D, Ouroboros3D, GECO, E3Gen, Physics3D, EASI-Tex
- 4D: Topo4D, Vidu4D, Sync4D, 4Diffusion
- Motion: MotionLLM, MoverseAI, Multi-Motion
- Image: BitsFusion, Packing Collage, L-MAGIC, pOps, MultiEdits, AnyFit, Flash Diffusion, Phased Consistency Model, BIRD, Stable-Pose, SketchDeco
- Video: ToonCrafter, CV-VAE, StreamV2V, Human4DiT, UniAnimate, MotionFollower, T2V-Turbo, SF-V, Follow-Your-Emoji, MOFA-Video, InstructAvatar
- and more!
Want me to keep up with AI for you? Well, that requires a lot of coffee. If you like what I do, please consider buying me a cup so I can stay awake and keep doing what I do 🙏
Cover Challenge 🎨
For the next cover I’m looking for brutalism submissions! Reward is again $50 and a rare role in our Discord community which lets you vote in the finals. Rulebook can be found here and images can be submitted here.
News & Papers
3D
WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections
WE-GS can reconstruct high-quality 3D Gaussian Splatting scenes that support dynamic lighting conditions from unconstrained photo collections.
NPGA: Neural Parametric Gaussian Avatars
NPGA can create high-fidelity, controllable avatars from multi-view video recordings and animate the avatars using a single image or video as input.
Text-guided Controllable Mesh Refinement for Interactive 3D Modeling
Text-Mesh-Refinement can add geometric details to a coarse 3D mesh input based on a text prompt. It first generates an image and then optimizes the mesh to produce fine, detailed geometry as output.
Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering
Diff3DS can generate view-consistent 3D sketches from text or images.
MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
MultiPly can reconstruct multiple people in 3D from monocular in-the-wild videos. The results are pretty good and the method is able to handle occlusions and interactions between people.
PuzzleFusion++: Auto-agglomerative 3D Fracture Assembly by Denoise and Verify
PuzzleFusion++ is a new 3D fracture assembly method. It can take a bunch of broken 3D objects and automatically align and merge them into a single object.
VividDream: Generating 3D Scene with Ambient Dynamics
VividDream can generate explorable 4D scenes with ambient dynamics from a single image or text prompt. The method first expands an input image into a static 3D point cloud and then generates an ensemble of animated videos using video diffusion models. The result is a 4D scene that can be explored from free viewpoints with plausible ambient dynamics.
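Purely to illustrate the shape of that pipeline, here is a tiny stand-in sketch; every function below is a placeholder I made up (depth lifting, view rendering, video diffusion), not the authors' actual code.

```python
# Stand-in sketch of a VividDream-style pipeline; all functions are placeholders.
import numpy as np

def lift_to_point_cloud(image: np.ndarray) -> np.ndarray:
    """Placeholder: expand a single image into a static 3D point cloud."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    depth = np.ones((h, w))                      # stand-in for estimated depth
    return np.stack([xs, ys, depth], axis=-1).reshape(-1, 3)

def render_view(points: np.ndarray, pose_id: int) -> np.ndarray:
    """Placeholder: render the point cloud from one camera pose (pose unused here)."""
    return points[:, :2]                         # dummy projection

def animate_with_video_diffusion(view: np.ndarray, frames: int = 16) -> list:
    """Placeholder for a video diffusion model that adds ambient motion."""
    return [view] * frames

image = np.zeros((64, 64, 3))
cloud = lift_to_point_cloud(image)
clips = [animate_with_video_diffusion(render_view(cloud, i)) for i in range(4)]
# `clips` would then be fused into an explorable 4D scene representation.
```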
GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping
GenWarp can generate novel views from a single input image and preserve the semantics of the input image when generating new views. Also works with heavily stylized images.
ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling
ID-to-3D can generate personalized 3D human heads from a single image of a subject. It accurately reconstructs not only facial features but also accessories and hair, which can be meshed to provide render-ready assets.
Coarse-To-Fine Tensor Trains for Compact Visual Representations
PuTT can optimize highly compact tensor train representations, making them usable for image fitting, 3D fitting, and novel view synthesis.
SuperGaussian: Repurposing Video Models for 3D Super Resolution
SuperGaussian repurposes existing video models for 3D super-resolution, upsampling 3D models by adding geometric and appearance details.
Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image
Unique3D is yet another image-to-3D method. This one is able to generate high-quality 3D meshes with intricate textures and complex geometries from a single image.
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
DIRECT-3D can generate high-quality 3D objects from text prompts with accurate geometric details and various textures in 12 seconds on a single V100.
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
Ouroboros3D is another image-to-3D method that is able to generate high-quality 3D objects from a single image.
GECO: Generative Image-to-3D within a SECOnd
GECO can generate 3D objects from a single image in less than a second.
E3Gen: Efficient, Expressive and Editable Avatars Generation
E3Gen can generate diverse and expressive 3D avatars with full-body pose control and editing.
Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
Physics3D can predict the physical properties of materials and incorporate them into the behavior prediction process, enabling high-fidelity simulation of a wide range of materials.
EASI-Tex: Edge-Aware Mesh Texturing from Single Image
EASI-Tex can texture 3D objects with the details of a single image while respecting their geometry.
4D
Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture
Topo4D is a new method for 4D head capture that can generate high-quality dynamic facial meshes and 8K textures from videos.
Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels
Vidu4D can reconstruct high-fidelity 4D representations from a single generated video. The method is able to capture motion and deformation over time and preserves fine-grained appearance details.
Sync4D: Video Guided Controllable Dynamics for Physics-Based 4D Generation
Sync4D can transfer the motion of objects from reference videos to a variety of generated 3D Gaussians! It supports diverse reference inputs including humans, quadrupeds, and articulated objects.
4Diffusion: Multi-view Video Diffusion Model for 4D Generation
4Diffusion can generate high-quality 4D scenes from a single video.
Motion
MotionLLM: Multimodal Motion-Language Learning with Large Language Models
MotionLLM can generate single-human and multi-human motions, as well as motion captions, by fine-tuning pre-trained LLMs.
Towards Practical Single-shot Motion Synthesis
MoverseAI can mix and compose motions with a single forward pass and is up to 6.8x faster to train than other methods.
Towards Open Domain Text-Driven Synthesis of Multi-Person Motions
Multi-Motion can reconstruct natural and diverse group motions of multiple humans from a video input and textual descriptions.
Image
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
BitsFusion is a new weight quantization method that can quantize the UNet from Stable Diffusion v1.5 to 1.99 bits, producing a model that is 7.9X smaller (219MB instead of 1.72GB) while exhibiting even better generation quality than the original.
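For a rough intuition on why such tiny bit-widths shrink a checkpoint so much, here is a minimal 2-bit uniform quantization sketch in PyTorch. This is not BitsFusion's method (1.99 bits is an average, which implies a more elaborate mixed-precision scheme plus dedicated training), just an illustration of the basic idea.

```python
# Minimal 2-bit uniform quantization sketch (illustrative only; not BitsFusion's scheme).
import torch

def quantize_2bit(w: torch.Tensor):
    """Map float weights to 4 levels (2 bits) plus a per-tensor scale and offset."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 3                        # 4 levels -> 3 intervals
    q = torch.clamp(torch.round((w - lo) / scale), 0, 3).to(torch.uint8)
    return q, scale, lo

def dequantize(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor) -> torch.Tensor:
    return q.float() * scale + lo

w = torch.randn(1024, 1024)                      # a stand-in weight matrix
q, scale, lo = quantize_2bit(w)
w_hat = dequantize(q, scale, lo)
print("max abs error:", (w - w_hat).abs().max().item())
```

The arithmetic also roughly checks out: going from 16-bit weights (the 1.72GB UNet) to about 2 bits per weight is an ~8x reduction, which lands in the neighbourhood of the reported 219MB.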
A Versatile Collage Visualization Technique
Packing Collage can pack geometric elements into a given shape. The method is highly efficient and can easily accommodate various loss functions, making it suitable for various visualization applications.
L-MAGIC: Language Model Assisted Generation of Images with Coherence
L-MAGIC can generate 360-degree panoramic scenes from a single input image and a text prompt. The method is able to diffuse multiple coherent views of the scene and can also accept other input modalities, such as depth maps, sketches, and colored scripts.
pOps: Photo-Inspired Diffusion Operators
pOps can learn specific semantic operators directly on CLIP image embeddings. Each pOps operator is built upon a pretrained Diffusion Prior and can be used to apply a variety of photo-inspired effects to images.
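To make that concrete, here is a hypothetical sketch of what a learned operator over CLIP image embeddings could look like as a PyTorch module. The architecture, dimensions, and names are my assumptions for illustration, not pOps' actual operator; in the real method the output embedding would then be decoded by the pretrained Diffusion Prior.

```python
# Hypothetical sketch of an operator over CLIP image embeddings (not pOps' actual code).
import torch
import torch.nn as nn

class EmbeddingOperator(nn.Module):
    """Maps two CLIP image embeddings to a new one (e.g. "texture of A applied to B")."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 1024), nn.GELU(),
            nn.Linear(1024, dim),
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        out = self.net(torch.cat([emb_a, emb_b], dim=-1))
        return out / out.norm(dim=-1, keepdim=True)   # keep the result unit-norm like CLIP embeddings

op = EmbeddingOperator()
emb_a, emb_b = torch.randn(1, 768), torch.randn(1, 768)   # stand-ins for CLIP image embeddings
new_emb = op(emb_a, emb_b)
# `new_emb` would be fed to a pretrained diffusion prior / decoder to produce the final image.
```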
MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models
MultiEdits can make simultaneous edits across multiple objects or attributes given a single text prompt.
AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario
AnyFit is a virtual try-on method that can generate high-fidelity and robust fitting images across various scenarios.
Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation
Flash Diffusion can generate high-quality images with as few as 5 steps and is compatible with various tasks such as text-to-image, inpainting, face-swapping, and super-resolution.
Phased Consistency Model
PCM is a new consistency model that is specifically designed for multi-step image and video generation. It can generate high-resolution images and videos in up to 16 steps and achieves 1-step generation results that are superior or comparable to previous methods like LCM.
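For readers unfamiliar with consistency models, the vanilla multi-step sampling loop looks roughly like the sketch below: jump to a clean estimate, re-noise at a lower level, and repeat. The model function and noise levels here are dummies, and PCM's phased formulation differs from this plain baseline.

```python
# Generic multi-step consistency sampling loop (illustrative; dummy model, not PCM itself).
import torch

def consistency_fn(x: torch.Tensor, sigma: float) -> torch.Tensor:
    """Placeholder for a trained consistency model f(x, sigma) -> clean sample estimate."""
    return x / (1.0 + sigma)                     # dummy shrinkage standing in for a network

def multistep_sample(shape, sigmas=(80.0, 20.0, 5.0, 1.0)):
    x = torch.randn(shape) * sigmas[0]           # start from pure noise at sigma_max
    x0 = consistency_fn(x, sigmas[0])            # one jump to a clean estimate
    for sigma in sigmas[1:]:                     # each extra step: re-noise, then denoise again
        x = x0 + sigma * torch.randn(shape)
        x0 = consistency_fn(x, sigma)
    return x0

sample = multistep_sample((1, 3, 64, 64))
print(sample.shape)
```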
Blind Image Restoration via Fast Diffusion Inversion
BIRD can restore images from Gaussian blur, motion blur, and JPEG compression artifacts.
Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation
Stable-Pose is a new method for pose-guided text-to-image generation, outperforming ControlNet.
SketchDeco: Decorating B&W Sketches with Colour
SketchDeco can turn black-and-white sketches, masks, and colour palettes into realistic images without a user-defined text prompt.
Video
ToonCrafter: Generative Cartoon Interpolation
ToonCrafter can generate in-between frames for animations and allows users to control the interpolation process by providing images of keyframes.
CV-VAE: A Compatible Video VAE for Latent Generative Video Models
CV-VAE is a compatible video VAE for latent generative video models. With it, existing video models can generate four times more frames with minimal finetuning.
Looking Backward: Streaming Video-to-Video Translation with Feature Banks
StreamV2V is a new video-to-video method that can translate videos in real time with user prompts. It runs at 20 FPS on a single A100 GPU while maintaining temporal consistency. Also works great for text-to-image streaming.
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer
Human4DiT can generate high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints.
UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation
UniAnimate can animate a single image with a sequence of desired movement poses and is able to generate highly consistent videos with a length of up to one minute.
MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion
MotionFollower can edit video motion while preserving the original protagonist’s appearance and background.
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
T2V-Turbo is a new video consistency model that can generate videos from text in just 4 steps.
SF-V: Single Forward Video Generation Model
SF-V is a single-step video generation model that captures both temporal and spatial dependencies, enabling real-time high-quality video synthesis and editing.
Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation
Follow-Your-Emoji can animate a reference portrait with target landmark sequences. The method is able to control the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals.
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
MOFA-Video can generate videos from a single image using various additional control signals (such as human landmark references, manual trajectories, and even another provided video) or their combinations.
InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
InstructAvatar can generate emotionally expressive 2D avatars from an image and a text prompt. The model is able to control the emotion as well as the facial motion of avatars.
Also interesting
- Part123: Part-aware 3D Reconstruction from a Single-view Image
- RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting
- RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives
- DiffCut: Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features
- MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
- Matching Anything by Segmenting Anything
- Frieren: Efficient Video-to-Audio Generation with Rectified Flow Matching
The submissions for Claire’s “back to school” AI contest are in. I compiled a small thread with some of my favourites. Give them some love 🧡
@Martin_Haerlin made a very cool AI short using his customised and self-made workflows.
@techhalla shared his retro video game prompt: `2.5D retro video game screenshot | first person POV | showing hands holding a [object] | in front of a [place/subject] | pixelated [style] graphics --ar 16:9 --s 50 --v 6.0`
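If you want to riff on the template programmatically, a tiny helper like the one below fills in the three placeholders; the helper is just my convenience wrapper, not something techhalla shared.

```python
# Small helper to fill techhalla's prompt template (the template is his; the helper is mine).
def retro_game_prompt(obj: str, place: str, style: str = "16-bit") -> str:
    return (
        f"2.5D retro video game screenshot | first person POV | "
        f"showing hands holding a {obj} | in front of a {place} | "
        f"pixelated {style} graphics --ar 16:9 --s 50 --v 6.0"
    )

print(retro_game_prompt("lantern", "ruined cathedral"))
```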
And that my fellow dreamers, concludes yet another AI Art weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa