Hello there, my fellow dreamers, and welcome to issue #70 of AI Art Weekly! 👋
Was extremely busy this week experimenting with detection and tracking models for Shortie and found a solution that is fast and accurate enough. If things go well, I'll have an MVP up next week. Wish me luck! 🤞
In the meantime, let’s see what’s new in the world of Generative AI art!
- Video-LaVIT: a multi-modal LLM that can generate images and videos
- ConsistI2V generates image-to-video with more consistency
- Direct-a-Video controls camera movement and object motion for text-to-video
- Boximator generates rich and controllable motions for image-to-video
- ConsiStory maintains subject consistency in text-to-image
- LGM generates high-resolution 3D mesh objects
- Holo-Gen generates PBR material properties for 3D objects
- Stability AI has been working on a text-to-speech model
- EmoSpeaker generates talking-head videos
- Interview with AI artist blanq
- and more!
Cover Challenge 🎨
News & Papers
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Video-LaVIT is a multi-modal video-language method that can comprehend and generate both image and video content, and it supports long video generation.
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
ConsistI2V is an image-to-video method with enhanced visual consistency. Compared to other methods, it better maintains the subject, background, and style from the first frame and ensures a fluid and logical progression, while also supporting long video generation and camera motion control.
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion
In the controllability department we got Direct-a-Video. The framework can individually or jointly control camera movement and object motion in text-to-video generations. This means you can generate a video and tell the model to move the camera from left to right, zoom in or out, and move objects around in the scene.
Boximator: Generating Rich and Controllable Motions for Video Synthesis
As usual, one paper seldom comes alone. Boximator is a method that can generate rich and controllable motions for image-to-video generations by drawing box constraints and motion paths onto the image.
ConsiStory: Training-Free Consistent Text-to-Image Generation
First InstantID, then StableIdentity, and now ConsiStory: the third paper in 4 weeks that tries to achieve consistent subject identity without fine-tuning. Compared to other methods, ConsiStory is able to successfully follow text prompts while maintaining subject consistency. The model also supports multi-subject scenarios and even enables training-free personalization for common objects.
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation
LGM can generate high-resolution 3D mesh objects from text prompts or a single image. The model is able to generate 3D objects within 5 seconds while boosting the training resolution to 512, resulting in high-fidelity and efficient 3D content creation. There is a HuggingFace demo if you want to give it a try. It’s still not good enough to turn my PFP into a 3D model though 😢
Holo-Gen: Collaborative Control for Geometry-Conditioned PBR Image Generation
Now we've got meshes, but what if we want to re-texture them? Unity published Holo-Gen this week. The method can generate physically-based rendering (PBR) material properties for 3D objects.
Natural language guidance of high-fidelity text-to-speech models with synthetic annotations
Stability has been researching text-to-speech capabilities that let you control speaker identity and style with natural language text prompts. Their trained model is able to generate high-fidelity speech with a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions. It hasn't been open-sourced yet, but I'm sure it will be at some point.
EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation
EmoSpeaker is yet another talking-head model. This one generates talking-head videos from input audio, an emotion, and a source image. It can also produce talking heads with different emotional intensities by adjusting the fine-grained emotion control.
- λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
- Minecraft-ify: Minecraft Style Image Generation with Text-guided Image Editing for In-Game Application
- NerfEmitter: NeRF as Non-Distant Environment Emitter in Physics-based Inverse Rendering
- Denoising Diffusion via Image-Based Rendering
- Rig3DGS: Creating Controllable Portraits from Casual Monocular Videos
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
Tools & Tutorials
These are some of the most interesting resources I’ve come across this week.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it; putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!