Hello there, my fellow dreamers, and welcome to issue #52 of AI Art Weekly! 👋
Another crazy week in AI lies behind us: ChatGPT goes multi-modal (more below), Tesla showed us a sneak peek of their autonomous humanoid robot Optimus, Meta announced their new AI-powered Ray-Ban smart glasses, and Lex Fridman had a conversation with Mark Zuckerberg in the Metaverse as photorealistic avatars 🤯
Meanwhile I’m down with the flu, so before we get on with this week’s highlights, I let GitHub Copilot finish this intro for me: “I’m sick and tired of being sick and tired” 😅
Here are the highlights:
- GPT-4 goes multi-modal
- DreamGaussian: Efficient 3D asset generation with Generative Gaussian Splatting
- TempoTokens turns audio into video
- Show-1 is a new memory efficient text-to-video model
- AnimeInbet generates inbetween frames for cartoon line drawings
- and more tutorials, tools and gems!
Cover Challenge 🎨
News & Papers
GPT-4 goes multi-modal
Just last week OpenAI announced that DALL·E 3 was going to build on top of ChatGPT. This week they announced that they’ll finally add vision (and voice) capabilities. This means you’ll be able to give ChatGPT an image and interact with it. Just imagine being able to talk to your art; it’s going to be a reality in the next two weeks. I also wonder how the vision capabilities are going to affect image generation with DALL·E: if they nail editing images with natural language, this could be a true game changer.
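For now the vision feature lives in the ChatGPT apps, but if OpenAI exposes it through the API, it would presumably reuse the familiar chat format with image parts mixed into a message. Here’s a minimal sketch of what such a message might look like; the payload shape, field names, and image-by-URL approach are all assumptions on my part, not a published spec:

```python
# Hypothetical sketch of a multi-modal chat message, assuming a future API
# that mixes text and image parts inside one user message. The exact shape
# is an assumption, not documented behavior.

def build_vision_message(prompt: str, image_url: str) -> dict:
    """Pair a text prompt with an image in a single user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

# Example: ask the model to critique an artwork hosted at some URL.
message = build_vision_message(
    "What mood does this painting convey?",
    "https://example.com/my-artwork.png",
)
```

The idea is simply that “talking to your art” becomes one more message in an ordinary chat conversation, with the image riding along as structured content.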
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
Generative 3D just got an upgrade. DreamGaussian is a new Gaussian Splatting method that is able to generate high-quality textured 3D meshes from text or a single image in just 2 minutes. That’s about 10 times faster than comparable NeRF-based methods.
RealFill: Reference-Driven Generation for Authentic Image Completion
Imagine you have a lot of similar photos of a memory, but none of them are perfect or show the whole picture. RealFill solves exactly that. Similar to how diffusion inpainting works, RealFill can complete and extend an image based on similar reference images.
TempoTokens: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
While we’ve seen image- and video-to-audio, we haven’t seen much audio-to-video. TempoTokens is changing that. The method is able to generate videos based on an input sound. Quite impressive.
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Show-1 is a new text-to-video diffusion model that is able to produce high-quality videos with precise text-video alignment. Compared to pixel-only video diffusion models, Show-1 is much more efficient, requiring only 15 GB of GPU memory during inference instead of 72 GB.
AnimeInbet: Deep Geometrized Cartoon Line Inbetweening
AnimeInbet is a method that is able to generate inbetween frames for cartoon line drawings. Seeing this, we’ll hopefully be blessed with higher-framerate anime in the near future.
More papers & gems
- Decaf: Monocular Deformation Capture for Face and Hand Interactions
- LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models
- VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning
- IDInvert: In-Domain GAN Inversion for Real Image Editing
- CCEdit: Creative and Controllable Video Editing via Diffusion Models
Tools & Tutorials
These are some of the most interesting resources I’ve come across this week.
And that my fellow dreamers, concludes yet another AI Art weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!