Hello there, my fellow dreamers, and welcome to issue #55 of AI Art Weekly! 👋
AI developments are in full swing and I’ve another packed issue for you. Let’s jump in. The highlights this week are:
- New 2k and 4k Midjourney upscalers
- 3D-GPT: 3D modeling with large language models
- Progressive3D can do local edits on 3D assets
- DiffSketcher generates vectorized free-hand sketches
- Real-time 4D videos at 4K resolution
- LLMs can control human motion generations
- DynVideo-E can edit human-centric videos in 3D space
- PAIR Diffusion: A Comprehensive Multimodal Object-level Image Editor
- OIR-Diffusion can manipulate multiple objects in an image
- Separate Anything You Describe
- Training AI to play Pokémon
- and more tutorials, tools and gems!
Cover Challenge 🎨
News & Papers
New Midjourney Upscalers
Midjourney released two new image upscalers this week that can upscale images by a factor of 2 or 4. For instance, square 1024x1024 images can now be upscaled to a resolution of 2048x2048 or 4096x4096.
3D-GPT: 3D modeling with large language models
So far it has been tough to imagine the benefits of AI agents. Most of what we’ve seen from that domain has been focused on NPC simulations or solving text-based goals. 3D-GPT is a new framework that utilizes LLMs for instruction-driven 3D modeling, breaking down 3D modeling tasks into manageable segments to procedurally generate 3D scenes. I recently started to dig into Blender and I pray this gets open sourced at some point.
Progressive3D: Progressively Local Editing for Text-to-3D Content Creation with Complex Semantic Prompts
Generating 3D assets is one thing, editing them is another. Progressive3D can do both with a DALL·E 3 like level of prompt understanding. Its editing capabilities in particular look wild: you can select different regions of an object with 2D masks and 3D bounding boxes to define the area that should be edited.
DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models
DiffSketcher is a tool that can turn words into vectorized free-hand sketches. The method also supports the ability to define the level of abstraction, allowing for more abstract or concrete generations.
4K4D: Real-Time 4D View Synthesis at 4K Resolution
If you were impressed with last week’s 4D-GS advancements, you’ll love 4K4D. The method improves upon Gaussian Splatting and is able to render at over 400fps at 1080p resolution and 80fps at 4K resolution using an RTX 4090 GPU on common multi-view video datasets.
MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations
Last week, OmniControl showed us that it’s possible to control human motion generations through spatial control signals. This week, MoConVQ shows us that motion frameworks combined with LLMs will be able to follow and complete complex and abstract tasks through text and voice instructions.
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
DynVideo-E is an interesting approach utilizing dynamic NeRFs to edit human-centric videos in 3D space and propagate the changes to the entire video. The results look stunning.
PAIR Diffusion: A Comprehensive Multimodal Object-level Image Editor
PAIR Diffusion is a generic framework that can enable a diffusion model to control the structure and appearance properties of each object in an image. This allows for various object-level editing operations on real images such as reference image-based appearance editing, free-form shape editing, adding objects, and variations.
OIR-Diffusion: Object-aware Inversion and Reassembly for Image Editing
OIR-Diffusion is yet another image editing method. This one enables object-level fine-grained editing and is able to change the shape, color, material, category and more of multiple objects in a single image.
Separate Anything You Describe
As someone who has experimented with audio-reactive music videos in the past, AudioSep might bring me back to it. The model is able to separate audio events and musical instruments, and even enhance speech, all via natural language queries, which makes it a versatile tool for different audio tasks. A demo can be found on HuggingFace.
More papers & gems
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
- SVC: Leveraging Content-based Features from Multiple Acoustic Models for Singing Voice Conversion
- CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation
- HumanTOMATO: Text-aligned Whole-body Motion Generation
- Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Tools & Tutorials
These are some of the most interesting resources I’ve come across this week.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!