Text-to-Image
Free text-to-image AI tools for creating visuals from text prompts, perfect for artists and designers in need of unique imagery.
X-Adapter can enable pretrained plugins like ControlNet and LoRA from Stable Diffusion 1.5 to work with the SDXL model without retraining. It adds trainable mapping layers for feature remapping and uses a null-text training strategy to improve compatibility and functionality.
Custom Diffusion can quickly fine-tune text-to-image diffusion models to generate new variations from just a few examples in about 6 minutes on 2 A100 GPUs. It allows for the combination of multiple concepts and requires only 75MB of storage for each additional model, which can be compressed to 5-15MB.
The Chosen One can generate consistent characters in text-to-image diffusion models using just a text prompt. It improves character identity and prompt alignment, making it useful for story visualization, game development, and advertising.
Latent Consistency Models can generate high-resolution images in just 2-4 steps, making text-to-image generation much faster than traditional methods. They require only 32 A100 GPU hours for training on a 768x768 resolution, which is efficient for high-quality results.
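If you want to try LCMs yourself, a minimal sketch using the diffusers pipeline is below (the checkpoint id and the step/guidance settings are assumptions; swap in whichever LCM weights you use):

```python
import torch
from diffusers import DiffusionPipeline

# Load an LCM checkpoint (model id assumed; any LCM-distilled weights should work).
pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
)
pipe.to("cuda")

# LCMs converge in very few denoising steps; 2-4 is usually enough.
image = pipe(
    prompt="a photo of a lighthouse at sunset, highly detailed",
    num_inference_steps=4,
    guidance_scale=8.0,
).images[0]
image.save("lcm_sample.png")
```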
PIXART-α can generate high-quality images at a resolution of up to 1024px. It reduces training time to 10.8% of Stable Diffusion v1.5, costing about $26,000 and emitting 90% less CO2.
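For inference, diffusers ships a PixArt-α pipeline; here is a minimal sketch, assuming the official 1024px checkpoint id from the PixArt-alpha release:

```python
import torch
from diffusers import PixArtAlphaPipeline

# 1024px PixArt-α checkpoint (model id assumed from the official release).
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="an astronaut riding a horse on mars, oil painting").images[0]
image.save("pixart_sample.png")
```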
InstaFlow can generate high-quality images in just one step, achieving an FID of 23.3 on MS COCO 2017-5k. It works very fast at about 0.09 seconds per image, using much less computing power than traditional diffusion models.
Similar to ControlNet and Composer, IP-Adapter is a multi-modal guidance adapter that adds image-prompt support to Stable Diffusion models trained on the same base model. The results look amazing.
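Here is a minimal sketch of using IP-Adapter through the diffusers integration (the repo, subfolder, and weight-file names are assumptions based on the public h94/IP-Adapter release, and the scale is just a starting point):

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

# Any SD 1.5-based checkpoint works, since this IP-Adapter was trained against that base model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attach the IP-Adapter weights.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # how strongly the image prompt steers generation

ref = load_image("reference.png")  # the image prompt
out = pipe(
    prompt="best quality, high quality",
    ip_adapter_image=ref,
    num_inference_steps=30,
).images[0]
out.save("ip_adapter_sample.png")
```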
AnimateDiff is a new framework that brings video generation to the Stable Diffusion pipeline, meaning you can generate videos with any existing Stable Diffusion model without having to fine-tune or train anything. Pretty amazing. @DigThatData put together a Google Colab notebook in case you want to give it a try.
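If you prefer running it locally, diffusers also provides an AnimateDiff pipeline; a minimal sketch, assuming the guoyww motion-adapter weights and a vanilla SD 1.5 base model (scheduler settings taken as typical defaults, adjust to taste):

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

# Motion module released by the AnimateDiff authors (repo id assumed).
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)

# Plug the motion module into an SD 1.5-based checkpoint; no fine-tuning needed.
model_id = "runwayml/stable-diffusion-v1-5"
pipe = AnimateDiffPipeline.from_pretrained(
    model_id, motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False,
    timestep_spacing="linspace", beta_schedule="linear", steps_offset=1,
)

output = pipe(
    prompt="a corgi running on the beach, golden hour",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(output.frames[0], "animation.gif")
```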
Text2Cinemagraph can create cinemagraphs from text descriptions, animating elements like flowing rivers and drifting clouds. It combines artistic images with realistic ones to accurately show motion, outperforming other methods in generating cinemagraphs for natural and artistic scenes.
DiffSketcher is a tool that can turn words into vectorized free-hand sketches. The method also supports the ability to define the level of abstraction, allowing for more abstract or concrete generations.
Cocktail is a pipeline for guided image generation. Compared to ControlNet, it only requires one generalized model for multiple modalities like Edge, Pose, and Mask guidance.
There is a new text-to-image player in town called RAPHAEL. The model aims to generate highly artistic images that accurately portray text prompts encompassing multiple nouns, adjectives, and verbs. This is all great, but only if someone actually releases the model for open-source consumption, as the community is craving a model that can achieve Midjourney quality.
FastComposer can generate personalized images of multiple unseen individuals in various styles and actions without fine-tuning. It is 300x-2500x faster than traditional methods and requires no extra storage for new subjects, using subject embeddings and localized attention to keep identities clear.
Expressive Text-to-Image Generation with Rich Text can create detailed images from text by using rich text formatting like font style, size, and color. This method allows for better control over styles and colors, making it easier to generate complex scenes compared to regular text.
Inst-Inpaint can remove objects from images using natural language instructions, which saves time by not needing binary masks. It uses a new dataset called GQA-Inpaint, improving the quality and accuracy of image inpainting significantly.
eDiff-I can generate high-resolution images from text prompts using different diffusion models for each stage. It also allows users to control image creation by selecting and moving words on a canvas.
Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models can quickly personalize text-to-image models using just one image and only 5 training steps. This method reduces training time from minutes to seconds while maintaining quality through regularized weight-offsets.
Reduce, Reuse, Recycle can enable compositional generation using energy-based diffusion models and MCMC samplers. It improves tasks like classifier-guided ImageNet modeling and text-to-image generation by introducing new samplers that enhance performance.
ControlNet can add control to text-to-image diffusion models. It lets users manipulate image generation using methods like edge detection and depth maps, while working well with both small and large datasets.
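For example, edge-guided generation with the diffusers ControlNet pipeline looks roughly like the sketch below (model ids and Canny thresholds are assumptions, tune them per image):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

# Build a Canny edge map to use as the control signal.
image = load_image("input.png")
edges = cv2.Canny(np.array(image), 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

out = pipe(
    prompt="a futuristic city street, ultra detailed",
    image=control,  # the edge map constrains the layout of the generated image
    num_inference_steps=30,
).images[0]
out.save("controlnet_sample.png")
```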