Text-to-Video
Free text-to-video AI tools for creating engaging video content from scripts, perfect for filmmakers, marketers, and content creators.
VideoElevator is a training-free, plug-and-play method that enhances the temporal consistency of text-to-video models and adds more photo-realistic detail by leveraging text-to-image models.
UniCtrl can improve the quality and consistency of videos made by text-to-video models. It enhances how frames connect and move together without needing extra training, making videos look better and more diverse in motion.
Video-LaVIT is a multimodal video-language method that can both comprehend and generate image and video content, and it supports long video generation.
VideoCrafter2 can generate high-quality videos from text prompts. It uses low-quality video data and high-quality images to improve visual quality and motion, overcoming data limitations of earlier models.
FreeInit can improve the quality of videos made by diffusion models without extra training. It narrows the gap between the noise the model sees at training time and the noise it starts from at inference time, making videos look better and more temporally consistent.
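As far as I understand the approach, the core trick is to re-initialize the sampling noise by keeping the low-frequency structure of a previously diffused latent and swapping in fresh Gaussian noise for the high frequencies. Below is a minimal, hedged sketch of that frequency-domain mixing in PyTorch; the tensor layout, cutoff value, and filter shape are assumptions for illustration, not FreeInit’s exact implementation.

```python
import torch

def freeinit_style_reinit(diffused_latent, freq_cutoff=0.25):
    """Blend the low-frequency part of a diffused latent with the
    high-frequency part of fresh Gaussian noise (frequency-domain mixing).

    diffused_latent: tensor shaped (C, T, H, W) -- an assumed layout.
    freq_cutoff: normalized low-pass radius -- an assumed value.
    """
    noise = torch.randn_like(diffused_latent)

    # Move both signals to the frequency domain over the temporal/spatial axes.
    lat_f = torch.fft.fftshift(torch.fft.fftn(diffused_latent, dim=(-3, -2, -1)), dim=(-3, -2, -1))
    noi_f = torch.fft.fftshift(torch.fft.fftn(noise, dim=(-3, -2, -1)), dim=(-3, -2, -1))

    # Build a simple centered low-pass mask.
    T, H, W = diffused_latent.shape[-3:]
    t = torch.linspace(-1, 1, T).view(T, 1, 1)
    h = torch.linspace(-1, 1, H).view(1, H, 1)
    w = torch.linspace(-1, 1, W).view(1, 1, W)
    radius = torch.sqrt(t**2 + h**2 + w**2)
    low_pass = (radius <= freq_cutoff).to(diffused_latent.dtype)

    # Keep low frequencies from the latent, high frequencies from fresh noise.
    mixed_f = lat_f * low_pass + noi_f * (1 - low_pass)
    mixed_f = torch.fft.ifftshift(mixed_f, dim=(-3, -2, -1))
    return torch.fft.ifftn(mixed_f, dim=(-3, -2, -1)).real
```

The re-mixed noise is then used as the starting point for another round of sampling, which is why the method needs no training at all.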
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation can generate realistic and stable videos by separating spatial and temporal factors. It improves video quality by extracting motion and appearance cues, allowing for flexible content variations and better understanding of scenes.
Given one or more style references, StyleCrafter can generate images and videos based on these referenced styles.
Diffusion Motion Transfer can turn a video into a new one described by a text prompt while preserving the input video’s motion and scene layout.
LiveSketch can automatically add motion to a single-subject sketch, guided by a text prompt that describes the desired motion. The output is a short SVG animation that can be easily edited.
VideoDreamer is a framework that can generate videos which contain the given subjects while also conforming to text prompts.
SEINE is a short-to-long video diffusion model that focuses on generative transitions and predictions. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of clips. The model can also be used for image-to-video animation and autoregressive video prediction.
FreeNoise is a method that can generate longer videos with up to 512 frames from multiple text prompts, which is about 21 seconds at 24 fps. It doesn’t require any additional fine-tuning of the video diffusion model and only takes about 20% more time than the original diffusion process.
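A quick sanity check on those numbers; only the 512 frames, 24 fps, and roughly 20% overhead come from the description above, while the baseline sampling time is a made-up placeholder.

```python
# Back-of-the-envelope check of the FreeNoise figures quoted above.
frames, fps = 512, 24
duration_s = frames / fps                   # 512 / 24 ≈ 21.3 seconds of video
baseline_minutes = 10.0                     # hypothetical cost of the original diffusion process
freenoise_minutes = baseline_minutes * 1.2  # ~20% extra time reported for FreeNoise
print(f"{duration_s:.1f} s of video; {freenoise_minutes:.1f} min vs {baseline_minutes:.1f} min baseline")
```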
MotionDirector is a method that can adapt text-to-video diffusion models to generate videos with the desired motions learned from a reference video.
FLATTEN can improve the smoothness of edited videos by using optical flow inside diffusion models. The method enhances frame-to-frame consistency without needing any extra training.
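To make “using optical flow” a bit more concrete, here is a generic, hedged sketch of flow-guided feature warping in PyTorch, the kind of cross-frame alignment that flow enables. It is not FLATTEN’s actual flow-guided attention mechanism, whose details are in the paper and repo.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(features, flow):
    """Warp per-frame features toward a neighboring frame using a dense
    optical-flow field (a generic illustration of flow-guided alignment).

    features: (B, C, H, W) feature map of frame t
    flow:     (B, 2, H, W) flow in pixels, channels ordered (dx, dy)
    """
    b, _, h, w = features.shape

    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(features)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                           # (B, 2, H, W)

    # Normalize to [-1, 1] as grid_sample expects.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)

    return F.grid_sample(features, grid, align_corners=True)
```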
LLM-grounded Video Diffusion Models can generate realistic videos from complex text prompts. They first use a large language model to create dynamic scene layouts, which then guide the video generation process and lead to more accurate object movements and actions.
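For intuition, a dynamic scene layout can be as simple as per-frame bounding boxes for each object, planned by the language model before any video sampling happens. The sketch below is a hypothetical illustration of that intermediate representation; the real layout format and guidance mechanism used by LLM-grounded Video Diffusion may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    # Normalized [0, 1] image coordinates: top-left corner plus width/height.
    x: float
    y: float
    w: float
    h: float

def interpolate_layout(start: Box, end: Box, num_frames: int) -> List[Box]:
    """Linearly interpolate one object's box across frames, turning a coarse
    start/end plan into a frame-by-frame layout that can guide generation."""
    boxes = []
    for i in range(num_frames):
        a = i / max(num_frames - 1, 1)
        boxes.append(Box(
            x=start.x + a * (end.x - start.x),
            y=start.y + a * (end.y - start.y),
            w=start.w + a * (end.w - start.w),
            h=start.h + a * (end.h - start.h),
        ))
    return boxes

# "A ball rolls from the left side of the frame to the right": the plan
# boils down to a box drifting rightwards over 16 frames.
ball_layout = interpolate_layout(
    start=Box(x=0.05, y=0.60, w=0.15, h=0.15),
    end=Box(x=0.80, y=0.60, w=0.15, h=0.15),
    num_frames=16,
)
```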
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation can generate diverse and realistic videos that match natural audio samples. It uses a lightweight adaptor network to improve alignment and visual quality compared to other methods.
Show-1 can generate high-quality videos with accurate text-video alignment. It uses only 15 GB of GPU memory during inference, far less than the 72 GB needed by previous models.
Another video synthesis model that caught my eye this week is Reuse and Diffuse. This novel framework for text-to-video generation can produce additional frames from an initial video clip by reusing and iterating over the original latent features. Can’t wait to give this one a try.
While ZeroScope, Gen-2, PikaLabs and others have brought us high-resolution text- and image-to-video, they all suffer from unsmooth transitions, crude motion, and disordered action sequences. The new Dysen-VDM tries to tackle those issues, and while nowhere near perfect, it delivers some promising results.
TokenFlow is a new video-to-video method for temporally coherent, text-guided video editing. We’ve seen a lot of them, but this one looks extremely good, with almost no flickering, and it requires no fine-tuning whatsoever.