Video AI Tools
Free video AI tools for editing, generating animations, and analyzing footage, perfect for filmmakers and content creators seeking efficiency.
Ground-A-Video can edit multiple attributes of a video using pre-trained text-to-image models without any training. It maintains consistency across frames and accurately preserves non-target areas, making it more effective than other editing methods.
LLM-grounded Video Diffusion Models can generate realistic videos from complex text prompts. They first create dynamic scene layouts with a large language model, which helps guide the video creation process, resulting in better accuracy for object movements and actions.
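The layout-grounding idea is easy to prototype: ask an LLM for per-frame bounding boxes, then rasterize them into guidance masks that tell the diffusion model where each object should appear. The minimal sketch below covers that second step; the JSON layout is a hardcoded stand-in for an LLM response, and none of the names come from the actual LLM-grounded Video Diffusion codebase.

```python
import json
import numpy as np

# Hypothetical LLM output: per-frame bounding boxes for each object,
# in normalized [x0, y0, x1, y1] coordinates (stand-in for a real LLM call).
llm_layout = json.loads("""
{
  "frames": [
    {"boxes": [{"label": "a red ball", "xyxy": [0.10, 0.60, 0.30, 0.80]}]},
    {"boxes": [{"label": "a red ball", "xyxy": [0.40, 0.45, 0.60, 0.65]}]},
    {"boxes": [{"label": "a red ball", "xyxy": [0.70, 0.30, 0.90, 0.50]}]}
  ]
}
""")

def rasterize(frame_layout, height=64, width=64):
    """Turn one frame's bounding boxes into a binary guidance mask."""
    mask = np.zeros((height, width), dtype=np.float32)
    for box in frame_layout["boxes"]:
        x0, y0, x1, y1 = box["xyxy"]
        mask[int(y0 * height):int(y1 * height),
             int(x0 * width):int(x1 * width)] = 1.0
    return mask

# One mask per frame; a layout-guided video model would use these to bias
# where the denoiser places the described object in each frame.
masks = np.stack([rasterize(f) for f in llm_layout["frames"]])
print(masks.shape, masks.sum(axis=(1, 2)))  # (3, 64, 64) and per-frame box areas
```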
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation can generate diverse and realistic videos that match natural audio samples. It uses a lightweight adaptor network to improve alignment and visual quality compared to other methods.
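The adaptor idea itself is compact: keep the text-to-video model frozen and train only a small module that maps audio features into extra conditioning tokens. The PyTorch sketch below illustrates that shape of solution; the dimensions and module names are made up and not taken from the paper.

```python
import torch
import torch.nn as nn

class AudioAdaptor(nn.Module):
    """Small trainable mapper from audio features to conditioning tokens
    for a frozen video diffusion model (illustrative dimensions only)."""
    def __init__(self, audio_dim=768, cond_dim=1024, n_tokens=8):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim * n_tokens),
        )

    def forward(self, audio_features):
        # audio_features: (batch, audio_dim), e.g. pooled embeddings from
        # a pretrained audio encoder.
        batch = audio_features.shape[0]
        tokens = self.proj(audio_features).view(batch, self.n_tokens, -1)
        return tokens  # (batch, n_tokens, cond_dim), fed to the frozen model as extra context

adaptor = AudioAdaptor()
dummy_audio = torch.randn(2, 768)
print(adaptor(dummy_audio).shape)  # torch.Size([2, 8, 1024])
```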
Show-1 can generate high-quality videos with accurate text-video alignment. It uses only 15 GB of GPU memory during inference, much less than the 72 GB needed by traditional models.
ProPainter is a new video inpainting method that is able to remove objects, complete masked videos, remove watermarks and even expand the view of a video.
Another video synthesis model that caught my eye this week is Reuse and Diffuse. This text-to-video framework can extend an initial video clip with additional frames by reusing and iterating over the original latent features. Can’t wait to give this one a try.
Hierarchical Masked 3D Diffusion Model for Video Outpainting can fill in missing parts at the edges of video frames while keeping the motion smooth. Its hierarchical coarse-to-fine approach reduces error accumulation by using multiple frames as guidance.
While ZeroScope, Gen-2, PikaLabs and others have brought us high-resolution text- and image-to-video, they all suffer from choppy transitions, crude motion, and disordered action sequencing. The new Dysen-VDM tries to tackle these issues and, while nowhere near perfect, delivers some promising results.
StableVideo is yet another vid2vid method. This one is not just a style transfer, though: the method can differentiate between foreground and background when editing a video, making it possible to reimagine the subject within an entirely different landscape.
CoDeF can process videos consistently by using a canonical content field to gather static content and a temporal deformation field to track changes over time. This allows it to perform tasks like video-to-video translation and track moving objects, such as water and smog, without needing extra training.
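The canonical-plus-deformation split is the interesting part: a single canonical image holds the static content, and a per-frame deformation field says where each output pixel samples from. The toy below reconstructs frames with grid_sample under a hand-written shift; it is a conceptual sketch of the idea, not CoDeF's implementation, which learns the deformation field with a network.

```python
import torch
import torch.nn.functional as F

# Toy setup: one canonical image and a tiny per-frame deformation field.
T, C, H, W = 4, 3, 32, 32
canonical = torch.rand(1, C, H, W)           # static content, shared by all frames

# Base sampling grid in normalized [-1, 1] coordinates, shape (H, W, 2).
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
base_grid = torch.stack([xs, ys], dim=-1)

# Per-frame deformation: here just a horizontal shift that grows over time;
# the real method learns this field conditioned on position and time.
frames = []
for t in range(T):
    offset = torch.zeros_like(base_grid)
    offset[..., 0] = 0.1 * t                  # shift the x sampling location
    grid = (base_grid + offset).unsqueeze(0)  # (1, H, W, 2)
    frame = F.grid_sample(canonical, grid, mode="bilinear",
                          padding_mode="border", align_corners=True)
    frames.append(frame)

video = torch.cat(frames, dim=0)              # (T, C, H, W)
print(video.shape)
# Editing the single canonical image (e.g. with an image model) and replaying
# the deformations is what lets edits propagate consistently across frames.
```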
TokenFlow is a new video-to-video method for temporal coherent video editing with text. We’ve seen a lot of them, but this one looks extremely good with almost no flickering and requires no fine-tuning whatsoever.
VideoComposer can generate videos with control over how they look and move using text, sketches, and motion vectors. It improves video quality by ensuring frames match well, allowing for flexible video creation and editing.
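Composing heterogeneous controls usually comes down to encoding each signal into a shared space and letting the denoiser attend to the resulting tokens. A minimal sketch of that pattern, with made-up dimensions and none of VideoComposer's real module names:

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Project text, sketch, and motion-vector features into one shared
    conditioning space (illustrative of the composition idea only)."""
    def __init__(self, dim=512):
        super().__init__()
        self.text_proj = nn.Linear(768, dim)      # e.g. pooled text-encoder features
        self.sketch_proj = nn.Conv2d(1, dim, kernel_size=8, stride=8)
        self.motion_proj = nn.Conv2d(2, dim, kernel_size=8, stride=8)

    def forward(self, text_emb, sketch, motion):
        # text_emb: (B, 768); sketch: (B, 1, 64, 64); motion: (B, 2, 64, 64)
        t = self.text_proj(text_emb).unsqueeze(1)                # (B, 1, dim)
        s = self.sketch_proj(sketch).flatten(2).transpose(1, 2)  # (B, 64, dim)
        m = self.motion_proj(motion).flatten(2).transpose(1, 2)  # (B, 64, dim)
        return torch.cat([t, s, m], dim=1)  # token sequence the denoiser attends to

fusion = ConditionFusion()
tokens = fusion(torch.randn(1, 768),
                torch.randn(1, 1, 64, 64),
                torch.randn(1, 2, 64, 64))
print(tokens.shape)  # torch.Size([1, 129, 512])
```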
Make-Your-Video can generate customized videos from text and depth information for better control over content. It uses a Latent Diffusion Model to improve video quality and reduce the need for computing power.
Control-A-Video can generate controllable text-to-video content using diffusion models. It allows for fine-tuned customization with edge and depth maps, ensuring high quality and consistency in the videos.
Make-A-Protagonist can edit videos by changing the protagonist, background, and style using text and images. It allows for detailed control over video content, helping users create unique and personalized videos.
HumanRF can capture high-quality full-body human motion from multiple video angles. It allows playback from new viewpoints at 12 megapixels and uses a 4D dynamic neural scene representation for smooth and realistic motion, making it great for film and gaming.
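A 4D dynamic scene representation is, at its core, a function from space-time coordinates to color and density; novel-view playback then comes from volume rendering that function along camera rays. The toy MLP below only shows that query interface and is far simpler than HumanRF's actual decomposed 4D feature grid.

```python
import torch
import torch.nn as nn

class Tiny4DField(nn.Module):
    """Maps (x, y, z, t) to (RGB, density) -- a toy stand-in for a
    4D dynamic scene representation."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),   # 3 color channels + 1 density
        )

    def forward(self, xyzt):
        out = self.mlp(xyzt)
        rgb = torch.sigmoid(out[..., :3])
        density = torch.relu(out[..., 3:])
        return rgb, density

field = Tiny4DField()
# Query the same 3D point at two different times: the motion lives in t.
points = torch.tensor([[0.1, 0.2, 0.3, 0.0],
                       [0.1, 0.2, 0.3, 0.5]])
rgb, density = field(points)
print(rgb.shape, density.shape)  # torch.Size([2, 3]) torch.Size([2, 1])
```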
Sketching the Future can generate high-quality videos from sketched frames using zero-shot text-to-video generation and ControlNet. It smoothly fills in frames between sketches to create consistent video content that matches the user’s intended motion.
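The in-betweening step amounts to interpolating the control signal: given sketched keyframes, build a ControlNet-style conditioning image for every intermediate frame. A minimal NumPy sketch of that interpolation, with made-up frame counts and no ties to the project's code:

```python
import numpy as np

def interpolate_controls(keyframes, key_indices, total_frames):
    """Linearly blend sketch keyframes into one control image per frame.
    keyframes: list of (H, W) arrays; key_indices: frame index of each keyframe."""
    h, w = keyframes[0].shape
    controls = np.zeros((total_frames, h, w), dtype=np.float32)
    for f in range(total_frames):
        # Clamp to the last keyframe once we pass it.
        f_clamped = min(f, key_indices[-1])
        nxt = next(i for i, k in enumerate(key_indices) if k >= f_clamped)
        prv = max(nxt - 1, 0)
        span = key_indices[nxt] - key_indices[prv]
        alpha = 0.0 if span == 0 else (f_clamped - key_indices[prv]) / span
        controls[f] = (1 - alpha) * keyframes[prv] + alpha * keyframes[nxt]
    return controls  # each slice would condition one generated frame via ControlNet

# Two sketched keyframes at frames 0 and 15, interpolated over 16 frames.
k0, k1 = np.zeros((64, 64)), np.ones((64, 64))
controls = interpolate_controls([k0, k1], [0, 15], 16)
print(controls.shape, controls[8].mean())  # (16, 64, 64), roughly 0.53
```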
Total-Recon can render scenes from monocular RGBD videos from different camera angles, like first-person and third-person views. It creates realistic 3D videos of moving objects and allows for 3D filters that add virtual items to people in the scene.
DreamPose can generate animated fashion videos from a single image and a sequence of human body poses. The method is able to capture both human and fabric motion and supports a variety of clothing styles and poses.
Follow Your Pose can generate character videos that match specific poses from text descriptions. It uses a two-stage training process with pre-trained text-to-image models, allowing for continuous pose control and editing.