AI Toolbox
A curated collection of 611 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
TC4D can animate 3D scenes generated from text along arbitrary trajectories. I can see this being useful for generating 3D effects for movies or games.
TRAM can reconstruct human motion and camera movement from videos in dynamic settings. It reduces global motion errors by 60% and uses a video transformer model to accurately track body motion.
Attribute Control enables fine-grained control over attributes of specific subjects in text-to-image models. This lets you modify attributes like age, width, makeup, smile and more for each subject independently.
FlashFace can personalize photos by using one or a few reference face images and a text prompt. It keeps important details like scars and tattoos while balancing text and image guidance, making it useful for face swapping and turning virtual characters into real people.
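Balancing text and image guidance like this is usually done by composing classifier-free guidance terms. The sketch below shows a common two-branch scheme with separate weights for the reference-face branch and the text branch; it is illustrative only, and FlashFace's actual formulation and API may differ.

```python
import torch

def dual_guidance(eps_uncond, eps_ref, eps_ref_text, w_ref=2.0, w_text=5.0):
    """Blend three denoising predictions from the same diffusion step:
      eps_uncond   - no conditioning
      eps_ref      - conditioned on the reference face image(s) only
      eps_ref_text - conditioned on both the reference face(s) and the prompt
    Raising w_ref pulls the result toward the identity in the reference photos
    (scars, tattoos, etc.); raising w_text pulls it toward the text prompt."""
    return (eps_uncond
            + w_ref * (eps_ref - eps_uncond)
            + w_text * (eps_ref_text - eps_ref))

# Example: strongly preserve identity, moderately follow the prompt.
preds = [torch.randn(1, 4, 64, 64) for _ in range(3)]  # stand-ins for real UNet outputs
blended = dual_guidance(*preds, w_ref=3.0, w_text=4.5)
```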
TRIP is a new approach to image-to-video generation with better temporal coherence.
Make-It-Vivid generates high-quality texture maps for 3D biped cartoon characters from text instructions, making it possible to dress and animate characters based on prompts.
ThemeStation can generate a variety of 3D assets that match a specific theme from just a few examples. It uses a two-stage process to improve the quality and diversity of the models, allowing users to create 3D assets based on their own text prompts.
Spectral Motion Alignment is a framework that can capture complex and long-range motion patterns within videos and transfer them to video-to-video frameworks like MotionDirector, VMC, Tune-A-Video, and ControlVideo.
StreamingT2V enables long text-to-video generations featuring rich motion dynamics without any stagnation. It ensures temporal consistency throughout the video, aligns closely with the descriptive text, and maintains high frame-level image quality. Videos can be up to 1200 frames, spanning 2 minutes, and can be extended for even longer durations.
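The general recipe behind long-video generation of this kind is autoregressive chunking: generate a short clip, then keep generating new clips conditioned on the last few frames so motion stays continuous. Here is a minimal, model-agnostic sketch of that loop; `generate_chunk` is a hypothetical stand-in for any short-clip text-to-video model and is not StreamingT2V's actual interface.

```python
from typing import Callable, List
import numpy as np

def generate_long_video(
    prompt: str,
    generate_chunk: Callable[[str, List[np.ndarray]], List[np.ndarray]],
    overlap: int = 8,
    total_frames: int = 1200,
) -> List[np.ndarray]:
    """Autoregressive chunk-wise generation. Each new chunk is conditioned on
    the last `overlap` frames generated so far and is assumed to reproduce
    them at its start, so those duplicates are dropped when appending."""
    video: List[np.ndarray] = list(generate_chunk(prompt, []))   # first chunk, no context
    while len(video) < total_frames:
        context = video[-overlap:]                  # short-term memory of recent motion
        chunk = generate_chunk(prompt, context)     # continue the motion from that context
        video.extend(chunk[overlap:])               # skip frames that duplicate the context
    return video[:total_frames]
```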
ReNoise can invert an input image back into the noise space of a diffusion model, so the reconstructed image can then be edited with text prompts.
AnyV2V can edit videos using prompt-based editing and style transfer without fine-tuning. It modifies the first frame of a video and generates the edited video while keeping high visual quality.
FouriScale can generate high-resolution, high-quality images of arbitrary sizes and aspect ratios from pre-trained diffusion models without additional training.
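FouriScale operates in the frequency domain of the pretrained model's feature maps. As a rough illustration of that kind of operation (not FouriScale's actual operators), the snippet below low-pass filters a feature map via FFT, keeping the coarse structure the model learned at its training resolution while suppressing the high frequencies that tend to cause repeated patterns at larger sizes.

```python
import torch

def fourier_low_pass(feat: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Keep only low spatial frequencies of a (B, C, H, W) feature map."""
    H, W = feat.shape[-2:]
    freq = torch.fft.fftshift(torch.fft.fft2(feat.float()), dim=(-2, -1))
    yy = torch.linspace(-1, 1, H, device=feat.device).view(H, 1)
    xx = torch.linspace(-1, 1, W, device=feat.device).view(1, W)
    mask = ((xx.abs() <= cutoff) & (yy.abs() <= cutoff)).float()  # centered box filter
    out = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    return out.to(feat.dtype)

filtered = fourier_low_pass(torch.randn(1, 320, 96, 96))  # e.g. a UNet feature map
```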
FRESCO combines ControlNet with Ebsynth for zero-shot video translation that focuses on preserving the spatial and temporal consistency of the input frames.
You Only Sample Once can quickly create high-quality images from text in one step. It combines diffusion processes with GANs, allows fine-tuning of pre-trained models, and works well at higher resolutions without extra training.
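For context, the difference from standard diffusion sampling is that a distilled one-step generator maps pure noise straight to a clean latent instead of running 20-50 denoising iterations. A minimal sketch of that calling pattern, where `generator` is a hypothetical distilled denoiser rather than YOSO's actual interface:

```python
import torch

@torch.no_grad()
def one_step_sample(generator, text_emb, shape=(1, 4, 64, 64), device="cpu"):
    """Single forward pass from pure noise to a clean latent. `generator` is
    assumed to be a distilled diffusion model trained so that its prediction
    at the highest noise level is already a usable sample."""
    noise = torch.randn(shape, device=device)
    t_max = torch.full((shape[0],), 999, device=device)  # treat the input as pure noise
    return generator(noise, t_max, text_emb)             # one call, no iterative loop
```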
TexDreamer can generate high-quality 3D human textures from text and images. It uses an efficient fine-tuning strategy and a feature translator module to create realistic textures quickly while keeping important details intact.
AnimateDiff-Lightning can generate videos over ten times faster than AnimateDiff. It uses progressive adversarial diffusion distillation to combine multiple diffusion models into one motion module, improving style compatibility and achieving top performance in few-step video generation.
HoloDreamer can generate enclosed 3D scenes from text descriptions. It does so by first creating a high-quality equirectangular panorama and then rapidly reconstructing the 3D scene using 3D Gaussian Splatting.
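The bridge between those two stages is geometric: every pixel of an equirectangular panorama corresponds to a viewing direction on the sphere, so with per-pixel depth it can be lifted into a 3D point cloud that then seeds the Gaussian Splatting reconstruction. A small sketch of that lifting step using the standard spherical projection (not HoloDreamer's specific code):

```python
import numpy as np

def panorama_to_points(depth: np.ndarray, rgb: np.ndarray):
    """Lift an equirectangular panorama with per-pixel depth to a 3D point cloud.

    depth: (H, W) metric depth per pixel
    rgb:   (H, W, 3) colors
    Returns (N, 3) points and (N, 3) colors, e.g. to initialize 3D Gaussians."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    lon = (u + 0.5) / W * 2.0 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / H * np.pi      # latitude in [-pi/2, pi/2]
    dirs = np.stack([np.cos(lat) * np.sin(lon),    # unit ray direction per pixel
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)
    points = dirs * depth[..., None]
    return points.reshape(-1, 3), rgb.reshape(-1, 3)
```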
InTeX enables interactive text-to-texture synthesis for 3D content creation. It allows users to repaint specific areas and edit textures precisely, while a depth-aware inpainting model reduces 3D inconsistencies and speeds up generation.
StyleSketch is a method for extracting high-resolution stylized sketches from a face image. Pretty cool!
Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting can create high-quality 3D content from text prompts. It uses edge, depth, normal, and scribble maps in a multi-view diffusion model, enhancing 3D shapes with a unique hybrid guidance method.