AI Toolbox
A curated collection of 610 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
SceneCraft can generate detailed indoor 3D scenes from user layouts and text descriptions. It turns 3D layouts into 2D maps, producing complex spaces with diverse textures and realistic visuals.
TweedieMix can generate images and videos that combine multiple personalized concepts.
RFNet is a training-free approach that brings better prompt understanding to image generation, adding support for prompt reasoning, conceptual and metaphorical thinking, imaginative scenarios, and more.
FreeLong can generate 128-frame videos from short video diffusion models trained on 16-frame videos, without requiring additional training. It’s not SOTA, but has just the right amount of cursedness 👌
Animate3D can animate any static multi-view 3D model.
VSTAR is a method that enables text-to-video models to generate longer videos with dynamic visual evolution in a single pass, without finetuning needed.
Hallo2 can create long, high-resolution (4K) animations of portrait images driven by audio. It allows users to adjust facial expressions with text labels, improving control and reducing issues like appearance drift and temporal artifacts.
Pyramidal Flow Matching can generate high-quality 5- to 10-second videos at 768p resolution and 24 FPS. It uses a unified pyramidal flow matching algorithm to link flows across different stages, making video creation more efficient.
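For context, the flow matching objective underneath (setting aside the pyramidal staging, which is this paper's contribution) trains a network to predict the velocity that carries noise toward data along straight interpolation paths. A minimal NumPy sketch of that core idea — all names here are illustrative, not the paper's actual code:

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t,
    plus the constant target velocity a model would regress to."""
    xt = (1.0 - t) * x0 + t * x1   # linear interpolant
    v_target = x1 - x0             # path velocity (constant along the line)
    return xt, v_target

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))  # toy "data" batch (stand-in for video latents)
x0 = rng.normal(size=(4, 8))  # Gaussian noise batch

xt, v = flow_matching_pair(x0, x1, t=0.3)
# Training would minimize ||v_theta(xt, t) - v||^2; at inference,
# integrating dx/dt = v_theta(x, t) from t=0 to t=1 maps noise to data.
```

The pyramidal variant applies this at multiple resolutions and links the per-stage flows, which is what makes the generation cheaper than running one full-resolution flow end to end.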
Trans4D can generate realistic 4D scene transitions with expressive object deformation.
AvatarGO can generate 4D human-object interaction scenes from text. It uses LLM-guided contact retargeting for accurate spatial relations and ensures smooth animations with correspondence-aware motion optimization.
And because methods always come in pairs, GenN2N is another NeRF editing method. This one can edit scenes using text prompts, colorize, upscale and inpaint them.
SEMat can improve interactive image matting! It enhances network design and training to achieve better transparency, detail, and accuracy than methods like MAM and SmartMat.
UniMuMo can generate outputs across text, music, and motion. It achieves this by aligning unpaired music and motion data based on rhythmic patterns.
OmniBooth can generate images with precise control over their layout and style. It allows users to customize images using masks and text or image guidance, making the process flexible and personal.
EgoAllo can estimate 3D human body pose, height, and hand parameters using images from a head-mounted device.
While TripoSR can generate meshes from an image, MagicClay can edit them. It’s an artist-friendly tool that allows you to sculpt regions of a mesh with text prompts while keeping other regions untouched.
TCAN can animate characters of various styles from a pose guidance video.
GAGAvatar can create 3D head avatars from a single image and enable real-time facial expression reenactment.
Generative Radiance Field Relighting can relight 3D scenes captured under a single light source. It allows for realistic control over light direction and improves the consistency of views, making it suitable for complex scenes with multiple objects.
Time Reversal makes it possible to generate in-between frames from two input images. In particular, this enables looping cinemagraphs as well as camera- and subject-motion videos.