AI Toolbox
A curated collection of 915 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

PicoAudio is a temporally controlled audio generation framework. The model can generate audio with precise control over event timestamps and occurrence frequency.
PartGLEE can locate and identify objects and their parts in images. The method uses a unified framework that enables detection, segmentation, and grounding at any granularity.
MIGC++ is a plug-and-play controller that enables Stable Diffusion with precise position control while ensuring the correctness of various attributes like color, shape, material, texture, and style. It can also control the number of instances and improve interaction between instances.
AniPortrait can generate high-quality portrait animations driven by audio and a reference portrait image. It also supports face reenactment from a reference video.
DiffIR2VR-Zero is a zero-shot video restoration method that can be used with any 2D image restoration diffusion model. The method can perform 8× super-resolution and high-standard-deviation video denoising.
DIRECTOR can generate complex camera trajectories from text that describes the relationship and synchronization between the camera and characters.
FoleyCrafter can generate high-quality sound effects for videos. Results aim to be semantically relevant and temporally synchronized with the video. It also supports text prompts for finer control over the video-to-audio generation.
Motion Prompting can control video generation using motion paths. It allows for camera control, motion transfer, and drag-based image editing, producing realistic movements and physics.
StyleShot can mimic and transfer various styles from an image, such as 3D, flat, abstract, or even fine-grained styles, without tuning.
MimicMotion can generate high-quality videos of arbitrary length that mimic specific motion guidance. The method can produce videos of up to 10,000 frames with acceptable resource consumption.
AnyControl is a new text-to-image guidance method that can generate images from diverse control signals, such as color, shape, texture, and layout.
Text-Animator can depict the structures of visual text in generated videos. It supports camera control and text refinement to improve the stability of the generated visual text.
BRDF-Uncertainty can estimate the properties of the materials on an object’s surface in seconds given its geometry and a lighting environment.
MotionBooth can generate videos of customized subjects from a few images and a text prompt with precise control over both object and camera movements.
Director3D can generate real-world 3D scenes and adaptive camera trajectories from text prompts. The method is able to generate pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising.
MoMo is a new video frame interpolation method that can generate intermediate frames with high visual quality and reduced computational demands.
FreeTraj is a tuning-free approach that enables trajectory control in video diffusion models by modifying noise sampling and attention mechanisms.
Portrait3D can generate high-quality 3D heads with accurate geometry and texture from a single in-the-wild portrait image.
MIRReS can reconstruct and optimize the explicit geometry, material, and lighting of objects from multi-view images. The resulting 3D models can be edited and relit in modern graphics engines or CAD software.
LiveScene can identify and control multiple objects in complex scenes. It can locate individual objects in different states and control them using natural language.