AI Toolbox
A curated collection of 959 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.
GARField can break down 3D scenes into meaningful groups. It improves the accuracy of object clustering and allows for better extraction of individual objects and their parts.
VideoCrafter2 can generate high-quality videos from text prompts. It uses low-quality video data and high-quality images to improve visual quality and motion, overcoming data limitations of earlier models.
RoHM can reconstruct complete, plausible 3D human motions from monocular videos, even when joints are occluded. Basically motion tracking on steroids, without the need for an expensive setup.
Motion tracking is one thing, generating motion from text another. STMC is a method that can generate 3D human motion from text with multi-track timeline control. This means that instead of a single text prompt, users can specify a timeline of multiple prompts with defined durations and overlaps to create more complex and precise animations.
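The multi-track timeline idea can be sketched as a simple data structure. This is a minimal illustration of "multiple prompts with defined durations and overlaps"; the class and function names below are invented for the sketch and are not STMC's actual API:

```python
from dataclasses import dataclass

@dataclass
class PromptSpan:
    """One text prompt scheduled on a timeline track (illustrative, not STMC's API)."""
    text: str    # motion description, e.g. "wave with the right hand"
    track: int   # which timeline track the prompt lives on
    start: float # start time in seconds
    end: float   # end time in seconds

def overlapping(a: PromptSpan, b: PromptSpan) -> bool:
    """True if two spans overlap in time, i.e. their motions must be composed."""
    return a.start < b.end and b.start < a.end

# A timeline of multiple prompts with defined durations and overlaps:
timeline = [
    PromptSpan("walk in a circle", track=0, start=0.0, end=6.0),
    PromptSpan("wave with the right hand", track=1, start=2.0, end=4.0),
]

print(overlapping(timeline[0], timeline[1]))  # prints True: the wave overlaps the walk
```

Spans on different tracks that overlap in time are exactly the cases where a single text prompt falls short and a timeline-conditioned model like STMC is useful.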
Real3D-Portrait is a one-shot 3D talking portrait generation method. This one is able to generate realistic videos with natural torso movement and switchable backgrounds.
InstantID is an ID-embedding-based method that can personalize images in various styles using just a single facial image, while ensuring high fidelity.
FlexGen can generate high-quality, multi-view images from a single-view image or text prompt. It lets users change unseen areas and adjust material properties like metallic and roughness, improving control over the final image.
FMA-Net can turn blurry, low-quality videos into clear, high-quality ones. It accurately models the degradation and restoration processes while accounting for movement in the video by learning its motion patterns.
MagicDriveDiT can generate high-resolution street scene videos for self-driving cars.
Audio2Photoreal can generate full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, the model is able to output multiple possibilities of gestural motion for an individual, including face, body, and hands. The results are highly photorealistic avatars that can express crucial nuances in gestures such as sneers and smirks.
MoonShot is a video generation model that can condition on both image and text inputs. The model is also able to integrate with pre-trained image ControlNet modules for geometry visual conditions, making it possible to generate videos with specific visual appearances and structures.
SIGNeRF is a new approach for fast and controllable NeRF scene editing and scene-integrated object generation. The method can insert new objects into an existing NeRF scene or edit existing objects within the scene in a controllable manner, via either proxy object placement or shape selection.
En3D can generate high-quality 3D human avatars from 2D images without needing existing assets.
Auffusion is a Text-to-Audio system that can generate audio from natural language prompts. The model can control various aspects of the audio, such as acoustic environment, material, pitch, and temporal order. It can also generate audio based on labels or be combined with an LLM that produces descriptive audio prompts.
DreamGaussian4D can generate animated 3D meshes from a single image. The method can generate diverse motions for the same static model, and does so in 4.5 minutes rather than the several hours required by other methods.
Spacetime Gaussian Feature Splatting is a novel dynamic scene representation that is able to capture static, dynamic, as well as transient content within a scene and can render them at 8K resolution and 60 FPS on an RTX 4090.
PIA is a method that can animate images generated by custom Stable Diffusion checkpoints with realistic motions based on a text prompt.
RelightableAvatar is another method that can create relightable and animatable neural avatars from monocular video.
Intrinsic Image Diffusion can generate detailed albedo, roughness, and metallic maps from a single indoor scene image.
HAAR can generate realistic 3D hairstyles from text prompts. It uses 3D hair strands to create detailed hair structures and allows for physics-based rendering and simulation.