  • 3D: WE-GS, NPGA, Text-Mesh-Refinement, Diff3DS, MultiPly, PuzzleFusion++, VividDream, GenWarp, ID-to-3D, PuTT, SuperGaussian, Unique3D, DIRECT-3D, Ouroboros3D, GECO, E3Gen, Physics3D, EASI-Tex
  • 4D: Topo4D, Vidu4D, Sync4D, 4Diffusion
  • Motion: MotionLLM, MoverseAI, Multi-Motion
  • Image: BitsFusion, Packing Collage, L-MAGIC, pOps, MultiEdits, AnyFit, Flash Diffusion, Phased Consistency Model, BIRD, Stable-Pose, SketchDeco
  • Video: ToonCrafter, CV-VAE, StreamV2V, Human4DiT, UniAnimate, MotionFollower, T2V-Turbo, SF-V, Follow-Your-Emoji, MOFA-Video, InstructAvatar
  • and more!

News & Papers


WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections

WE-GS can reconstruct high-quality 3D Gaussian Splatting scenes that support dynamic lighting conditions from photo collections.

WE-GS example

NPGA: Neural Parametric Gaussian Avatars

NPGA can create high-fidelity, controllable avatars from multi-view video recordings and animate the avatars using a single image or video as input.

NPGA examples

Text-guided Controllable Mesh Refinement for Interactive 3D Modeling

Text-Mesh-Refinement can add geometric details to a coarse 3D mesh input with a text prompt. It first generates an image and then optimizes the mesh to produce fine, detailed geometry as output.

Text-Mesh-Refinement examples

Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering

Diff3DS can generate view-consistent 3D sketches from text or images.

Diff3DS flamingo examples

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

MultiPly can reconstruct multiple people in 3D from monocular in-the-wild videos. The results are pretty good and the method is able to handle occlusions and interactions between people.

MultiPly example

PuzzleFusion++: Auto-agglomerative 3D Fracture Assembly by Denoise and Verify

PuzzleFusion++ is a new 3D fracture assembly method. It can take the fragments of a broken 3D object and automatically align and merge them into a single object.

PuzzleFusion++ examples

VividDream: Generating 3D Scene with Ambient Dynamics

VividDream can generate explorable 4D scenes with ambient dynamics from a single image or text prompt. The method first expands an input image into a static 3D point cloud and then generates an ensemble of animated videos using video diffusion models. The resulting 4D scene enables free-view exploration of a 3D scene with plausible ambient scene dynamics.

VividDream example

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

GenWarp can generate novel views from a single input image while preserving the semantics of that image. It also works with heavily stylized images.

GenWarp example

ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling

ID-to-3D can generate personalized 3D human heads from a single image of a subject. It can accurately reconstruct not only facial features but also accessories and hair, which can be meshed to provide render-ready assets.

ID-to-3D examples

Coarse-To-Fine Tensor Trains for Compact Visual Representations

PuTT is able to optimize the highly compact tensor train representation, making it possible to use it for image fitting, 3D fitting, and novel view synthesis.

Zooming in on the “Girl With Pearl Earrings” image reconstructed at 16k resolution using the PuTT method

SuperGaussian: Repurposing Video Models for 3D Super Resolution

SuperGaussian can upsample 3D models by adding geometric and appearance details by repurposing existing video models for 3D super-resolution.

SuperGaussian example

Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

Unique3D is yet another image-to-3D method. This one is able to generate high-quality 3D meshes with intricate textures and complex geometries from a single image.

Unique3D example

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

DIRECT-3D can generate high-quality 3D objects from text prompts with accurate geometric details and various textures in 12 seconds on a single V100.

DIRECT-3D examples

Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

Ouroboros3D is another image-to-3D method that is able to generate high-quality 3D objects from a single image.

Ouroboros3D examples

GECO: Generative Image-to-3D within a SECOnd

GECO can generate 3D objects from a single image in less than a second.

GECO examples

E3Gen: Efficient, Expressive and Editable Avatars Generation

E3Gen can generate diverse and expressive 3D avatars with full-body pose control and editing.

E3Gen examples

Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Physics3D can simulate a wide range of materials with high fidelity. It predicts the physical properties of materials and incorporates them into the behavior prediction process.

Physics3D examples

EASI-Tex: Edge-Aware Mesh Texturing from Single Image

EASI-Tex can texture 3D objects with the details of a single image while respecting their geometry.

EASI-Tex examples


Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture

Topo4D is a new method for 4D head capture that can generate high-quality dynamic facial meshes and 8K textures from videos.

Topo4D example

Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

Vidu4D can reconstruct high-fidelity 4D representations from a single generated video. The method is able to capture motion and deformation over time and preserves fine-grained appearance details.

Vidu4D example

Sync4D: Video Guided Controllable Dynamics for Physics-Based 4D Generation

Sync4D can transfer the motion of objects from reference videos to a variety of generated 3D Gaussians! It supports diverse reference inputs including humans, quadrupeds, and articulated objects.

Sync4D examples

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

4Diffusion can generate high-quality 4D scenes from a single video.

4Diffusion examples


MotionLLM: Multimodal Motion-Language Learning with Large Language Models

MotionLLM can generate single-human and multi-human motions, as well as motion captions, by fine-tuning pre-trained LLMs.

a person is doing rope skipping exercise in the park

Towards Practical Single-shot Motion Synthesis

MoverseAI can mix and compose motions with a single forward pass and is up to 6.8x faster to train than other methods.

MoverseAI Single-Motion Variations

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Multi-Motion can reconstruct natural and diverse group motions of multiple humans from a video input and textual descriptions.

Multi-Motion example


BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

BitsFusion is a new weight quantization method that can quantize the UNet from Stable Diffusion v1.5 to 1.99 bits, achieving a model with 7.9X smaller size (1.72GB vs 219MB) while exhibiting even better generation quality than the original one.

Top: full-precision Stable Diffusion v1.5. Bottom: 1.99 bits BitsFusion.
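
The size figures quoted for BitsFusion can be sanity-checked with a little arithmetic. The sketch below is a rough back-of-the-envelope check, not anything from the paper: it assumes the full-precision checkpoint stores 16 bits per weight and that GB and MB are decimal units, and the helper function names are made up for illustration.

```python
# Back-of-the-envelope check of the sizes quoted above.
# Assumptions (not from the source): the full-precision checkpoint stores
# 16 bits per weight, and GB/MB are decimal units (1 GB = 1000 MB).

FULL_SIZE_MB = 1.72 * 1000   # 1.72 GB, as quoted
QUANT_SIZE_MB = 219.0        # 219 MB, as quoted
FULL_BITS = 16               # assumed storage precision
QUANT_BITS = 1.99            # average bits per weight, as quoted


def compression_ratio(full_mb, quant_mb):
    """How many times smaller the quantized model is."""
    return full_mb / quant_mb


def implied_param_count(size_mb, bits_per_weight):
    """Number of weights implied by a file size and a bit width."""
    return size_mb * 1e6 * 8 / bits_per_weight


def ideal_size_mb(n_params, bits_per_weight):
    """Size in MB if every weight took exactly bits_per_weight bits."""
    return n_params * bits_per_weight / 8 / 1e6


ratio = compression_ratio(FULL_SIZE_MB, QUANT_SIZE_MB)
n_params = implied_param_count(FULL_SIZE_MB, FULL_BITS)
ideal_mb = ideal_size_mb(n_params, QUANT_BITS)

print(f"size ratio: {ratio:.2f}x")
print(f"implied parameters: {n_params / 1e6:.0f}M")
print(f"ideal size at {QUANT_BITS} bits: {ideal_mb:.1f} MB")
```

Under these assumptions the ratio comes out at roughly 7.85x, which rounds to the quoted 7.9X, and the ideal 1.99-bit size lands a few MB below the quoted 219 MB. The remaining gap is presumably storage overhead, though that is a guess.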

A Versatile Collage Visualization Technique

Packing Collage can pack geometric elements into a given shape. The method is highly efficient and can easily accommodate various loss functions, making it suitable for various visualization applications.

Packing Collage example

L-MAGIC: Language Model Assisted Generation of Images with Coherence

L-MAGIC can generate 360-degree panoramic scenes from a single input image and a text prompt. The method is able to diffuse multiple coherent views of the scene and can also accept other input modalities, such as depth maps, sketches, and colored scripts.

Image to panorama (the input is inside the bounding box).

pOps: Photo-Inspired Diffusion Operators

pOps can learn specific semantic operators directly on CLIP image embeddings. Each pOps operator is built upon a pretrained Diffusion Prior and can be used to apply a variety of photo-inspired effects to images.

pOps scene operator examples

MultiEdits: Simultaneous Multi-Aspect Editing with Text-to-Image Diffusion Models

MultiEdits can make simultaneous edits across multiple objects or attributes given a single text prompt.

MultiEdits examples

AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario

AnyFit is a virtual try-on method that can generate high-fidelity and robust fitting images across various scenarios.

AnyFit examples

Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

Flash Diffusion can generate high-quality images with as few as 5 steps and is compatible with various tasks such as text-to-image, inpainting, face-swapping, and super-resolution.

Flash Diffusion in action

Phased Consistency Model

PCM is a new consistency model that is specifically designed for multi-step image and video generation. It can generate high-resolution images and videos with up to 16 steps and achieves superior or comparable 1-step generation results to previous methods like LCM.

PCM comparison with previous methods

Blind Image Restoration via Fast Diffusion Inversion

BIRD can restore images from Gaussian blur, motion blur, and JPEG compression artifacts.

BIRD example

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Stable-Pose is a new method for pose-guided text-to-image generation that outperforms ControlNet.

Stable-Pose comparison with other methods

SketchDeco: Decorating B&W Sketches with Colour

SketchDeco can turn black and white sketches, masks, and colour palettes into realistic images without a user-defined text prompt.

SketchDeco examples


ToonCrafter: Generative Cartoon Interpolation

ToonCrafter can generate in-between frames for animations and allows users to control the interpolation process by providing images of keyframes.

This video was generated from the first and last frames of the animation. Check out the project page above for more examples.

CV-VAE: A Compatible Video VAE for Latent Generative Video Models

CV-VAE is a compatible video VAE for latent generative video models. With it, existing video models can generate four times more frames with minimal finetuning.

CV-VAE examples

Looking Backward: Streaming Video-to-Video Translation with Feature Banks

StreamV2V is a new video-to-video method that can translate videos in real time with user prompts. The method runs at 20 FPS on a single A100 GPU while maintaining temporal consistency. It also works great for text-to-image streaming.

StreamV2V helps with smoother transitions

Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer

Human4DiT can generate high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints.

Human4DiT examples

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

UniAnimate can animate a single image with a sequence of desired movement poses and is able to generate highly consistent videos with a length of up to one minute.

UniAnimate examples

MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion

MotionFollower can edit video motion while preserving the original protagonist’s appearance and background.

MotionFollower example

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

T2V-Turbo is a new video consistency model that can generate videos from text in just 4 steps.

T2V-Turbo example

SF-V: Single Forward Video Generation Model

SF-V is a single-step video generation model that can be used to generate high-quality videos with both temporal and spatial dependencies. The model is able to achieve real-time video synthesis and editing.

SF-V example

Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

Follow-Your-Emoji can animate a reference portrait with target landmark sequences. The method is able to control the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals.

Follow-Your-Emoji examples

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

MOFA-Video can generate videos from a single image using various additional controllable signals (such as human landmark references, manual trajectories, and even another provided video) or their combinations.

MOFA-Video example

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

InstructAvatar can generate emotionally expressive 2D avatars from an image and a text prompt. The model is able to control the emotion as well as the facial motion of avatars.

InstructAvatar example

Also interesting

  • Part123: Part-aware 3D Reconstruction from a Single-view Image
  • RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting
  • RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives
  • DiffCut: Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features
  • MeshVPR: Citywide Visual Place Recognition Using 3D Meshes
  • Matching Anything by Segmenting Anything
  • Frieren: Efficient Video-to-Audio Generation with Rectified Flow Matching

