Audio AI Tools
Free audio AI tools for sound design, music composition, and voice synthesis, helping creatives produce unique audio experiences effortlessly.
UniMuMo can generate outputs across text, music, and motion. It achieves this by aligning unpaired music and motion data based on rhythmic patterns.
AudioEditor can edit audio by adding, deleting, and replacing segments while keeping unedited parts intact. It uses a pretrained diffusion model with methods like Null-text Inversion and EOT-suppression to ensure high-quality results.
STA-V2A can generate high-quality audio from videos by extracting important features and using text for guidance. It uses a Latent Diffusion Model for audio creation and a new metric called Audio-Audio Align to measure how well the audio matches the video timing.
UniTalker can create 3D face animations from speech input! It outperforms other tools, with fewer lip-movement errors, and holds up well on data it hasn't seen before.
MusiConGen can generate music tracks with precise control over rhythm and chords. It allows users to define musical features through symbolic chord sequences, BPM, and text prompts.
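The control interface reduces to three signals. Below is a schematic of that conditioning input, with field names invented purely for illustration (the actual entry point is in the MusiConGen repo, which builds on Meta's MusicGen):

```python
# Schematic of MusiConGen-style conditioning: text, BPM, and symbolic chords.
# These field names are illustrative, not the repo's actual API.
condition = {
    "text": "laid-back funk with warm electric piano and tight drums",
    "bpm": 96,
    # (chord symbol, start beat) pairs covering the clip.
    "chords": [("C:maj7", 0), ("A:min7", 4), ("D:min7", 8), ("G:7", 12)],
}
```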
Stable Audio Open can generate up to 47 seconds of stereo audio at 44.1kHz from text prompts. It uses a transformer-based diffusion model for high-quality sound, making it useful for artists and researchers.
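The weights are on the Hugging Face Hub and can be driven through the diffusers StableAudioPipeline. A minimal sketch, with parameter names as in recent diffusers docs (verify against your installed version):

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the pretrained Stable Audio Open weights from the Hugging Face Hub.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate a stereo clip (up to 47 s at 44.1 kHz) from a text prompt.
audio = pipe(
    prompt="warm analog synth arpeggio with soft rain in the background",
    negative_prompt="low quality",
    num_inference_steps=100,
    audio_end_in_s=10.0,  # requested clip length in seconds
    generator=torch.Generator("cuda").manual_seed(0),
).audios[0]

# The pipeline returns a (channels, samples) tensor; write it as a WAV file.
sf.write("output.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```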
PicoAudio is a temporally controlled audio generation framework. The model can generate audio with precise control over event timestamps and occurrence frequency.
FoleyCrafter can generate high-quality sound effects for videos! Results aim to be semantically relevant and temporally synchronized with the video. It also supports text prompts for finer control over video-to-audio generation.
Images that Sound can generate spectrograms that look like natural images and produce matching audio when played. It uses pre-trained diffusion models to create these spectrograms based on specific audio and visual prompts.
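The paper's diffusion pipeline aside, the underlying spectrogram-audio duality is easy to try: any grayscale image can be read as a mel-magnitude spectrogram and inverted to a waveform with Griffin-Lim. A toy sketch with librosa (the file name and scaling are arbitrary assumptions, and the result will sound noisy rather than natural):

```python
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

SR = 22050  # assumed sample rate for the synthesized audio

# Treat a grayscale image as a mel spectrogram: rows = mel bins
# (low frequencies at the bottom), columns = time frames.
img = np.asarray(Image.open("picture.png").convert("L"), dtype=np.float32)
mel = np.ascontiguousarray(np.flipud(img)) / 255.0  # flip so the image is upright

# Invert the mel spectrogram to a waveform with Griffin-Lim phase recovery.
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=SR, n_fft=2048, hop_length=512, n_iter=32
)
sf.write("image_as_sound.wav", audio, SR)
```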
SVA can generate sound effects and background music for videos based on a single key frame and a text prompt.
SongComposer can generate both lyrics and melodies using symbolic song representations. It aligns lyrics and melodies precisely and outperforms advanced models like GPT-4 in creating songs.
SCG can be used by musicians to compose and improvise new piano pieces. It lets them guide generation with rules, such as following a simple I-V chord progression in C major. Pretty cool.
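To give a flavor of what such a rule means, a chord-progression constraint is just symbolic data. A toy sketch writing an I-V loop in C major to MIDI with pretty_midi (not SCG's actual rule format, which lives in the paper's codebase):

```python
import pretty_midi

# Encode a I-V progression in C major: C major (C-E-G) then
# G major (G-B-D), one bar each at 120 BPM (2 s per bar).
CHORDS = [(60, 64, 67), (55, 59, 62)]  # I, V as MIDI pitches
BAR = 2.0  # bar length in seconds

midi = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
for bar, chord in enumerate(CHORDS * 2):   # loop the progression twice
    for pitch in chord:
        piano.notes.append(pretty_midi.Note(
            velocity=80, pitch=pitch, start=bar * BAR, end=(bar + 1) * BAR
        ))
midi.instruments.append(piano)
midi.write("i_v_progression.mid")
```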
AudioEditing covers two methods for editing audio. The first enables text-based edits, while the second discovers semantically meaningful editing directions without supervision.
Auffusion is a text-to-audio system that generates audio from natural language prompts. The model can control various aspects of the audio, such as acoustic environment, material, pitch, and temporal order. It can also generate audio from labels or be paired with an LLM that writes descriptive audio prompts.
AudioLDM 2 can generate high-quality audio across tasks like text-to-audio and image-to-audio. It uses a shared "language of audio" representation learned with self-supervised pretraining, which gives it strong results on standard benchmarks.
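The text-to-audio path is exposed through the diffusers AudioLDM2Pipeline. A minimal sketch, using the cvssp/audioldm2 checkpoint as documented for diffusers (check the current docs for exact defaults):

```python
import torch
from diffusers import AudioLDM2Pipeline
from scipy.io import wavfile

# Load the AudioLDM 2 base checkpoint from the Hugging Face Hub.
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate a 10-second clip from a text prompt.
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# The pipeline returns a mono waveform at 16 kHz.
wavfile.write("techno.wav", rate=16000, data=audio)
```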
AudioSep can separate audio events and musical instruments while enhancing speech using natural language queries. It performs well in open-domain audio source separation, significantly surpassing previous models.
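Usage follows the pattern in the AudioSep repo README; the sketch below reproduces it from memory, so treat the helper names, config, and checkpoint paths as assumptions to check against the current repo:

```python
import torch
from pipeline import build_audiosep, separate_audio  # helpers from the AudioSep repo

device = torch.device("cuda")

# Paths are placeholders taken from the repo README; adjust to your checkout.
model = build_audiosep(
    config_yaml="config/audiosep_base.yaml",
    checkpoint_path="checkpoint/audiosep_base_4M_steps.ckpt",
    device=device,
)

# Describe the source to isolate in natural language.
separate_audio(model, "mixture.wav", "water drops", "separated_audio.wav", device)
```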
LP-MusicCaps can generate high-quality music captions using large language models (LLMs).
WavJourney is a system that uses large language models to generate audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. The demo results, while not perfect, sound great.
CoMoSpeech can synthesize speech and singing voices in one step with high audio quality. It runs over 150 times faster than real-time on a single NVIDIA A100 GPU, making it practical for text-to-speech and singing applications.
Msanii can create high-quality music tracks up to 190 seconds long at a sample rate of 44.1 kHz. It uses a diffusion-based method that pairs mel spectrograms with a neural vocoder, enabling audio-to-audio style transfer and smooth interpolation between audio samples.