Audio AI Tools
Free audio AI tools for sound design, music composition, and voice synthesis, helping creatives produce unique audio experiences effortlessly.
Hear-Your-Click can generate sounds for specific objects in a video when users click on them. It tightens the link between sound and visuals, producing audio that precisely matches the selected object.
METEOR can generate orchestral music while allowing control over the texture of the accompaniment. It achieves high-quality music style transfer and lets users adjust melodies and textures at the bar and track levels.
ThinkSound can generate sound for a video, guided either by a caption or by Chain-of-Thought reasoning.
OmniSep can isolate clean soundtracks from mixed audio using text, images, and audio queries.
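Query-conditioned separation of this kind is usually framed as predicting a time-frequency mask from the mixture and an embedding of the query. The sketch below is a minimal illustration of that framing under stated assumptions, not OmniSep's actual model: the encoders are random stand-ins, and `embed_query` / `mask_from_query` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 32

def embed_query(query_features: np.ndarray) -> np.ndarray:
    """Stand-in query encoder: projects text/image/audio features into a
    shared embedding space (random projection, for illustration only)."""
    proj = rng.standard_normal((EMB_DIM, query_features.shape[-1]))
    emb = proj @ query_features
    return emb / (np.linalg.norm(emb) + 1e-8)

def mask_from_query(mixture_spec: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """Predict a soft time-frequency mask by comparing per-bin embeddings
    with the query embedding. A real model would learn the bin embeddings."""
    freq, time = mixture_spec.shape
    bin_emb = rng.standard_normal((freq, time, EMB_DIM))
    bin_emb /= np.linalg.norm(bin_emb, axis=-1, keepdims=True)
    similarity = bin_emb @ query_emb                 # cosine similarity per bin
    return 1.0 / (1.0 + np.exp(-5.0 * similarity))   # sigmoid -> soft mask

# Toy usage: estimate the queried source from a fake magnitude spectrogram.
mixture = np.abs(rng.standard_normal((64, 100)))
query = rng.standard_normal(128)                     # fake query features
mask = mask_from_query(mixture, embed_query(query))
target_estimate = mask * mixture
print(target_estimate.shape)  # (64, 100)
```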
AudioX can generate high-quality audio and music from text, video, images, and existing audio.
MelQCD can create realistic audio tracks that match silent videos. It achieves high quality and synchronization by breaking down mel-spectrograms into different signal types and using a video-to-all (V2X) predictor.
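MelQCD's exact decomposition is not reproduced here; the toy below only illustrates the general idea of splitting a mel-spectrogram into slowly varying per-band signals (an energy envelope) plus residual detail, components that a video-conditioned predictor could then target separately.

```python
import numpy as np

def decompose_mel(mel: np.ndarray, win: int = 9):
    """Split a mel-spectrogram (bands x frames) into a smooth per-band
    envelope and a residual. A toy stand-in for MelQCD's decomposition."""
    kernel = np.ones(win) / win
    # Moving-average envelope along time for each mel band.
    envelope = np.stack([np.convolve(band, kernel, mode="same") for band in mel])
    residual = mel - envelope
    return envelope, residual

# Toy usage with a fake mel-spectrogram.
rng = np.random.default_rng(0)
mel = np.abs(rng.standard_normal((80, 200)))
env, res = decompose_mel(mel)
print(np.allclose(env + res, mel))  # True: the decomposition is lossless
```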
AnCoGen can analyze and generate speech by estimating key attributes like speaker identity, pitch, and loudness. It can also perform tasks such as speech denoising, pitch shifting, and voice conversion using a unified masked autoencoder model.
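AnCoGen itself relies on a masked autoencoder over speech representations; the snippet below is only a toy analysis-resynthesis loop in the same spirit, estimating pitch and loudness with classical signal processing and regenerating a tone from those attributes.

```python
import numpy as np

SR = 16000

def analyze(wave: np.ndarray) -> dict:
    """Estimate coarse attributes: loudness (RMS in dB) and pitch (autocorrelation)."""
    rms_db = 20 * np.log10(np.sqrt(np.mean(wave ** 2)) + 1e-9)
    ac = np.correlate(wave, wave, mode="full")[len(wave) - 1:]  # ac[k] = lag k
    lo, hi = SR // 400, SR // 60                                # search 60-400 Hz
    lag = lo + np.argmax(ac[lo:hi])
    return {"loudness_db": rms_db, "pitch_hz": SR / lag}

def generate(attrs: dict, duration: float = 1.0) -> np.ndarray:
    """Resynthesize a tone whose amplitude and frequency follow the estimated attributes."""
    t = np.arange(int(SR * duration)) / SR
    amp = 10 ** (attrs["loudness_db"] / 20)
    return amp * np.sin(2 * np.pi * attrs["pitch_hz"] * t)

# Toy usage: analyze a 150 Hz tone, then regenerate it from its attributes.
t = np.arange(SR) / SR
source = 0.3 * np.sin(2 * np.pi * 150 * t)
attrs = analyze(source)
print(attrs)                 # pitch_hz close to 150, loudness_db around -13.5
resynth = generate(attrs)
```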
Spark-TTS can generate customizable voices with control over gender, speaking style, pitch, and rate. It also supports zero-shot voice cloning, allowing smooth language transitions without extra training for each voice.
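The control surface described above can be pictured as a small set of voice attributes passed alongside the text. The dataclass below is purely hypothetical; the field names and value ranges are illustrative and do not reflect Spark-TTS's real API.

```python
from dataclasses import dataclass

@dataclass
class VoiceControls:
    """Hypothetical control set mirroring the attributes Spark-TTS exposes;
    names and allowed values here are assumptions, not the real interface."""
    gender: str = "female"         # e.g. "female" | "male"
    style: str = "conversational"  # speaking-style label
    pitch: str = "medium"          # coarse pitch level: low / medium / high
    rate: str = "fast"             # coarse speaking-rate level

request = {"text": "Hello there!", "controls": VoiceControls()}
print(request["controls"])
```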
NotaGen can generate high-quality classical sheet music.
SongGen can generate both vocals and accompaniment from text prompts using a single-stage auto-regressive transformer. It allows users to control lyrics, genre, mood, and instrumentation, and offers mixed mode for combined tracks or dual-track mode for separate tracks.
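SongGen's mixed and dual-track modes amount to different token layouts fed to a single autoregressive transformer. The sketch below uses made-up integer tokens to show the two layouts; a real system would use neural audio codec tokens, not these placeholder IDs.

```python
# Illustrative token layouts for a single-stage autoregressive song model.
# Tokens are placeholder integers standing in for codec tokens.
vocal_tokens  = [101, 102, 103, 104]
accomp_tokens = [201, 202, 203, 204]
mixed_tokens  = [301, 302, 303, 304]  # tokens of the already-mixed audio

# Mixed mode: one stream encoding the combined (vocals + accompaniment) audio.
mixed_mode_sequence = mixed_tokens

# Dual-track mode: vocals and accompaniment kept separate, here interleaved
# frame by frame so one transformer can still predict both jointly.
dual_track_sequence = [t for pair in zip(vocal_tokens, accomp_tokens) for t in pair]

print(mixed_mode_sequence)   # [301, 302, 303, 304]
print(dual_track_sequence)   # [101, 201, 102, 202, 103, 203, 104, 204]
```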
PeriodWave can generate high-quality speech waveforms by capturing repeating sound patterns. It uses a period-aware flow matching estimator to outperform other models in text-to-speech tasks and Mel-spectrogram reconstruction.
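The flow-matching part of that description is easy to illustrate in isolation: train a network to predict the constant velocity that carries noise to data along a straight path, then integrate that velocity field to sample. The minimal PyTorch sketch below shows this generic objective on toy 1-D signals; PeriodWave's period-aware estimator and Mel-spectrogram conditioning are not reproduced.

```python
import math
import torch
import torch.nn as nn

def sample_data(batch: int, length: int = 64) -> torch.Tensor:
    """Toy 'waveforms': sine waves with random frequency."""
    t = torch.linspace(0, 1, length)
    freq = torch.randint(2, 6, (batch, 1)).float()
    return torch.sin(2 * math.pi * freq * t)

net = nn.Sequential(nn.Linear(64 + 1, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    x1 = sample_data(32)                    # data
    x0 = torch.randn_like(x1)               # noise
    t = torch.rand(32, 1)                   # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1              # straight-line interpolation
    target_velocity = x1 - x0               # constant velocity along the path
    pred = net(torch.cat([xt, t], dim=1))   # velocity estimate at (x_t, t)
    loss = ((pred - target_velocity) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate the learned velocity field with Euler steps.
with torch.no_grad():
    x = torch.randn(4, 64)
    for i in range(20):
        t = torch.full((4, 1), i / 20)
        x = x + (1 / 20) * net(torch.cat([x, t], dim=1))
```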
Semantic Gesticulator can generate realistic gestures to accompany speech, with the strong semantic correspondence that effective communication requires.
Yin-Yang can generate music with a clear structure and controllable melodies.
TangoFlux can generate 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU.
MMAudio can generate high-quality audio that matches video and text inputs. It excels in audio quality and synchronization, with a fast processing time of just 1.23 seconds for an 8-second clip.
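The throughput figures quoted for TangoFlux and MMAudio are easier to compare as real-time factors (generation time divided by generated audio length). The short calculation below uses only the numbers stated above.

```python
# Real-time factor (RTF) = generation time / audio duration; lower is faster.
claims = {
    "TangoFlux (A40 GPU)": (3.7, 30.0),  # 3.7 s to generate 30 s of audio
    "MMAudio":             (1.23, 8.0),  # 1.23 s to generate an 8 s clip
}
for name, (gen_time, duration) in claims.items():
    print(f"{name}: RTF = {gen_time / duration:.3f}")
# TangoFlux: ~0.123, MMAudio: ~0.154 -- both several times faster than real time.
```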
SD-Codec can separate and reconstruct audio signals from speech, music, and sound effects using a different codebook for each source type. This improves the interpretability of audio codecs and gives finer control over audio generation while maintaining high quality.
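Using a separate codebook per source type is essentially domain-specific vector quantization. The numpy toy below quantizes a latent vector against dedicated speech/music/SFX codebooks; the codebooks are random stand-ins rather than anything learned by SD-Codec.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODES = 16, 64

# One random stand-in codebook per source type (a real codec learns these).
codebooks = {name: rng.standard_normal((CODES, DIM))
             for name in ("speech", "music", "sfx")}

def quantize(latent: np.ndarray, source_type: str):
    """Snap a latent vector to the nearest codeword in the codebook
    dedicated to that source type; return the code index and codeword."""
    book = codebooks[source_type]
    idx = int(np.argmin(np.linalg.norm(book - latent, axis=1)))
    return idx, book[idx]

# Toy usage: the same latent maps to a different token in each domain codebook.
latent = rng.standard_normal(DIM)
for src in codebooks:
    idx, _ = quantize(latent, src)
    print(src, "->", idx)
```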
GenAu is a scalable transformer-based audio generation architecture that can generate high-quality ambient sounds and effects.
ReWaS can generate sound effects from text and video. The method estimates the structural information of the audio from the video while taking key content cues from a user prompt.
F5-TTS can generate natural-sounding speech using a fast text-to-speech system. It supports multiple languages, switches between them smoothly, and is trained on a 100,000-hour dataset.
UniMuMo can generate outputs across text, music, and motion. It achieves this by aligning unpaired music and motion data based on rhythmic patterns.
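Rhythm-based pairing of unpaired data can be pictured as matching the dominant periodicity of a music onset curve with that of a motion energy curve, then time-stretching one to the other. The numpy sketch below does exactly that on synthetic signals; it illustrates the pairing idea only and is not UniMuMo's alignment procedure.

```python
import numpy as np

def dominant_period(signal: np.ndarray, min_lag: int = 5) -> int:
    """Return the strongest repeating period of a 1-D activity curve
    (e.g. a music onset envelope or motion energy) via autocorrelation."""
    sig = signal - signal.mean()
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]  # ac[k] = lag k
    return min_lag + int(np.argmax(ac[min_lag:len(sig) // 2]))

def stretch_to_period(signal: np.ndarray, src_period: int, dst_period: int):
    """Linearly resample a curve so its beat period matches a target period."""
    factor = dst_period / src_period
    x_old = np.linspace(0, 1, len(signal))
    x_new = np.linspace(0, 1, int(len(signal) * factor))
    return np.interp(x_new, x_old, signal)

# Synthetic example: music beats every 20 frames, motion bounces every 25.
frames = np.arange(400)
music_onsets = (frames % 20 == 0).astype(float)
motion_energy = 0.5 + 0.5 * np.sin(2 * np.pi * frames / 25)

music_p = dominant_period(music_onsets)     # ~20
motion_p = dominant_period(motion_energy)   # ~25
aligned_motion = stretch_to_period(motion_energy, motion_p, music_p)
print(music_p, motion_p, len(aligned_motion))
```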