Audio AI Tools
Free audio AI tools for sound design, music composition, and voice synthesis, helping creatives produce unique audio experiences effortlessly.
UniMuMo can generate outputs across text, music, and motion. It achieves this by aligning unpaired music and motion data based on rhythmic patterns.
AudioEditor can edit audio by adding, deleting, and replacing segments while keeping unedited parts intact. It uses a pretrained diffusion model with methods like Null-text Inversion and EOT-suppression to ensure high-quality results.
STA-V2A can generate high-quality audio from videos by extracting important features and using text for guidance. It uses a Latent Diffusion Model for audio creation and a new metric called Audio-Audio Align to measure how well the audio matches the video timing.
UniTalker can create 3D face animations from speech input! It outperforms other tools, with fewer lip-movement errors, and holds up well on data it hasn't seen before.
MusiConGen can generate music tracks with precise control over rhythm and chords. It allows users to define musical features through symbolic chord sequences, BPM, and text prompts.
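The control interface reduces to three signals. Below is a schematic of that conditioning input, with field names invented purely for illustration (the actual entry point is in the MusiConGen repo, which builds on Meta's MusicGen):

```python
# Schematic of MusiConGen-style conditioning: text, BPM, and symbolic chords.
# These field names are illustrative, not the repo's actual API.
condition = {
    "text": "laid-back funk with warm electric piano and tight drums",
    "bpm": 96,
    # (chord symbol, start beat) pairs covering the clip.
    "chords": [("C:maj7", 0), ("A:min7", 4), ("D:min7", 8), ("G:7", 12)],
}
```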
Stable Audio Open can generate up to 47 seconds of stereo audio at 44.1kHz from text prompts. It uses a transformer-based diffusion model for high-quality sound, making it useful for artists and researchers.
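The weights are on the Hugging Face Hub and can be driven through the diffusers StableAudioPipeline. A minimal sketch, with parameter names as in recent diffusers docs (verify against your installed version):

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the pretrained Stable Audio Open weights from the Hugging Face Hub.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate a stereo clip (up to 47 s at 44.1 kHz) from a text prompt.
audio = pipe(
    prompt="warm analog synth arpeggio with soft rain in the background",
    negative_prompt="low quality",
    num_inference_steps=100,
    audio_end_in_s=10.0,  # requested clip length in seconds
    generator=torch.Generator("cuda").manual_seed(0),
).audios[0]

# The pipeline returns a (channels, samples) tensor; write it as a WAV file.
sf.write("output.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```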
PicoAudio is a temporally controlled audio generation framework. The model can generate audio with precise control over event timestamps and occurrence frequency.
FoleyCrafter can generate high-quality sound effects for videos! Results aim to be semantically relevant and temporally synchronized with the video. It also supports text prompts for finer control over video-to-audio generation.
Images that Sound can generate spectrograms that look like natural images and produce matching audio when played. It uses pre-trained diffusion models to create these spectrograms based on specific audio and visual prompts.
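The paper's diffusion pipeline aside, the underlying spectrogram-audio duality is easy to try: any grayscale image can be read as a mel-magnitude spectrogram and inverted to a waveform with Griffin-Lim. A toy sketch with librosa (the file name and scaling are arbitrary assumptions, and the result will sound noisy rather than natural):

```python
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

SR = 22050  # assumed sample rate for the synthesized audio

# Treat a grayscale image as a mel spectrogram: rows = mel bins
# (low frequencies at the bottom), columns = time frames.
img = np.asarray(Image.open("picture.png").convert("L"), dtype=np.float32)
mel = np.ascontiguousarray(np.flipud(img)) / 255.0  # flip so the image is upright

# Invert the mel spectrogram to a waveform with Griffin-Lim phase recovery.
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=SR, n_fft=2048, hop_length=512, n_iter=32
)
sf.write("image_as_sound.wav", audio, SR)
```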
SVA can generate sound effects and background music for videos based on a single key frame and a text prompt.
SongComposer can generate both lyrics and melodies using symbolic song representations. It aligns lyrics and melodies precisely and outperforms advanced models like GPT-4 in creating songs.
SCG can be used by musicians to compose and improvise new piano pieces. It lets them guide generation with rules, such as following a simple I-V chord progression in C major. Pretty cool.
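To give a flavor of what such a rule means, a chord-progression constraint is just symbolic data. A toy sketch writing an I-V loop in C major to MIDI with pretty_midi (not SCG's actual rule format, which lives in the paper's codebase):

```python
import pretty_midi

# Encode a I-V progression in C major: C major (C-E-G) then
# G major (G-B-D), one bar each at 120 BPM (2 s per bar).
CHORDS = [(60, 64, 67), (55, 59, 62)]  # I, V as MIDI pitches
BAR = 2.0  # bar length in seconds

midi = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
for bar, chord in enumerate(CHORDS * 2):   # loop the progression twice
    for pitch in chord:
        piano.notes.append(pretty_midi.Note(
            velocity=80, pitch=pitch, start=bar * BAR, end=(bar + 1) * BAR
        ))
midi.instruments.append(piano)
midi.write("i_v_progression.mid")
```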
AudioEditing covers two methods for editing audio. The first enables text-based edits, while the second discovers semantically meaningful editing directions without supervision.
Auffusion is a text-to-audio system that generates audio from natural language prompts. The model can control various aspects of the audio, such as acoustic environment, material, pitch, and temporal order. It can also generate audio from labels or be paired with an LLM that writes descriptive audio prompts.
AudioLDM 2 can generate high-quality audio across tasks like text-to-audio and image-to-audio. It uses a shared "language of audio" representation learned with self-supervised pretraining, which gives it strong results on standard benchmarks.
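The text-to-audio path is exposed through the diffusers AudioLDM2Pipeline. A minimal sketch, using the cvssp/audioldm2 checkpoint as documented for diffusers (check the current docs for exact defaults):

```python
import torch
from diffusers import AudioLDM2Pipeline
from scipy.io import wavfile

# Load the AudioLDM 2 base checkpoint from the Hugging Face Hub.
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate a 10-second clip from a text prompt.
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# The pipeline returns a mono waveform at 16 kHz.
wavfile.write("techno.wav", rate=16000, data=audio)
```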
AudioSep can separate audio events and musical instruments while enhancing speech using natural language queries. It performs well in open-domain audio source separation, significantly surpassing previous models.
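Usage follows the pattern in the AudioSep repo README; the sketch below reproduces it from memory, so treat the helper names, config, and checkpoint paths as assumptions to check against the current repo:

```python
import torch
from pipeline import build_audiosep, separate_audio  # helpers from the AudioSep repo

device = torch.device("cuda")

# Paths are placeholders taken from the repo README; adjust to your checkout.
model = build_audiosep(
    config_yaml="config/audiosep_base.yaml",
    checkpoint_path="checkpoint/audiosep_base_4M_steps.ckpt",
    device=device,
)

# Describe the source to isolate in natural language.
separate_audio(model, "mixture.wav", "water drops", "separated_audio.wav", device)
```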
LP-MusicCaps can generate high-quality music captions using large language models (LLMs).
WavJourney is a system that uses large language models to generate audio content with storylines encompassing speech, music, and sound effects, guided by text instructions. The demo results, while not perfect, sound great.
CoMoSpeech can synthesize speech and singing voices in one step with high audio quality. It runs over 150 times faster than real-time on a single NVIDIA A100 GPU, making it practical for text-to-speech and singing applications.
Msanii can create high-quality music tracks up to 190 seconds long at a sample rate of 44.1 kHz. It uses a diffusion-based method that pairs mel spectrograms with a neural vocoder, enabling audio-to-audio style transfer and smooth interpolation between audio samples.