AudioLDM: Text-to-Audio Generation with
Latent Diffusion Models

Haohe Liu*¹, Zehua Chen*², Yi Yuan¹, Xinhao Mei¹, Xubo Liu¹

Danilo Mandic², Wenwu Wang¹, Mark D. Plumbley¹

¹CVSSP, University of Surrey, Guildford, UK

²Department of EEE, Imperial College London, London, UK

*Equal Contribution

[Paper on arXiv] [Code on GitHub] [Hugging Face Space]

Abstract

Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous TTA studies have been limited in generation quality and incur high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train latent diffusion models (LDMs) with audio embeddings while providing text embeddings as the condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., Fréchet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion.

Note

AudioLDM generates text-conditional sound effects, human speech, and music.

The LDM is trained on a single GPU, without text supervision.

AudioLDM enables zero-shot text-guided audio style-transfer, inpainting, and super-resolution.
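
The second note (training without text supervision) can be written compactly. The following is a sketch in standard denoising-diffusion notation; all symbols are assumptions introduced here for illustration and do not appear on this page: z_n is the noised VAE latent at step n, ε is Gaussian noise, ε_θ is the denoising network, E^x is the CLAP audio embedding, and E^y is the CLAP text embedding.

```latex
% Training: the condition is the CLAP audio embedding E^x, so no caption is needed
\mathcal{L}(\theta) = \mathbb{E}_{z_0,\,\epsilon,\,n}
  \left\lVert \epsilon - \epsilon_\theta\!\left(z_n, n, E^{x}\right) \right\rVert_2^2

% Sampling: the CLAP text embedding E^y is swapped in as the condition
\hat{\epsilon} = \epsilon_\theta\!\left(z_n, n, E^{y}\right)
```

The swap relies on CLAP being trained contrastively, so that paired audio and text land close together in one shared embedding space.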

**Figure 1:** Overview of the AudioLDM design for text-to-audio generation (left) and text-guided audio manipulation (right). During training, latent diffusion models (LDMs) are conditioned on audio embeddings and trained in a continuous space learned by a VAE. The sampling process uses text embeddings as the condition. Given pretrained LDMs, zero-shot audio inpainting and style transfer are realized in the reverse process. The Forward Diffusion block denotes the process that corrupts data with Gaussian noise.
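
To make the Forward Diffusion block concrete, here is a minimal pure-Python sketch of corrupting a latent with Gaussian noise. The linear noise schedule, its values, the function names, and the toy four-element latent are all assumptions chosen for illustration, not the settings used by AudioLDM. The `blend_observed` helper shows one common masking approach to inpainting during the reverse process; whether it matches the authors' exact procedure is not stated on this page.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (illustrative values, an assumption)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[n] is the running product of (1 - beta_i) for i <= n."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_diffuse(z0, n, alpha_bars, rng):
    """Sample z_n = sqrt(alpha_bar_n) * z_0 + sqrt(1 - alpha_bar_n) * noise."""
    signal_scale = math.sqrt(alpha_bars[n])
    noise_scale = math.sqrt(1.0 - alpha_bars[n])
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in z0]


def blend_observed(z_generated, z_observed, keep_mask):
    """Keep observed values where keep_mask is 1, generated values elsewhere."""
    return [m * o + (1 - m) * g
            for g, o, m in zip(z_generated, z_observed, keep_mask)]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

z0 = [1.0, -0.5, 0.25, 0.0]  # toy latent; a real one would come from the VAE encoder
z_early = forward_diffuse(z0, 10, alpha_bars, rng)   # still close to z0
z_late = forward_diffuse(z0, 999, alpha_bars, rng)   # almost pure noise
```

Early steps keep most of the signal because the cumulative product stays near 1, while the final step leaves almost none of it. In a masked inpainting loop, `blend_observed` would be applied after each reverse step so that the known region stays tied to the observed audio and only the masked region is generated.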

One-minute overview for a quick taste
The following sections show more demos.

TTA Generation with Human Text Prompt

Short Samples:

A man is speaking under the water.	Two space shuttles are fighting in the space.	The sound of a steam engine.

Long Samples:

Cat gently meowing and purring in long intervels.	Peaceful and calming ambient music with singing bowl and other instruments.	Nature environmental noise with various bird vocalization, high fidelity, children playing far away and light wind.

Acoustic Environment Control:

A man is speaking in a huge room.	A man is speaking in a small room.	A man is speaking in a studio.

Material Control:

Chopping tomatos on a wooden table.	Chopping meat on a wooden table.	Chopping potatos on a metal table.

Pitch Control:

Sine wave with low pitch.	Sine wave with medium pitch.	Sine wave with high pitch.

Temporal Order Control:

A racing car is passing by and disappear.	A female is speaking followed by footstep sound.	Wooden table tapping sound followed by water pouring sound.

Label-to-Audio Generation:

A capella	Babbling	Burping, eructation

Eruption	Gasp	Fart

Unconditional Generation:

"Null"

TTA Generation with ChatGPT Text Prompt

Short Samples:

Human Question to ChatGPT	Simplified ChatGPT Answer	AudioLDM
Describe the sound of the battlefiled	Battlefield scene, continuous roar of artillery and gunfire, high fidelity, the sharp crack of bullets, the thundering explosions of bombs, and the screams of wounded soldiers.
Describe what does a pop music sound	Pop music that upbeat, catchy, and easy to listen, high fidelity, with simple melodies, electronic instruments and polished production.
Describe the sound of the ocean	The steady crashing of waves against the shore,high fidelity, the whooshing sound of water receding back into the ocean, the sound of seagulls and other coastal birds, and the distant sound of ships or boats.

Long Samples:

Human Question to ChatGPT	Simplified ChatGPT Answer	AudioLDM
Describe the sound of the outer space	Radio emissions from stars, planets, galaxies and other celestial bodies, high fidelity, as well as the sounds of solar winds and cosmic rays.
Describe what does a dance music sound with at most 30 words	Dance music with strong, upbeat tempo, and repetitive rhythms, include sub-genres like house, techno, EDM, trance, and many more.
Describe the sound of the factory	Loud and chaotic. Hum and buzz of machinery such as power tools, high fidelity. Clanking and clattering of metal parts, the whirring of motors and engines, and the beeping and alarms of various instruments.
Describe what does a scary music sound with at most 30 words	Scary music with dissonant harmonies, irregular rhythms, and unconventional use of instruments.