AudioSR: Versatile Audio Super-resolution at Scale

Haohe Liu¹, Ke Chen², Qiao Tian³, Wenwu Wang¹, Mark D. Plumbley¹

¹CVSSP, University of Surrey

²University of California San Diego

³Speech, Audio & Music Intelligence (SAMI), Bytedance

[Paper on ArXiv] [Code on GitHub] [Discord Community]

Abstract

Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods are limited both in the audio types they cover (e.g., music, speech) and in the specific bandwidth settings they can handle (e.g., 4 kHz to 8 kHz). We introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types, including sound effects, music, and speech. Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2 kHz to 16 kHz to a high-resolution audio signal at 24 kHz bandwidth with a sampling rate of 48 kHz. Extensive objective evaluation on various audio super-resolution benchmarks demonstrates the strong results achieved by the proposed model. In addition, our subjective evaluation shows that AudioSR can act as a plug-and-play module to enhance the generation quality of a wide range of audio generative models, including AudioLDM, FastSpeech2, and MusicGen.
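The bandwidth and sampling-rate figures in the abstract are tied together by the Nyquist relation (bandwidth = sampling rate / 2). The short sketch below only illustrates that arithmetic. It is not part of the AudioSR code, the function and constant names are hypothetical, and it assumes the input audio fills its whole Nyquist band, which is not always the case (a 48 kHz file that has been low-pass filtered also has a reduced bandwidth).

```python
# Illustrative helper only; names are hypothetical and not from the AudioSR codebase.

def nyquist_bandwidth_hz(sampling_rate_hz: float) -> float:
    """Return the highest frequency representable at the given sampling rate."""
    return sampling_rate_hz / 2.0


# Figures taken from the abstract.
INPUT_BANDWIDTH_RANGE_HZ = (2_000.0, 16_000.0)
OUTPUT_SAMPLING_RATE_HZ = 48_000.0


def is_supported_input(sampling_rate_hz: float) -> bool:
    """Check whether a full-band signal at this rate lies in the stated input range.

    Assumption: the signal occupies its entire Nyquist band.
    """
    low, high = INPUT_BANDWIDTH_RANGE_HZ
    return low <= nyquist_bandwidth_hz(sampling_rate_hz) <= high


# A 48 kHz output sampling rate corresponds to the 24 kHz output bandwidth.
output_bandwidth_hz = nyquist_bandwidth_hz(OUTPUT_SAMPLING_RATE_HZ)

for sr in (4_000, 8_000, 16_000, 32_000, 44_100):
    bw = nyquist_bandwidth_hz(sr)
    print(f"{sr} Hz sampling rate -> {bw:.0f} Hz bandwidth, "
          f"within stated input range: {is_supported_input(sr)}")
```

Under this assumption, the stated 2 kHz to 16 kHz input bandwidth range corresponds to full-band recordings sampled at 4 kHz to 32 kHz.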