Introducing Resemble Enhance: Open Source Speech Super Resolution AI Model

Dec 14, 2023

Open-Source AI-Powered Speech Enhancement 

In digital audio technology, the need for crystal-clear sound quality is paramount, yet achieving it has remained a persistent challenge. Background noise, distortions, and bandwidth limitations can significantly hinder clarity, comprehension, and user experience. Today, we introduce Resemble Enhance, an AI-powered model designed to transform noisy audio into clear and impactful speech. Enhance improves the overall quality of speech with two modules: a sophisticated denoiser and a state-of-the-art enhancer. If you’d like to try out Enhance right now, please click the link below. To learn more about use cases and the technology behind the model, continue reading below!

The Catalyst for Resemble Enhance

Current speech enhancement techniques are pivotal for a variety of applications, yet they often fall short when faced with the intricate challenges of modern sound environments. Existing methods can be limited, particularly when extracting clarity from a cacophony of background noise or when restoring historical recordings. The need for more sophisticated enhancement technology is evident across a spectrum of industries: podcast producers depend on high-quality audio to connect with their audience through crystal-clear narratives, the entertainment industry relies on immaculate audio tracks to create immersive experiences, and, perhaps most challenging of all, audio restoration seeks to breathe new life into archived sounds.

Dual Module Speech Enhancement

Resemble Enhance is equipped to address these diverse use cases with precision and ease. To correct these speech imperfections, we harness the power of advanced generative models for speech enhancement. Enhance not only cleanses audio of noise but also enriches its overall perceptual quality. It consists of two modules: a denoiser, which separates speech from noisy audio, and an enhancer, which further boosts perceptual quality by restoring audio distortions and extending the audio bandwidth. Both models are trained on high-quality 44.1 kHz speech data, which guarantees high-quality enhancement of your speech. Whether reviving archived audio or extracting clean speech from background music, the two models complement each other to enhance speech. The video below showcases the model enhancing a conversation between two people on a busy street.

 

Listen to the original audio at the start and compare it to the enhanced audio near the end.
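If you want to try the same two-stage pipeline on your own recordings, here is a minimal usage sketch in Python. The package name, import path, and keyword arguments (such as `nfe`, `solver`, and `lambd`) are assumptions based on the open-source repository and may differ from the release you install.

```python
# Minimal usage sketch for the open-source release. The package name,
# import path, and keyword arguments below are assumptions and may
# differ from the version you install.
import torch
import torchaudio
from resemble_enhance.enhancer.inference import denoise, enhance  # assumed API

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a noisy recording and fold it down to a mono waveform.
wav, sr = torchaudio.load("noisy_speech.wav")
wav = wav.mean(dim=0)

# Denoiser only: separate speech from background noise.
denoised, new_sr = denoise(wav, sr, device)

# Full pipeline: denoise, restore distortions, and extend bandwidth.
# `nfe`, `solver`, and `lambd` (denoiser strength) are assumed knobs.
enhanced, new_sr = enhance(wav, sr, device, nfe=64, solver="midpoint", lambd=0.9)

torchaudio.save("denoised_speech.wav", denoised.unsqueeze(0).cpu(), new_sr)
torchaudio.save("enhanced_speech.wav", enhanced.unsqueeze(0).cpu(), new_sr)
```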

Speech Enhancement: Denoiser

At the heart of Resemble Enhance lies a sophisticated denoiser. Imagine this module as a filter that meticulously separates speech from unwanted background noise. The denoiser uses a UNet model that accepts a noise-infused complex spectrogram as its input and predicts a magnitude mask and phase rotation, effectively isolating the speech from the original audio. This methodology aligns with the one delineated in the AudioSep paper [1].
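As a rough illustration of that masking step (a sketch of the general technique, not the released code), the snippet below applies a predicted magnitude mask and phase rotation to a noisy complex spectrogram; the UNet that would produce those predictions is only referenced in a comment.

```python
import torch

def apply_mask_and_phase(noisy_complex_spec: torch.Tensor,
                         mag_mask: torch.Tensor,
                         phase_rot: torch.Tensor) -> torch.Tensor:
    """Illustrative masking step: scale magnitudes and rotate phases.

    noisy_complex_spec: complex STFT of the noisy input, shape (freq, frames)
    mag_mask:           predicted magnitude mask in [0, 1], same shape
    phase_rot:          predicted phase rotation in radians, same shape
    """
    denoised_mag = noisy_complex_spec.abs() * mag_mask
    denoised_phase = noisy_complex_spec.angle() + phase_rot
    return torch.polar(denoised_mag, denoised_phase)

# In the model, `mag_mask` and `phase_rot` would come from a UNet that takes
# the noisy complex spectrogram as input (hypothetical `unet` shown here):
# mag_mask, phase_rot = unet(noisy_complex_spec)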

Resemble Enhance Denoiser: Denoised Spectrogram

Speech Enhancement: Enhancer

The enhancer is a latent conditional flow matching (CFM) model. It consists of an Implicit Rank-Minimizing Autoencoder (IRMAE) and a CFM model that predicts the latents.

Stage 1

The first stage involves an autoencoder that compresses the clean mel spectrogram $M_\text{clean}$ into a compact latent representation $Z_\text{clean}$, which is then decoded and vocoded back into a waveform. The module consists of an encoder, a decoder, and a vocoder. Both the encoder and the decoder are built from residual conv1d blocks, and the vocoder is a UnivNet that incorporates the AMP block from BigVGAN.
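As a purely structural sketch (channel widths, layer counts, and the UnivNet/BigVGAN vocoder are placeholders and do not reflect the released architecture), a stage-1 style autoencoder built from residual conv1d blocks might look like this:

```python
import torch
import torch.nn as nn

class ResidualConv1dBlock(nn.Module):
    """A basic residual 1-D convolution block (illustrative only)."""
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.GELU(),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(x)

class MelAutoencoder(nn.Module):
    """Compress a mel spectrogram to a compact latent and reconstruct it.

    The real model also vocodes the decoded mel back into a waveform with a
    UnivNet-style vocoder; that part is omitted from this sketch.
    """
    def __init__(self, n_mels: int = 128, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            ResidualConv1dBlock(hidden),
            ResidualConv1dBlock(hidden, dilation=2),
            nn.Conv1d(hidden, latent_dim, kernel_size=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(latent_dim, hidden, kernel_size=1),
            ResidualConv1dBlock(hidden),
            ResidualConv1dBlock(hidden, dilation=2),
            nn.Conv1d(hidden, n_mels, kernel_size=3, padding=1),
        )

    def forward(self, mel: torch.Tensor):
        z = self.encoder(mel)      # (batch, latent_dim, frames)
        recon = self.decoder(z)    # (batch, n_mels, frames)
        return z, recon
```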

We train this module end-to-end with the GAN-based vocoder losses, including multi-resolution STFT losses and discriminator losses, together with a mel reconstruction loss.
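For illustration, the multi-resolution STFT portion of such an objective is often implemented along these lines; the resolutions, weights, and discriminator terms used in Enhance are not shown here.

```python
import torch
import torch.nn.functional as F

def multi_res_stft_loss(pred_wav: torch.Tensor,
                        target_wav: torch.Tensor,
                        fft_sizes=(512, 1024, 2048)) -> torch.Tensor:
    """Spectral-convergence + log-magnitude loss at several STFT resolutions."""
    loss = pred_wav.new_zeros(())
    for n_fft in fft_sizes:
        hop = n_fft // 4
        window = torch.hann_window(n_fft, device=pred_wav.device)
        P = torch.stft(pred_wav, n_fft, hop, window=window, return_complex=True).abs()
        T = torch.stft(target_wav, n_fft, hop, window=window, return_complex=True).abs()
        loss = loss + (T - P).norm() / T.norm().clamp(min=1e-8)            # spectral convergence
        loss = loss + F.l1_loss(torch.log(P + 1e-7), torch.log(T + 1e-7))  # log-magnitude
    return loss

# Total generator objective (weights are illustrative):
# loss_G = mel_reconstruction_loss + multi_res_stft_loss + adversarial_loss
```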

Implicit Rank-Minimizing Autoencoder (IRMAE)

Stage 2

After completing the training of the first stage, we freeze the IRMAE and only train the latent CFM model.

The CFM model is conditioned on a blended mel $M_\text{blend} = \alpha M_\text{denoised} + (1 - \alpha) M_\text{noisy}$, derived from the noisy mel spectrogram $M_\text{noisy}$ and a denoised mel spectrogram $M_\text{denoised}$.

Here, $\alpha$ is the parameter that adjusts the strength of the denoiser. During training, we set $\alpha$ to follow a uniform distribution $\mathcal{U}(0, 1)$. During inference, the value of $\alpha$ can be controlled by the user.
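In code, the blending and the training-time sampling of $\alpha$ reduce to a few lines (a sketch; `blend_mel` is a hypothetical helper, not part of the released API):

```python
from typing import Optional
import torch

def blend_mel(mel_denoised: torch.Tensor,
              mel_noisy: torch.Tensor,
              alpha: Optional[float] = None) -> torch.Tensor:
    """M_blend = alpha * M_denoised + (1 - alpha) * M_noisy.

    alpha=None mimics training, where the strength is drawn from U(0, 1);
    at inference the caller passes an explicit denoiser-strength value.
    """
    if alpha is None:
        alpha = torch.rand(()).item()
    return alpha * mel_denoised + (1.0 - alpha) * mel_noisy
```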

We load the pre-trained denoiser and train it jointly with the latent CFM model to predict the latent representation of the clean speech.

Conditional Flow Matching Diagram

The CFM model used in our work is based on a non-causal WaveNet model. To train it, we employ the I-CFM training objective [3], which enables us to train a model that transforms an initial point $Z_0$, drawn from a predefined probability distribution $p(Z_0)$, into a point that resembles one from the target distribution, i.e. the distribution of the clean mel latents, denoted $Z_1 \sim q(Z_1)$.

The initial distribution, $p(Z_0)$, is a blend of the noisy mel latents and noise drawn from a standard Gaussian distribution. We start by taking the latent representation of a noisy mel spectrogram, $Z_\text{noisy}$, and random Gaussian noise $\epsilon \sim \mathcal{N}(0, 1)$, then select a blending parameter $\lambda \sim \mathcal{U}(0, 1)$. $Z_0$ is then computed as $Z_0 = \lambda Z_\text{noisy} + (1 - \lambda)\epsilon$. Blending noisy mel latents with Gaussian noise lets inference start from the noisy mel latents, while the added Gaussian noise ensures that the prior space is adequately supported. Below is a before-and-after look at three spectrograms featuring speech with different background audio, ranging from traffic sounds to background music.

Spectrogram Before and After Enhance
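Putting the prior and the I-CFM objective together, a schematic training step might look as follows. This is a sketch of the general technique from [3], not the released training code; `cfm_net` and the latent shapes are placeholders.

```python
import torch

def icfm_training_step(cfm_net, z_noisy, z_clean, m_blend, sigma: float = 1e-4):
    """One schematic I-CFM training step for the latent enhancer (illustrative).

    z_noisy: latents of the noisy mel, used to build the prior sample Z_0
    z_clean: latents of the clean mel, a sample Z_1 from the target distribution
    m_blend: blended mel conditioning passed to the network
    """
    # Prior sample Z_0: blend the noisy latents with standard Gaussian noise.
    eps = torch.randn_like(z_noisy)
    lam = torch.rand(z_noisy.shape[0], 1, 1, device=z_noisy.device)
    z0 = lam * z_noisy + (1.0 - lam) * eps

    # Sample a time t and a point on the straight path between Z_0 and Z_1.
    t = torch.rand(z_clean.shape[0], 1, 1, device=z_clean.device)
    z_t = t * z_clean + (1.0 - t) * z0 + sigma * torch.randn_like(z0)

    # The conditional target velocity of that straight path is Z_1 - Z_0.
    target_v = z_clean - z0

    # The network predicts the velocity, conditioned on time and M_blend.
    pred_v = cfm_net(z_t, t.flatten(), m_blend)
    return torch.mean((pred_v - target_v) ** 2)
```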

The Future of Enhance

Moving forward with the development of Resemble Enhance, our commitment lies in deploying our sophisticated AI models to elevate even the most antiquated audio—think recordings from over 75 years ago—to unparalleled clarity. Although Enhance already demonstrates remarkable robustness and adaptability, efforts to accelerate processing times are ongoing, and we are dedicated to expanding the user’s command over nuanced speech elements such as accentuation and rhythmic patterns. Keep an eye on this space for the latest advancements as we persist in pushing the boundaries of audio technology innovation.

References

[1] https://arxiv.org/abs/2308.05037
[2] https://openreview.net/forum?id=PqvMRDCJT9t
[3] https://arxiv.org/abs/2302.00482
