Generative Sound AI: How To Get Started With Text-to-Music and Text-to-Sound with AudioCraft

Aug 3, 2023


In this article, we provide a comprehensive overview of Meta AI’s open-source project AudioCraft, a suite of three models that represents a leap forward in synthetic music and sound effect generation. The three models that comprise AudioCraft are MusicGen, AudioGen, and EnCodec. We’ll also share how to get started with AudioCraft.


What is AudioCraft?

AudioCraft consists of three core generative AI models developed by Meta AI: MusicGen, AudioGen, and EnCodec. Together, these models enable the generation of music and sound effects from simple text prompts. In a similar fashion to Resemble’s text-to-speech synthesis engine, AudioCraft essentially lets you create music via ‘text-to-music’ prompts or sound effects via ‘text-to-sound’ prompts. MusicGen synthesizes music from text, AudioGen generates sound effects from text, and EnCodec compresses audio into discrete representations referred to as ‘tokens’. By training transformer language models on these compact audio tokens from EnCodec, AudioCraft is able to produce audio samples conditioned on text descriptions. The open-source release of AudioCraft provides researchers and developers with state-of-the-art foundations to advance text-to-music, sound effect generation, and similar areas of generative audio.


Text-to-Music: What is MusicGen?

MusicGen is a single language model (LM) trained on a large dataset of over 20,000 hours of licensed music. The model has learned to generate coherent musical compositions in a wide variety of genres and styles from AI text prompts. Internally, MusicGen operates over several streams of compressed, discrete music representations referred to as ‘tokens’, and it can be conditioned on descriptive text or melodic features, giving the user greater control over the synthetic music output. The sample clips below were generated by MusicGen’s text-to-music engine from the following AI text prompts.

Hip Hop Prompt

‘Uptempo hip hop beat with heavy bass, synthesized horns, crisp snares, and hi-hats. Features a simple, catchy melody and rhythmic synth patterns. Evokes a smooth, late-night vibe with a lush, ambient quality mixed with hip-hop swagger.’

Rock ‘n Roll Prompt

‘Driving classic rock track with bluesy undertones and smooth, rich vocals. Prominent acoustic and electric guitars with melodic riffs and bright, jangly tone. Steady bassline and lively drumbeat with energetic fills and cymbal crashes.’

Symphony Prompt

‘Trumpets open with a triumphant melody over rolling timpani. Violins take a sweeping counterpoint as the brass builds to a regal climax. The full orchestra unites on a powerful final chord.’

As previously mentioned, MusicGen’s single language model has the ability to cover a variety of musical genres as referenced in the prompts above.

Text-to-Sound: What is AudioGen?

Given the literal nomenclature of MusicGen, one doesn’t have to go out on a limb to guess that AudioGen involves audio generation. As previously mentioned, AudioGen specializes in generating environmental sounds and audio textures from an AI text prompt; you might call it text-to-sound. The model learns to produce sound effects in a data-driven manner by training on a diverse dataset of public sound effect samples.


EnCodec, The Neural Audio Codec

Synthesizing audio directly from raw waveforms requires modeling extremely long sequences, which quickly becomes an obstacle. EnCodec is the key component enabling high-fidelity AI-generated music and sound. This neural audio codec compresses raw audio waveforms into compact discrete representations. By training transformer language models on these audio ‘tokens’, AudioCraft can then reconstruct high-fidelity audio from sparse token sequences. EnCodec essentially acts as a compact audio ‘vocabulary’ for the generative models to learn.
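
To make the audio ‘tokens’ concrete, here is a minimal sketch using Meta’s standalone encodec package (installed separately with pip install encodec; the file path and 6 kbps bandwidth are placeholder choices) that compresses a waveform into discrete codes and reconstructs audio from them:

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model and target a 6 kbps bitrate.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Load a waveform (placeholder path) and convert it to the model's sample rate and channel count.
wav, sr = torchaudio.load('input.wav')
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)                                 # list of (codes, scale) frames
    codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)  # [batch, n_codebooks, time]
    reconstructed = model.decode(encoded_frames)                       # waveform rebuilt from the codes

print(codes.shape)  # the compact token sequences the generative models are trained on

These discrete codes, rather than raw samples, are what MusicGen and AudioGen model autoregressively.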


How To Get Started With AudioCraft

Anyone can leverage AudioCraft’s state-of-the-art models in their own projects since it is open source. The code, model weights, and documentation are all publicly available on GitHub. Developers can begin experimenting with the pretrained models or train custom models on their own datasets.

We’ll start by installing the AudioCraft library, which is required to run both MusicGen and AudioGen:

1. System Requirements and Dependencies

    • Python 3.9
    • PyTorch 2.0.0
    • ffmpeg (optional, but recommended)

2. Installation

    • Ensure Python 3.9 is installed on your system.
    • Install PyTorch if you haven’t already by running: pip install 'torch>=2.0'
    • Install AudioCraft. You have several options:
      – For a stable release, run: pip install -U audiocraft
      – For the latest (possibly unstable) version, run: pip install -U git+https://[email protected]/facebookresearch/audiocraft#egg=audiocraft
      – If you cloned the repo locally, run: pip install -e .

3. Install ffmpeg

    • On Ubuntu/Debian: sudo apt-get install ffmpeg
    • On Anaconda or Miniconda: conda install 'ffmpeg<5' -c conda-forge
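
Once the dependencies are in place, a quick sanity check like the one below can confirm that AudioCraft imports cleanly and that a GPU is visible (a minimal sketch, assuming the package exposes a __version__ attribute; exact version strings will vary by install):

import torch
import audiocraft

# Print the installed versions and confirm CUDA is available for generation.
print("audiocraft:", audiocraft.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())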

Running MusicGen

Now that we have AudioCraft installed, we can dive right into MusicGen. MusicGen comes with four pretrained models:

    • facebook/musicgen-small: a 300M-parameter model, for text-to-music generation only.
    • facebook/musicgen-medium: a 1.5B-parameter model, for text-to-music generation only.
    • facebook/musicgen-melody: a 1.5B-parameter model, for text-to-music and text+melody-to-music generation.
    • facebook/musicgen-large: a 3.3B-parameter model, for text-to-music generation only.

The best trade-off between quality and compute seems to be achieved with the facebook/musicgen-medium or facebook/musicgen-melody models. You must have a GPU to use MusicGen locally. A GPU with 16GB of memory is recommended, but smaller GPUs can generate short sequences, or longer sequences with the facebook/musicgen-small model.
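
If you’re unsure which checkpoint your hardware can handle, a quick check like the following can help (a minimal sketch; the 16GB threshold is a rule of thumb rather than a hard requirement):

import torch

# MusicGen needs a CUDA-capable GPU to run locally.
if not torch.cuda.is_available():
    raise RuntimeError("No CUDA GPU detected; MusicGen requires a GPU to run locally.")

# Rough heuristic: ~16GB comfortably fits the medium/melody checkpoints;
# smaller cards should start with facebook/musicgen-small and shorter durations.
gpu_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
model_name = 'facebook/musicgen-medium' if gpu_gb >= 16 else 'facebook/musicgen-small'
print(f"{gpu_gb:.1f} GB of GPU memory detected; suggested checkpoint: {model_name}")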

Once you know which model you want to use, here’s a Python snippet to get you started:
 

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)  # generate 8 seconds of music.
wav = model.generate_unconditional(4)    # generates 4 unconditional audio samples.
descriptions = ['happy rock', 'energetic EDM', 'sad jazz']
wav = model.generate(descriptions)  # generates 3 audio samples based on descriptions.

melody, sr = torchaudio.load('./assets/bach.mp3')
# generates music using the melody from the given audio and the provided descriptions.
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

Running AudioGen

The AudioGen API is simple to use, and AudioCraft provides a pretrained model: facebook/audiogen-medium, a 1.5B-parameter model for text-to-sound generation.

Here’s a quick Python example of how to use the AudioGen API:

import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds of sound.
wav = model.generate_unconditional(4)    # generates 4 unconditional audio samples
descriptions = ['dog barking', 'siren of an emergency vehicle', 'footsteps in a corridor']
wav = model.generate(descriptions)  # generates 3 audio samples based on descriptions.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)

This code snippet demonstrates how to load a pretrained model, set generation parameters, and then generate audio both unconditionally and from specific text descriptions. The generated audio is then saved with loudness normalization applied. This is a simple and effective way to start experimenting with text-prompted audio generation using AudioGen.

Alternatives to AudioCraft

MusicLM from Google
MusicGen and MusicLM are two leading neural text-to-music generation models. Both systems operate on discrete representations of audio and employ autoregressive modeling for high-fidelity synthesis.

MusicGen utilizes a single Transformer decoder over multiple streams of audio tokens from the EnCodec model. This architecture aims for simplicity and efficiency with a single modeling stage. In contrast, MusicLM employs a more complex hierarchical approach: it incorporates separate semantic and acoustic modeling stages, along with additional components such as a speech representation model and MuLan for conditioning. While MusicGen emphasizes a streamlined design and strong benchmark results, MusicLM explores additional conditioning approaches such as audio melody input (for example, a hummed or whistled tune).

Both models have shown the promise of token-based synthesis while pushing the envelope on sound quality. MusicGen simplified the architecture and achieved strong metrics on a public benchmark. MusicLM demonstrated high-fidelity results on free-text captions and non-text conditioning. However, neither system has fully solved the problem; challenges remain, such as mitigating exposure bias, modeling longer-range structure, and releasing public code and models. The complementary technology from MusicGen and MusicLM provides excellent foundations and directions for future research on controllable music generation with discrete token representations.

Learn more about MusicLM on Google’s Blog.


Jukebox from OpenAI

Jukebox uses a hierarchical VQ-VAE to compress raw audio into discrete tokens at multiple levels of abstraction. It then models these tokens auto-regressively using Transformer models. The highest-level prior captures long-term structure while lower levels fill in details. Jukebox can generate coherent 1-2 minute samples.

In contrast, MusicGen relies on a single Transformer decoder over multiple streams of audio tokens from a tokenizer like EnCodec. It interleaves these token streams in an efficient pattern to reduce computational cost versus flattening all streams. MusicGen generates high quality 32kHz samples and enables conditioning on both text and melodies.
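
To illustrate what interleaving the token streams looks like, here is a toy sketch of the delay-style pattern described in the MusicGen paper: each codebook stream is shifted by one additional step, so the sequence length grows only from T to T + K - 1 rather than to K * T as it would if the streams were flattened (a simplified illustration, not AudioCraft’s actual implementation):

import torch

K, T = 4, 6   # 4 codebook streams, 6 time steps of EnCodec tokens
PAD = -1      # placeholder value marking positions created by the shift
codes = torch.arange(K * T).reshape(K, T)  # stand-in values for real EnCodec codes, shape [K, T]

# Delay pattern: codebook k is shifted right by k steps, so at decoding step t the model
# predicts codebook 0 for time t, codebook 1 for time t-1, and so on.
delayed = torch.full((K, T + K - 1), PAD)
for k in range(K):
    delayed[k, k:k + T] = codes[k]

print(delayed)  # shape [4, 9]: all four codebooks are covered in T + K - 1 decoding steps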

While Jukebox focuses on long-term coherence, MusicGen emphasizes controllable music generation and a simpler architecture. Both systems produce strong results, but each has limitations to address: better structure and longer-range consistency for MusicGen, and slow generation and token upsampling for Jukebox. The tradeoffs between the models show the continued challenges in high-fidelity, controllable music generation.

Learn more about Jukebox on OpenAI’s blog.


AudioLDM

MusicGen utilizes a single Transformer decoder over multiple streams of audio tokens extracted by EnCodec. This architecture directly models the audio tokens in an autoregressive fashion. In contrast, AudioLDM employs latent diffusion models trained on a compressed latent space produced by a variational autoencoder. The VAE-diffusion model pipeline is more complex but aims to improve sample quality.

MusicGen focuses on high-fidelity music generation and controllability. The model supports conditioning on both text descriptions and melodies to guide the output. AudioLDM explores new capabilities like text-guided audio manipulation, enabling zero-shot style transfer, super-resolution, and inpainting. However, its capabilities have not been evaluated extensively for music generation.

While both MusicGen and AudioLDM have demonstrated promising results, neither has fully solved the problem of high-quality, controllable music synthesis. MusicGen simplifies the architecture with a single model but loses some fidelity. AudioLDM explores new audio manipulations, but its architecture has greater complexity. Nevertheless, these two models present useful innovations in neural music synthesis through their contrasting architectures and benchmark results.

Learn more about AudioLDM on GitHub.

At Resemble AI, we’re diligently exploring all possibilities in relation to voice AI and believe that open-source projects have the ability to help spur innovation in the sector. We’re excited to see what the future of ethical AI development holds.
