Source Separation and Extensible MIR Tools for Audacity

Hello! My name is Hugo Flores Garcia. I have been selected to build source separation and other music information retrieval (MIR) tools for Audacity, as a project for Google Summer of Code. My mentor is Dmitry Vedenko from Audacity’s new development team, with Roger from the old team providing assistance.

What does source separation do, anyway?

Source separation would bring many exciting opportunities for Audacity users. The goal of audio source separation is to isolate the individual sound sources in a given mixture of sounds. For example, a saxophone player who wishes to learn the melody of a particular jazz tune can use source separation to isolate the saxophone from the rest of the band and learn the part. Likewise, a user may separate the vocal track from their favorite song to generate a karaoke track. In recent years, source separation has enabled a wide range of applications for people in the audio community, from “upmixing” vintage tracks to cleaning up podcast audio.

Source separation aims to isolate individual sounds from the rest of a mixture. It is the opposite of mixing different sounds, which can be a complex, non-linear process; that complexity is what makes separating sources a difficult problem, and one well suited to deep learning. Image used courtesy of Ethan Manilow, Prem Seetharaman, and Justin Salamon [source].

For an in-depth tutorial on the concepts behind source separation and on coding up your own source separation models in Python, I recommend this awesome ISMIR 2020 tutorial.

Project Details

This project proposes the integration of deep learning based computer audition tools into Audacity. Though the focus of this project is audio source separation, the framework will be constructed so that other desirable MIR tools, such as automatic track labeling and tagging, can later be incorporated with relative ease by reusing the same deep learning infrastructure and simply introducing new interfaces for users to interact with.

State of the art (SOTA) source separation systems are based on deep learning models. Note that individual source separation models are designed for specific audio domains, so users will have to choose different models for different tasks: for example, a speech separation model to separate human speakers, and a music separation model to separate the instruments in a song.

Moreover, there can be a tradeoff between separation quality and model size: larger models take considerably longer to separate audio. This is especially true when users perform separation without a GPU, which is our expected use case. We need to find a balance of quality and performance that suits most users. That said, we expect users to have different machines and quality requirements, and we want to support a wide variety of potential use cases.

Because we want to cater to this wide variety of source separation models, I plan on taking a modular approach to incorporating deep models into Audacity, so that different models can be swapped in and used for different purposes, as long as the program is aware of each model's input and output constraints. PyTorch's TorchScript API lets us achieve such a design, as Python models can be exported into “black box” models that can be used in C++ applications.
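To illustrate the idea, here is a minimal sketch of the TorchScript export workflow. The toy module and the file name are placeholders for illustration only, not the actual separation models this project will use:

```python
import torch

# A toy "separator": not a real source separation model, just a stand-in
# showing the export workflow. A real model (e.g. a pretrained network)
# would replace this module.
class ToySeparator(torch.nn.Module):
    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        # Pretend to split a mono mixture into two "sources" by
        # returning two scaled copies stacked along a new dimension.
        return torch.stack([mix * 0.5, mix * 0.5], dim=0)

model = ToySeparator()
scripted = torch.jit.script(model)  # compile the module to TorchScript
scripted.save("separator.pt")       # serialize it as a "black box" file

# The saved file can be loaded back in Python, or from C++ with
# torch::jit::load("separator.pt"), with no Python runtime required.
reloaded = torch.jit.load("separator.pt")
out = reloaded(torch.zeros(1, 44100))  # one second of silence at 44.1 kHz
print(out.shape)  # torch.Size([2, 1, 44100]) — two "sources"
```

Because the host application only needs to know the model's input and output shapes, any model exported this way can be dropped into the same loading code, which is what makes the swappable, modular design possible.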

With a modular deep learning framework incorporated into Audacity, staying up to date with SOTA source separation models is simple. I plan to work closely with the Northwestern University Source Separation Library (nussl), which is developed and maintained by scientists with a strong presence in the source separation research community.  Creating a bridge between pretrained nussl models and the Audacity source separation interface ensures that users will always have access to the latest models in source separation research. 

Another advantage of this modular design is that it lays all the groundwork necessary for the incorporation of other deep learning-based systems in Audacity, such as speech recognition and automatic audio labeling!

I believe the next generation of audio processing tools will be powered by deep learning. I am excited to introduce the users of the world’s biggest free and open source audio editor to this new class of tools, and look forward to seeing what people around the world will do with audio source separation in Audacity! You can look at my original project proposal here, and keep track of my progress on my fork of Audacity on GitHub. 


My name is Hugo Flores García. Born and raised in Honduras, I’m a doctoral student at Northwestern University and a member of the Interactive Audio Lab. My research interests include sound event detection, audio source separation, and designing accessible music production and creation interfaces.