Source Separation – GSoC 2021 Week 3

Hi all! Here are this week’s updates: 

I spent the past week refactoring so that there’s a generic EffectDeepLearning that we can inherit from to use deep learning models for applications outside source separation (like audio generation or labeling). I also modified the resampling behavior. Instead of resampling the whole track (which is wasteful if we’re only processing a 10s selection in a 2hr track), resampling is now done directly on each buffer block via a TorchScript module borrowed from torchaudio. Additionally, I added a CMake script for including built-in separation models in the Audacity package.
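
To make the block-wise idea concrete, here is a minimal sketch in plain Python. It is not the torchaudio module mentioned above: it swaps in simple linear interpolation, and the names `resample_block` and `process_selection` are hypothetical. A real implementation would also need to handle the samples at block boundaries more carefully.

```python
def resample_block(block, orig_sr, target_sr):
    """Resample one buffer block by linear interpolation (illustrative only)."""
    if not block:
        return []
    n_out = max(1, round(len(block) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i, measured in input samples.
        pos = i * orig_sr / target_sr
        left = min(int(pos), len(block) - 1)
        right = min(left + 1, len(block) - 1)
        frac = pos - left
        out.append(block[left] * (1.0 - frac) + block[right] * frac)
    return out


def process_selection(samples, orig_sr, model_sr, block_size, model):
    """Run `model` on a selection block by block, resampling only each block."""
    out = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        down = resample_block(block, orig_sr, model_sr)   # track rate -> model rate
        processed = model(down)                           # hypothetical model callable
        out.extend(resample_block(processed, model_sr, orig_sr))  # back to track rate
    return out
```

Only the selected samples are ever resampled, so the cost scales with the selection length rather than the track length.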

Goals for next week

UI work

Because the separation models will be used as essentially black boxes (audio mixture in, separated sources out), I don’t think there’s much I can do for the actual effect UI, except for allowing the user to import pretrained models and displaying relevant metadata (sample rate, speech/music, and possibly an indicator of processing speed / separation quality). 

The biggest user interaction happens when the user chooses a deep learning model. The models could potentially be hosted on HuggingFace. They can range anywhere from 10MB to upwards of 200MB in size, process audio at different sample rates, and take different amounts of time to compute. We could have a dedicated page on the Audacity website or in the manual that explains how to choose and download separation models and links to specific models that are Audacity-ready. 

Ideally, we would like to offer models for different use cases. For example, a user wanting to quickly denoise a recorded lecture would be fine using a lower-quality speech separation model with an 8kHz sample rate. On the other hand, someone trying to upmix a song would probably be willing to accept a longer compute time in order to use a higher-quality music separation model with a 48kHz sample rate. 

Build work

I’ve managed to get the Audacity + deep learning build working on Linux and macOS, but not on Windows. I’ll spend some time this week looking into writing a Conan recipe for libtorch that simplifies the build process for all platforms.

GSOC 2021 with Audacity – Week 2

This week, I finished the basic prototype of the brush tool, and the original data structure designed for storing spectral data has been refactored. To support agile development, I have set up a Kraken board on GitHub Projects rather than Jira, since most of the team members are already on GitHub. The real-time progress of the project will now be traceable.

Refactoring the structure

The first design accessed the data through a static variable on SpectrumView. This has now been fixed: the data is local to each SpectrumView, meaning that the selection is stored separately for each stereo channel, and likewise for different tracks.

Link to the commit

As suggested by Paul, we are sticking to the workflow of SpectrumView::DrawClipSpectrum, so the structure has been changed from Frequency -> Time points to Time point -> Frequency Bins.

Link to the commit
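
As an illustration of the Time point -> Frequency Bins layout, here is a minimal sketch in Python (the actual structure lives in Audacity's C++ code; the class and method names below are hypothetical). Keying by time index means that code which walks the spectrogram one time column at a time can fetch all selected bins for that column in a single lookup.

```python
from collections import defaultdict


class SpectralSelection:
    """Selected cells stored as: time index -> set of frequency bins."""

    def __init__(self):
        self._bins_at = defaultdict(set)

    def add(self, time_index, freq_bin):
        self._bins_at[time_index].add(freq_bin)

    def bins_at(self, time_index):
        # .get() avoids creating empty entries for columns that were never touched.
        return sorted(self._bins_at.get(time_index, ()))

    def time_indices(self):
        return sorted(self._bins_at)
```

One such object per SpectrumView would keep the selection separate for each channel and each track.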

The missing cursor coordinates

The mouse events associated with UIHandle are not captured continuously (or frequently enough), meaning that if the user drags the mouse quickly, some of the coordinates will be missed. Consider the following graph, where the dragging speed increases from top to bottom:

An easy cheat would be to connect the last visited coordinate to the current one using wxDC::DrawLine; we can even customize the thickness with a single parameter. However, this only affects how the selected area looks, and the intermediate coordinates are still missing from the structure. Since it is impossible to capture a pixel-perfect line from the mouse events, we need an algorithm to estimate the line, ideally with customizable thickness. Bresenham’s line drawing algorithm has been chosen. It will be modified further since we expect the brush to be circular, but it is good enough for prototyping.

Link to the commit
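
For readers unfamiliar with Bresenham’s line algorithm, here is a minimal sketch of the standard integer version in Python. It is not the code from the commit; the function name is hypothetical. Given the last captured coordinate and the current one, it returns every grid point in between, which is what fills the gaps left by fast mouse drags.

```python
def bresenham_line(x0, y0, x1, y1):
    """Return the integer grid points on the line from (x0, y0) to (x1, y1)."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1   # step direction along x
    sy = 1 if y0 < y1 else -1   # step direction along y
    err = dx + dy               # running error term
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:            # error says: step in x
            err += dy
            x0 += sx
        if e2 <= dx:            # error says: step in y
            err += dx
            y0 += sy
    return points
```

For example, `bresenham_line(0, 0, 3, 0)` gives the four points from `(0, 0)` to `(3, 0)`, and the same call with the endpoints swapped gives them in reverse order.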

To be done: the UX and undo/redo

As ordinary application users, keyboard shortcuts like Ctrl+Z and Ctrl+Y have almost become muscle memory, and of course we expect similar functionality for the tool! Since the structure is new to the codebase, we cannot simply reuse ModifyState from ProjectHistory; we will need to inform the base class about copying this new structure when adding to the state history.
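
The general idea of copying a structure into a state history can be sketched as follows. This is a generic snapshot-based undo/redo stack in Python, not Audacity's ProjectHistory API; the class `StateHistory` and its methods are hypothetical. The key point is that each pushed state is a deep copy, so later edits to the live selection do not alter earlier snapshots.

```python
import copy


class StateHistory:
    """Generic snapshot history with undo/redo (illustrative only)."""

    def __init__(self, initial_state):
        self._states = [copy.deepcopy(initial_state)]
        self._index = 0

    def push(self, state):
        # Pushing after an undo discards the states that could have been redone.
        del self._states[self._index + 1:]
        self._states.append(copy.deepcopy(state))
        self._index += 1

    def undo(self):
        if self._index > 0:
            self._index -= 1
        return copy.deepcopy(self._states[self._index])

    def redo(self):
        if self._index < len(self._states) - 1:
            self._index += 1
        return copy.deepcopy(self._states[self._index])
```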

Source Separation – GSoC 2021 Week 2

Hi all! Here are this week’s updates: 

I finished writing the processing code! Right now, the Source Separation effect in Audacity is capable of loading deep learning models from disk and using them to perform source separation on a user’s selected audio. When the Source Separation effect is applied to the audio, the effect creates a new track for each separated source. Because source separation models tend to operate at lower sample rates (a common sample rate for deep learning models is 16 kHz), each output track is resampled and converted to the same sample format as the original mix track. To name the new source tracks, each source label is appended to the mix’s track name. For example, if my original track is called “mix”, and my sources are [“drums”, “bass”], the new tracks will be named “mix – drums” and “mix – bass”. Here’s a quick demo of my progress so far:
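
The naming rule can be written down in a couple of lines. This is a Python sketch of the rule as described above, not the effect's C++ code; the function name and the `SEPARATOR` constant are hypothetical.

```python
SEPARATOR = " – "  # assumed separator, copied from the example track names above


def name_source_tracks(mix_name, source_labels):
    """Build one track name per separated source: '<mix name> – <label>'."""
    return [f"{mix_name}{SEPARATOR}{label}" for label in source_labels]
```

Calling `name_source_tracks("mix", ["drums", "bass"])` produces the two names from the example.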

Goals for next week:

  • Each new separated source track should be placed below the original mix. Right now, this is not the case, as all the new tracks are written at the bottom of the tracklist. I’d like to amend this behavior so that each separated source is grouped along with its parent mix.
  • Pass multi-channel audio to deep learning models as a multi-channel tensor. Let the model author decide what to do with multiple channels of audio (e.g. downmix). Write a downmix wrapper in PyTorch. 
  • Refactor processing code. A lot of the preprocessing and postprocessing steps could be abstracted away to make a base DeepLearningEffect class that contains useful methods that make use of deep learning models for different applications (e.g. automatic labeling, sound generation). 
  • Brainstorm different ideas for making a more useful / attractive UI for the Source Separation effect. 
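
To illustrate what a downmix wrapper does, here is a minimal sketch in plain Python rather than PyTorch. It assumes audio is a list of channels, each a list of samples, and that the wrapped model takes mono audio; `downmix_to_mono` and `DownmixWrapper` are hypothetical names.

```python
def downmix_to_mono(channels):
    """Average equally long channels into a single mono channel."""
    if not channels:
        raise ValueError("expected at least one channel")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


class DownmixWrapper:
    """Accept multi-channel audio, downmix it, then call a mono-only model."""

    def __init__(self, mono_model):
        self.mono_model = mono_model

    def __call__(self, channels):
        return self.mono_model(downmix_to_mono(channels))
```

With this split, the effect always passes every channel through, and the model author opts into downmixing by wrapping the model.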

One idea for a Source Separation UI that’s been in the back of my mind is to take advantage of Audacity’s Spectral Editing Tools (see the fellow GSoC project by Edward Hui) to make a Source Separation Mask editor. This means that we would first have a deep learning model estimate a separation mask for a given spectrogram, and then let a user edit and fine-tune the spectral mask using spectral editing tools. This would let a user improve the source separation output through Photoshop-like post-processing, or even potentially give the model hints about what to separate!
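
As a rough sketch of the mask idea, assume a magnitude spectrogram and a mask are both stored as lists of rows indexed `[time][frequency]`, with mask weights between 0 and 1. The model's mask is multiplied into the spectrogram, and a user edit simply overwrites the weights of chosen cells. The function names are hypothetical.

```python
def apply_mask(spectrogram, mask):
    """Multiply each spectrogram cell by its mask weight."""
    return [
        [value * weight for value, weight in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(spectrogram, mask)
    ]


def edit_mask(mask, cells, weight):
    """Return a copy of `mask` with the given (time, freq) cells set to `weight`."""
    edited = [row[:] for row in mask]
    for time_index, freq_bin in cells:
        edited[time_index][freq_bin] = weight
    return edited
```

For instance, a user who hears part of the vocal missing could brush over those cells and set their weight back to 1.0 before the mask is applied.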

GSOC 2021 with Audacity – Week 1

This week, I met with my mentor and learned more about the rendering logic and how the different methods inherited from UIHandle work together. We have set expectations for the next few weeks, and I have also completed a prototype of the brush tool.

Work done this week:

  1. Created BrushHandle, inherited some basic functions and logic.
  2. Tried different approaches for displaying the brush trails in real-time, including rendering from the BrushHandle and SpectrumView respectively.
  3. Set up a data structure to store the mouse events and convert them into frequency-time bins (adapted automatically to different user scaling).
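
The conversion in item 3 can be sketched as below, assuming a linear frequency axis, pixel row 0 at the top of the view, and a fixed hop size and bin width (a logarithmic view would need a different frequency mapping). `SpectrogramView`, `pixel_to_bin` and all parameter names are hypothetical. Because the visible time and frequency ranges are inputs, the same pixel maps to different bins when the user zooms.

```python
from dataclasses import dataclass


@dataclass
class SpectrogramView:
    width_px: int
    height_px: int
    t_start: float   # seconds shown at the left edge
    t_end: float     # seconds shown at the right edge
    f_min: float     # Hz shown at the bottom edge
    f_max: float     # Hz shown at the top edge


def pixel_to_bin(x, y, view, hop_seconds, bin_hz):
    """Map a mouse position to a (time index, frequency bin) pair."""
    time = view.t_start + (x / view.width_px) * (view.t_end - view.t_start)
    freq = view.f_max - (y / view.height_px) * (view.f_max - view.f_min)
    return int(time // hop_seconds), int(freq // bin_hz)
```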

Next week’s goals:

  1. Change the color of the selected area, to adapt to the existing color gradient scheme
  2. Refactor the data structure of the selected area
  3. Implement new UI components, to erase or apply the editing effect
  4. Append the selection to the state history

Source Separation – GSoC 2021 Week 1

Hi all! My first week of GSoC went great. Here are some project updates:

I started prototyping some wrappers to export pretrained PyTorch source separation models for use in Audacity. The pretrained models will most likely be grabbed from Asteroid, an open source library with lots of recipes for training state-of-the-art source separation models. Most Asteroid models are TorchScript-compatible via tracing (see this pull request), and I’ve already successfully traced a couple of ConvTasNet models trained for speech separation that should be ready for Audacity. You can look at these wrapper prototypes in this Colab notebook. The idea is to ship the model with JSON-encoded metadata that we can display to the user. This way, we can inform a user of a model’s domain (e.g. speech / music), sample rate, size (larger models require more compute), and output sources (e.g. Bass, Drums, Voice, Other).
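
To give a feel for what such metadata could look like, here is a minimal sketch in Python. The field names (`domain`, `sample_rate`, `size_mb`, `labels`) are assumptions made for illustration and are not the schema used in the notebook.

```python
import json

# Hypothetical field names; the real metadata schema may differ.
REQUIRED_FIELDS = ("domain", "sample_rate", "size_mb", "labels")


def load_model_metadata(text):
    """Parse JSON metadata and check that the expected fields are present."""
    metadata = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in metadata]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    return metadata


def describe(metadata):
    """Format the metadata as a one-line summary to show to the user."""
    return "{domain} model, {sample_rate} Hz, {size_mb} MB, sources: {sources}".format(
        sources=", ".join(metadata["labels"]), **metadata
    )
```

A speech model's metadata might then read `{"domain": "speech", "sample_rate": 8000, "size_mb": 10, "labels": ["speaker1", "speaker2"]}`.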

Wrapping a model for Audacity should be straightforward for people familiar with source separation in PyTorch, and I’m planning on adding a small set of wrappers to nussl that facilitate the process, accompanied by a short tutorial. Ideally, this should encourage the research community to share their groundbreaking source separation models with the Audacity community, giving us access to the latest and greatest source separation! 🙂 

On the Audacity side, I added a CMake script that adds libtorch to Audacity. However, because CPU-only libtorch is a whopping 200MB, source separation will likely be an optional feature, which means that libtorch needs to be an optional download that can be linked at runtime. I will be in touch with my mentor Dmitry Vedenko about figuring out a way forward from there. 

I started writing code for the SourceSep effect in Audacity. So far, I’m able to load torchscript models into the effect, and display each model’s metadata.  By next week, I’d like to finish writing the processing code, where the audio data is converted from Audacity’s WaveTrack to a Torch Tensor, processed by the torchscript model, and the output is converted back to a WaveTrack.

Audacity’s Effect interface lacks the capability to write the output of an effect to new WaveTracks. This behavior is desirable for source separation, since a model that separates into 4 sources (Drums, Bass, Voice, and Other) would ideally create 4 new WaveTracks bound to the input track, one track for each source. Analysis effects (like FindClipping) already create a new label track that gets bound to the input track. I’ll dig deeper into how this is done, and see if I can extend this behavior so that a variable number of WaveTracks can be created to hold the separation output.

Goals for Next Week:

  • Finish writing the Effect Processing code so each output source is appended to the input WaveTrack. 
  • Start thinking about an approach to writing the separation output to new, multiple WaveTracks.

Audacity – Spectral editing tools introduction

Hello all, this is Edward Hui from Hong Kong. I have a strong interest in audio/signal processing and neuroscience, and I have been selected for the “spectral editing tool” project this summer, mentored by Paul Licameli. Here are links to my GitHub and LinkedIn profiles; please feel free to connect with me.

Background of spectral editing

Consider one of the most popular songs in history, Hey Jude by The Beatles, as an example. I fetched the song from YouTube as a wav file and imported it into Audacity (no CD or vinyl magic happening); the snippet is attached here.

This is the original spectrogram, using logarithmic scaling, a window size of 4096, and band-limited to [100 – 5000]Hz.

Two “ding” sounds were then added as unwanted noises at around 2900Hz; a modified snippet is attached here.

Using the spectrogram view, the above noises are visualized and easily spotted by users: they do not blend much into the original mix, and their spectral energies are usually high. Common spectral editing examples for ordinary users include removing unwanted doorbells from a voice recording and eliminating coughing from a concert recording.

In fact, there is a built-in function for handling simple spectral editing, but it is strictly limited to straight lines, making the editing not flexible enough to accommodate slightly more complicated noises with pitch variations, such as a cat’s meow during a voice recording, as in the following graph.

The basic deliverable of the project

A brush tool will be introduced as the basic deliverable of this project, making spectral editing more user-friendly and effective. Users can simply drag through the desired area, and regions with high spectral energy will be approximated and selected, as in the following graph. 

There are a few challenges involved in this project:

  1. The UI design of the tool and how it should be positioned in the existing toolbar for a better editing experience
  2. The data structure representing the brush and selected area, and which algorithm we should use to estimate the bounded points from continuous mouse positions in real time (most likely Bresenham’s algorithm or the midpoint circle algorithm, combined with a flood fill algorithm)
  3. The method of transforming the calculated area into the corresponding frequency components
  4. The combination of parameters for performing the short-time Fourier transform and its inverse after the editing, i.e. window type, FFT size, overlap ratio, etc.
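
For challenge 2, one simple way to picture a circular brush is as a filled disc stamped at every point along the mouse path. The sketch below uses a plain distance test instead of the midpoint circle and flood fill algorithms named above, which makes it shorter but slower; `brush_stamp` and `paint_stroke` are hypothetical names.

```python
def brush_stamp(cx, cy, radius):
    """Return all integer cells within `radius` of the centre (a filled disc)."""
    return {
        (cx + dx, cy + dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
        if dx * dx + dy * dy <= radius * radius
    }


def paint_stroke(path_points, radius):
    """Union of the brush stamps centred on each point of a stroke."""
    selected = set()
    for x, y in path_points:
        selected |= brush_stamp(x, y, radius)
    return selected
```

Storing the result as a set keeps overlapping stamps from producing duplicate cells.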

Optional features

The brush tool is expected to be completed and delivered before the first evaluation on 12 July. One of the following features will then be selected and developed according to the schedule.

1. Overtone selection

The aforementioned real-life noises are similar to other audio in that they consist of both a fundamental frequency (F0) and overtone resonances; to effectively eliminate the unwanted noises, all of these should be selected and removed.

It would be nice to approximate the overtones automatically from the F0, without users’ manual selection, and the threshold decision for such approximation is important.
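
A first approximation, assuming ideally harmonic overtones at integer multiples of F0 (real sounds can deviate from this), is sketched below. The threshold is expressed as a minimum energy ratio relative to F0, which is only one possible choice; all names are hypothetical.

```python
def harmonic_series(f0_hz, max_hz):
    """Return the overtone frequencies 2*F0, 3*F0, ... up to max_hz."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    harmonics = []
    k = 2
    while k * f0_hz <= max_hz:
        harmonics.append(k * f0_hz)
        k += 1
    return harmonics


def select_overtones(f0_hz, max_hz, energy_at, f0_energy, min_ratio):
    """Keep overtones whose energy is at least min_ratio times the F0 energy.

    energy_at is a callable mapping a frequency in Hz to its spectral energy.
    """
    return [
        f for f in harmonic_series(f0_hz, max_hz)
        if energy_at(f) >= min_ratio * f0_energy
    ]
```

Note that with the view band-limited to 5000Hz, a sound with F0 near 2900Hz has no overtones inside the visible band, so the frequency limit matters as much as the threshold.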

2. Area re-selection

The area selected by the new tools can be adjusted using UI components like sliders, to decide the spectral energy threshold, for improving the editing experience.
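
A slider-driven re-selection could be as simple as filtering the brushed cells by their energy each time the slider moves. This sketch assumes the candidate cells and their spectral energies are available as a dictionary; the function name is hypothetical.

```python
def reselect(energy_by_cell, threshold):
    """Keep only the (time, freq) cells whose energy meets the threshold."""
    return {cell for cell, energy in energy_by_cell.items() if energy >= threshold}
```

Raising the threshold shrinks the selection towards the loudest cells, and lowering it grows the selection back towards everything the brush touched.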


This project aims to make spectral editing more widely accessible to all users regardless of their editing experience. The above features will hopefully complete the spectral editing function of Audacity and empower more creative editing ideas. 

Thanks to the Audacity team once again for accepting my proposal and I am looking forward to the coding stage! I will be writing weekly blogs during development and the links will also be updated here.

Source Separation and Extensible MIR Tools for Audacity

Hello! My name is Hugo Flores Garcia. I have been selected to build source separation and other music information retrieval (MIR) tools for Audacity, as a project for Google Summer of Code. My mentor is Dmitry Vedenko from Audacity’s new development team, with Roger from the old team providing assistance.

What does source separation do, anyway?

Source separation would bring many exciting opportunities for Audacity users. The goal of audio source separation is to isolate the sound sources in a given mixture of sounds. For example, a saxophone player may wish to learn the melody to a particular jazz tune, and can use source separation to isolate the saxophone from the rest of the band and learn the part. On the other hand, a user may separate the vocal track from their favorite song to generate a karaoke track. In recent years, source separation has enabled a wide range of applications for people in the audio community, from “upmixing” vintage tracks to cleaning up podcast audio.

Source separation aims to isolate individual sounds from the rest of a mixture. It is the opposite of mixing different sounds, which can be a complex, non-linear process. How sources are mixed makes separating them a difficult problem, suitable for deep learning. Image used courtesy of Ethan Manilow, Prem Seetharaman, and Justin Salamon [source].

For an in-depth tutorial on the concepts behind source separation and coding up your own source separation models in Python, I recommend this awesome ISMIR 2020 tutorial.

Project Details

This project proposes the integration of deep learning based computer audition tools into Audacity. Though the focus of this project is audio source separation, the framework can be constructed such that the integration of other desirable MIR tools, such as automatic track labeling and tagging, can be later incorporated with relative ease by using the same deep learning infrastructure and simply introducing new interfaces for users to interact with. 

State of the art (SOTA) source separation systems are based on deep learning models. One thing to note is that individual source separation models are designed for specific audio domains. That is, users will have to choose different models for different tasks. For example, a user must pick a speech separation model to separate human speakers, and a music separation model to separate musical instruments in a song. 

Moreover, there can be a tradeoff between separation quality and the size of the model, and larger models will take a considerably longer amount of time to separate audio. This is especially true when users are performing separation without a GPU, which is our expected use case.  We need to find the right balance of quality and performance that will be suitable for most users. That being said, we expect users to have different machines and quality requirements, and want to provide support for a wide variety of potential use cases. 

Because we want to cater to this wide variety of source separation models, I plan on using a  modular approach to incorporating deep models into Audacity, such that different models can be swapped and used for different purposes, as long as the program is aware of the input and output constraints. PyTorch’s torchscript API lets us achieve such a design, as Python models can be exported into “black box” models that can be used in C++ applications. 

With a modular deep learning framework incorporated into Audacity, staying up to date with SOTA source separation models is simple. I plan to work closely with the Northwestern University Source Separation Library (nussl), which is developed and maintained by scientists with a strong presence in the source separation research community.  Creating a bridge between pretrained nussl models and the Audacity source separation interface ensures that users will always have access to the latest models in source separation research. 

Another advantage of this modular design is that it lays all the groundwork necessary for the incorporation of other deep learning-based systems in Audacity, such as speech recognition and automatic audio labeling!

I believe the next generation of audio processing tools will be powered by deep learning. I am excited to introduce the users of the world’s biggest free and open source audio editor to this new class of tools, and look forward to seeing what people around the world will do with audio source separation in Audacity! You can look at my original project proposal here, and keep track of my progress on my fork of Audacity on GitHub. 


My name is Hugo Flores García. Born and raised in Honduras, I’m a doctoral student at Northwestern University and a member of the Interactive Audio Lab. My research interests include sound event detection, audio source separation, and designing accessible music production and creation interfaces.