Source Separation – GSoC 2021 Week 6

Hi all! 

There aren’t many updates this week. I spent the past week cleaning up bugs in the model manager related to networking and threading. I hit a block around Wednesday, when the deep learning effect stopped showing up in the Plugin Manager entirely. It took a couple of days to figure out what was going wrong, but I’m back on track now and ready to keep the ball rolling.

To do:

  • Fix a bug where the download progress gauge appears in the bottom left corner of the ModelCardPanel instead of on top of the install button.
  • Refactor ModelCard so that we serialize/deserialize the internal JSON object only when necessary.
  • Add a top panel for the model manager UI, with the following functionality (a rough sketch of the search/filter idea follows this list):
    • Search through model cards
    • Filter by
      • Domain (music, speech, etc.)
      • Task (separation, enhancement)
      • Other metadata keys
    • Manually add a HuggingFace repo
  • If a model is installed and there’s a newer version available, let the user know.
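For the search/filter item above, here is a minimal sketch of what matching model cards against a free-text query and metadata filters could look like. The metadata field names (“domain”, “task”) are illustrative assumptions; the real ModelCard schema may differ.

```python
# Rough sketch of matching model cards against a query and metadata filters.
# Field names ("domain", "task") are illustrative; the real ModelCard
# metadata schema may differ.
import json

def matches(card: dict, query: str = "", **filters) -> bool:
    """Return True if a card matches a free-text query and exact metadata filters."""
    text = json.dumps(card).lower()
    if query and query.lower() not in text:
        return False
    return all(card.get(key) == value for key, value in filters.items())

cards = [
    {"name": "convtasnet-speech", "domain": "speech", "task": "separation"},
    {"name": "demucs-music", "domain": "music", "task": "separation"},
]

print([c["name"] for c in cards if matches(c, query="tasnet")])   # ['convtasnet-speech']
print([c["name"] for c in cards if matches(c, domain="music")])   # ['demucs-music']
```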

Source Separation – GSoC 2021 Week 5

Hi all! Here are this week’s updates: 

I’ve made progress on the Model Manager! Right now, the metadata for all HuggingFace repositories tagged “audacity” is fetched and displayed as model cards (as seen below). If a user chooses to install a model, the model manager queries HuggingFace for the actual model file (the heavy stuff) and installs it into a local directory. This interface lets users choose from deep learning models trained by contributors around the world for a wide variety of applications.
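As a rough sketch of that first step, querying the HuggingFace Hub’s public REST API for repos carrying a given tag might look something like this; the exact response fields consumed here (modelId, tags) are assumptions for illustration.

```python
# Sketch: query the HuggingFace Hub's public REST API for repos carrying a
# given tag and turn the results into lightweight model cards.
import requests

def fetch_model_cards(tag: str = "audacity"):
    resp = requests.get(
        "https://huggingface.co/api/models",
        params={"filter": tag},
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"repo_id": repo.get("modelId", repo.get("id")), "tags": repo.get("tags", [])}
        for repo in resp.json()
    ]

for card in fetch_model_cards():
    print(card["repo_id"])
```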

To do: 

  • GUI work
  • Searching and filtering across model cards
  • Grab a music separation model!
A prettier GUI coming soon!

Source Separation – GSoC 2021 Week 4

Hi all! Here are this week’s updates: 

Though the focus of the project is on building a source separation effect, much of the code written for it has proven generic enough to be used with any deep learning based audio processor, provided that the processor meets certain input/output constraints. Thus, we will be providing a way for researchers and deep learning practitioners to share their source separation (and more!) models with the Audacity community.

The “Deep Learning Effect” infrastructure can be used with any PyTorch-based model that takes a single-channel waveform (multichannel optional) and outputs an arbitrary number of audio waveforms, which are then written to output tracks.
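To make that input/output contract concrete, here is a minimal sketch of a model that satisfies it: a (channels, samples) waveform goes in, and a stack of source waveforms comes out. The “separation” below is just a toy placeholder so only the shapes matter; real models would be exported the same way.

```python
# Minimal sketch of the assumed input/output contract:
#   input  -> (channels, samples) waveform tensor
#   output -> (num_sources, samples) tensor, one row per separated source
# The "separation" here is a toy placeholder; only the shapes matter.
import torch

class ToySeparator(torch.nn.Module):
    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        mono = mix.mean(dim=0)                  # collapse channels to mono
        return torch.stack([0.7 * mono,         # pretend "source 1"
                            0.3 * mono])        # pretend "source 2"

model = torch.jit.script(ToySeparator())        # exportable as a TorchScript module
out = model(torch.randn(2, 16000))              # 2 channels, 1 second at 16 kHz
print(out.shape)                                # torch.Size([2, 16000])
```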

This opens up the opportunity to make available an entire suite of different processors, like speech denoisers, speech enhancers, source separators, audio super-resolution models, and so on, with contributions from the community. People will be able to upload the models they want to contribute to HuggingFace, and we will provide an interface for users to see and download these models from within Audacity. I will be working with nussl to provide wrappers and guidelines for making sure that the uploaded models are compatible with Audacity.

I met with Ethan from the nussl team, as well as Jouni and Dmitry from the Audacity team. We talked about what the UX design would look like for using the Deep Learning effects in Audacity. In order to make these different models available to users, we plan on designing a package manager-style interface for installing and uninstalling deep models in Audacity. 

I made a basic wireframe of what the model manager UI would look like:

Goals for this week:

  • Work on the backend for the deep model manager in Audacity. The manager should be able to:
    • Query HuggingFace for model repos that match certain tags (e.g. “Audacity”).
    • Keep a collection of these repos, along with their metadata.
    • Search and filter through the repos with respect to different metadata fields.
    • Install and uninstall different models upon request (see the sketch after this list).
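For the install/uninstall piece, here is a rough sketch of one way it could work: download a model file from the repo into a local models directory on install, and remove that directory on uninstall. The filename (model.pt) and the directory layout are assumptions for illustration.

```python
# Sketch: install a model by downloading one file from its HuggingFace repo
# into a local models directory, and uninstall by removing that directory.
# The filename ("model.pt") and directory layout are assumptions.
import shutil
from pathlib import Path

import requests

MODELS_DIR = Path.home() / ".audacity-deep-models"

def install(repo_id: str, filename: str = "model.pt") -> Path:
    url = f"https://huggingface.co/{repo_id}/resolve/main/{filename}"
    target_dir = MODELS_DIR / repo_id.replace("/", "__")
    target_dir.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(target_dir / filename, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return target_dir / filename

def uninstall(repo_id: str) -> None:
    shutil.rmtree(MODELS_DIR / repo_id.replace("/", "__"), ignore_errors=True)
```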

Source Separation – GSoC 2021 Week 3

Hi all! Here are this week’s updates: 

I spent the past week refactoring so that there’s a generic EffectDeepLearning that we can inherit from to use deep learning models for applications beyond source separation (like audio generation or labeling). I also modified the resampling behavior. Instead of resampling the whole track (which is wasteful if we’re only processing a 10-second selection in a 2-hour track), resampling is done directly on each buffer block via a torchscript module borrowed from torchaudio. Additionally, I added a CMake script for including built-in separation models in the Audacity package.
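Here is a rough Python-side sketch of that block-wise resampling idea, using torchaudio’s scriptable Resample transform; the sample rates and block size are just illustrative, and the effect itself drives a TorchScript export of a similar module from C++.

```python
# Sketch: resample each buffer block as it is processed, instead of resampling
# the whole track up front. Sample rates and block size are illustrative.
import torch
import torchaudio

resample = torch.jit.script(
    torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
)

def process_blocks(track: torch.Tensor, block_size: int = 44100):
    """Yield resampled blocks of a (channels, samples) tensor."""
    for start in range(0, track.shape[-1], block_size):
        block = track[..., start:start + block_size]
        yield resample(block)

selection = torch.randn(1, 10 * 44100)               # a 10-second mono selection
resampled = torch.cat(list(process_blocks(selection)), dim=-1)
print(resampled.shape)                               # roughly (1, 10 * 16000)
```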

Goals for next week

UI work

Because the separation models will be used as essentially black boxes (audio mixture in, separated sources out), I don’t think there’s much I can do for the actual effect UI, except for allowing the user to import pretrained models and displaying relevant metadata (sample rate, speech/music, and possibly an indicator of processing speed / separation quality). 

The biggest user interaction happens when the user chooses a deep learning model. The models could potentially be hosted on HuggingFace (https://huggingface.co/). The models can range anywhere from 10MB to upwards of 200MB in size, process audio at different sample rates, and take different amounts of time to compute. We could have a dedicated page on the Audacity website or in the manual that provides information on how to choose and download separation models, as well as links to specific models that are Audacity-ready.

Ideally, we would like to offer models for different use cases. For example, a user wanting to quickly denoise a recorded lecture would be fine using a lower-quality speech separation model with an 8kHz sample rate. On the other hand, someone trying to upmix a song would probably be willing to sacrifice longer compute time and use a higher quality music separation model with a 48kHz sample rate. 

Build work

I’ve managed to get the Audacity + deep learning build working on Linux and macOS, but not Windows. I’ll spend some time this week looking into writing a Conan recipe for libtorch that simplifies the build process for all platforms.

Source Separation – GSoC 2021 Week 2

Hi all! Here are this week’s updates: 

I finished writing the processing code! Right now, the Source Separation effect in Audacity is capable of loading deep learning models from disk and using them to perform source separation on a user’s selected audio. When the Source Separation effect is applied, it creates a new track for each separated source. Because source separation models tend to operate at lower sample rates (16 kHz is common for deep learning models), each output track is resampled and converted to the same sample format as the original mix track. To name the new source tracks, each source label is appended to the mix track’s name. For example, if my original track is called “mix” and my sources are [“drums”, “bass”], the new tracks will be named “mix – drums” and “mix – bass”. Here’s a quick demo of my progress so far:

Goals for next week:

  • Each new separated source track should be placed below the original mix. Right now, this is not the case, as all the new tracks are written at the bottom of the tracklist. I’d like to amend this behavior so that each separated source is grouped along with its parent mix.
  • Pass multi-channel audio to deep learning models as a multi-channel tensor. Let the model author decide what to do with multiple channels of audio (e.g. downmix). Write a downmix wrapper in PyTorch (see the sketch after this list).
  • Refactor processing code. A lot of the preprocessing and postprocessing steps could be abstracted away to make a base DeepLearningEffect class that contains useful methods that make use of deep learning models for different applications (e.g. automatic labeling, sound generation). 
  • Brainstorm different ideas for making a more useful and attractive UI for the Source Separation effect.
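For the downmix wrapper mentioned above, a minimal sketch could look like the following: average the input channels to mono before handing the audio to the wrapped separation model. The wrapper interface here is an assumption; the actual wrappers planned for nussl may look different.

```python
# Sketch of a PyTorch downmix wrapper: average the channels to mono before
# calling the wrapped separation model. The wrapped model is assumed to take
# a (1, samples) tensor and return (num_sources, samples).
import torch

class DownmixWrapper(torch.nn.Module):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        mono = mix.mean(dim=0, keepdim=True)   # (channels, samples) -> (1, samples)
        return self.model(mono)
```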

One idea for a Source Separation UI that’s been in the back of my mind is to take advantage of Audacity’s Spectral Editing Tools (refer to the fellow GSoC project by Edward Hui) to make a Source Separation Mask editor. This means that we would first have a deep learning model estimate a separation mask for a given spectrogram, and then let the user edit and fine-tune the spectral mask using spectral editing tools. This would let a user improve the source separation output through Photoshop-like post-processing, or even potentially give the model hints about what to separate!

Source Separation – GSoC 2021 Week 1

Hi all! My first week of GSoC went great. Here are some project updates:

I started prototyping some wrappers to export pretrained PyTorch source separation models for use in Audacity. The pretrained models will most likely be grabbed from Asteroid, an open source library with lots of recipes for training state-of-the-art source separation models. Most Asteroid models are TorchScript-compatible via tracing (see this pull request), and I’ve already successfully traced a couple of ConvTasNet models trained for speech separation that should be ready for Audacity. You can look at these wrapper prototypes in this Colab notebook. The idea is to ship each model with JSON-encoded metadata that we can display to the user. This way, we can inform a user of a model’s domain (e.g. speech/music), sample rate, size (larger models require more compute), and output sources (e.g. Bass, Drums, Voice, Other).
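As a rough illustration of that idea, tracing a pretrained Asteroid model and bundling JSON metadata into the saved TorchScript file could look something like this. The repo id, the metadata fields, and the use of _extra_files are assumptions for this sketch; the actual wrapper prototypes live in the Colab notebook linked above.

```python
# Sketch: trace a pretrained Asteroid model and bundle JSON metadata into the
# saved TorchScript file. The repo id and metadata fields are placeholders.
import json

import torch
from asteroid.models import ConvTasNet

model = ConvTasNet.from_pretrained("some-user/some-convtasnet-checkpoint")
model.eval()

example = torch.randn(1, 16000)                    # one second of audio at 16 kHz
traced = torch.jit.trace(model, example)

metadata = {
    "domain": "speech",
    "sample_rate": 16000,
    "sources": ["speaker1", "speaker2"],
}
torch.jit.save(traced, "convtasnet-audacity.pt",
               _extra_files={"metadata.json": json.dumps(metadata)})
```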

Wrapping a model for Audacity should be straightforward for people familiar with source separation in PyTorch, and I’m planning on adding a small set of wrappers to nussl that facilitate the process, accompanied by a short tutorial. Ideally, this should encourage the research community to share their groundbreaking source separation models with the Audacity community, giving us access to the latest and greatest source separation! 🙂 

On the Audacity side, I added a CMake script that adds libtorch to Audacity. However, because CPU-only libtorch is a whopping 200MB, source separation will likely be an optional feature, which means that libtorch needs to be an optional download that can be linked at runtime. I will be in touch with my mentor Dmitry Vedenko about figuring out a way forward from there.

I started writing code for the SourceSep effect in Audacity. So far, I’m able to load torchscript models into the effect, and display each model’s metadata.  By next week, I’d like to finish writing the processing code, where the audio data is converted from Audacity’s WaveTrack to a Torch Tensor, processed by the torchscript model, and the output is converted back to a WaveTrack.
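Sketched in Python terms (the effect does the equivalent through libtorch’s C++ API), that processing flow would roughly be: load the TorchScript model along with its bundled metadata, run it on the selection’s samples as a tensor, and get one waveform back per source. The metadata.json extra file is an assumption carried over from the wrapper sketch above.

```python
# Sketch: load a TorchScript separation model plus its bundled metadata,
# run it on the selection's samples, and pair each output with a source label.
import json

import torch

extra_files = {"metadata.json": ""}                      # populated by torch.jit.load
model = torch.jit.load("convtasnet-audacity.pt", _extra_files=extra_files)
metadata = json.loads(extra_files["metadata.json"])

selection = torch.randn(1, metadata["sample_rate"])      # stand-in for WaveTrack samples
sources = model(selection)                               # (num_sources, samples), per the contract

for label, waveform in zip(metadata["sources"], sources):
    print(label, waveform.shape)                         # each would become a new output track
```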

Audacity’s Effect interface lacks the capability to write the output of an effect to new WaveTracks. This behavior is desirable for source separation, since a model that separates into 4 sources (Drums, Bass, Voice, and Other), would ideally create 4 new WaveTracks bound to the input track, one track for each source.  Analysis effects (like FindClipping) already create a new label track that gets bound to the input track. I’ll dig deeper into how this is done, and see if I can extend this behavior so a variable number of WaveTracks can be created to write the separation output.

Goals for Next Week:

  • Finish writing the Effect Processing code so each output source is appended to the input WaveTrack. 
  • Start thinking about an approach to writing the separation output to new, multiple WaveTracks.

Source Separation and Extensible MIR Tools for Audacity

Hello! My name is Hugo Flores Garcia. I have been selected to build source separation and other music information retrieval (MIR) tools for Audacity, as a project for Google Summer of Code. My mentor is Dmitry Vedenko from Audacity’s new development team, with Roger from the old team providing assistance.

What does source separation do, anyway?

Source separation would bring many exciting opportunities for Audacity users. The goal of audio source separation is to isolate the sound sources in a given mixture of sounds. For example, a saxophone player may wish to learn the melody to a particular jazz tune, and can use source separation to isolate the saxophone from the rest of the band to learn the part. Another user might separate the vocal track from their favorite song to generate a karaoke track. In recent years, source separation has enabled a wide range of applications for people in the audio community, from “upmixing” vintage tracks to cleaning up podcast audio.

Source separation aims to isolate individual sounds from the rest of a mixture. It is, in effect, the opposite of mixing, which can be a complex, non-linear process; that complexity is what makes separating sources a difficult problem well suited to deep learning. Image used courtesy of Ethan Manilow, Prem Seetharaman, and Justin Salamon [source].

For an in-depth tutorial on the concepts behind source separation and on coding up your own source separation models in Python, I recommend this awesome ISMIR 2020 tutorial.

Project Details

This project proposes the integration of deep learning based computer audition tools into Audacity. Though the focus of this project is audio source separation, the framework can be constructed such that other desirable MIR tools, such as automatic track labeling and tagging, can later be incorporated with relative ease by reusing the same deep learning infrastructure and simply introducing new interfaces for users to interact with.

State of the art (SOTA) source separation systems are based on deep learning models. One thing to note is that individual source separation models are designed for specific audio domains. That is, users will have to choose different models for different tasks. For example, a user must pick a speech separation model to separate human speakers, and a music separation model to separate musical instruments in a song. 

Moreover, there can be a tradeoff between separation quality and the size of the model, and larger models will take a considerably longer amount of time to separate audio. This is especially true when users are performing separation without a GPU, which is our expected use case.  We need to find the right balance of quality and performance that will be suitable for most users. That being said, we expect users to have different machines and quality requirements, and want to provide support for a wide variety of potential use cases. 

Because we want to cater to this wide variety of source separation models, I plan on using a modular approach to incorporating deep models into Audacity, such that different models can be swapped in and used for different purposes, as long as the program is aware of their input and output constraints. PyTorch’s TorchScript API lets us achieve such a design, as Python models can be exported as “black box” modules that can be used in C++ applications.


With a modular deep learning framework incorporated into Audacity, staying up to date with SOTA source separation models is simple. I plan to work closely with the Northwestern University Source Separation Library (nussl), which is developed and maintained by scientists with a strong presence in the source separation research community.  Creating a bridge between pretrained nussl models and the Audacity source separation interface ensures that users will always have access to the latest models in source separation research. 

Another advantage of this modular design is that it lays all the groundwork necessary for the incorporation of other deep learning-based systems in Audacity, such as speech recognition and automatic audio labeling!

I believe the next generation of audio processing tools will be powered by deep learning. I am excited to introduce the users of the world’s biggest free and open source audio editor to this new class of tools, and look forward to seeing what people around the world will do with audio source separation in Audacity! You can look at my original project proposal here, and keep track of my progress on my fork of Audacity on GitHub. 

Bio

My name is Hugo Flores García. Born and raised in Honduras, I’m a doctoral student at Northwestern University and a member of the Interactive Audio Lab. My research interests include sound event detection, audio source separation, and designing accessible music production and creation interfaces.