Source Separation – GSoC 2021 Week 1

Hi all! My first week of GSoC went great. Here are some project updates:

I started prototyping some wrappers to export pretrained PyTorch source separation models for use in Audacity. The pretrained models will most likely be grabbed from Asteroid, an open-source library with lots of recipes for training state-of-the-art source separation models. Most Asteroid models are TorchScript-compatible via tracing (see this pull request), and I’ve already successfully traced a couple of ConvTasNet models trained for speech separation that should be ready for Audacity. You can look at these wrapper prototypes in this Colab notebook. The idea is to ship each model with JSON-encoded metadata that we can display to the user. This way, we can inform a user of a model’s domain (e.g. speech / music), sample rate, size (larger models require more compute), and output sources (e.g. Bass, Drums, Voice, Other).
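To give a flavor of the tracing-plus-metadata idea, here’s a minimal sketch. The model below is a hypothetical stand-in (the real wrappers load pretrained Asteroid models such as ConvTasNet), and the metadata fields are illustrative, not a finalized schema:

```python
import json
import torch

# Hypothetical stand-in for a pretrained Asteroid separator (e.g. ConvTasNet);
# the real wrappers would load weights from an Asteroid recipe instead.
class TinySeparator(torch.nn.Module):
    def __init__(self, n_sources: int = 2):
        super().__init__()
        self.masker = torch.nn.Conv1d(1, n_sources, kernel_size=1)

    def forward(self, mix):
        # mix: (batch, 1, time) -> one masked copy of the mix per source
        return torch.sigmoid(self.masker(mix)) * mix

model = TinySeparator().eval()

# Trace with an example input; tracing records only the executed ops,
# so the model must avoid input-dependent control flow along this path.
example = torch.randn(1, 1, 16000)
traced = torch.jit.trace(model, example)

# Ship JSON-encoded metadata inside the .pt file itself.
metadata = {
    "domain": "speech",
    "sample_rate": 16000,
    "sources": ["speaker-1", "speaker-2"],
}
traced.save("separator.pt", _extra_files={"metadata.json": json.dumps(metadata)})

# A libtorch client (like Audacity) can read the metadata back at load time.
extra = {"metadata.json": ""}
reloaded = torch.jit.load("separator.pt", _extra_files=extra)
raw = extra["metadata.json"]
info = json.loads(raw.decode("utf-8") if isinstance(raw, bytes) else raw)
```

The nice part of `_extra_files` is that the model and its metadata travel as a single artifact, so Audacity never has to hunt for a sidecar file.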

Wrapping a model for Audacity should be straightforward for people familiar with source separation in PyTorch, and I’m planning on adding a small set of wrappers to nussl that facilitate the process, accompanied by a short tutorial. Ideally, this should encourage the research community to share their groundbreaking source separation models with the Audacity community, giving us access to the latest and greatest source separation! 🙂 
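A wrapper along these lines might look like the sketch below; the class name and interface are my own guesses at what the nussl helpers could provide, not a finalized API:

```python
import torch

class AudacityWrapper(torch.nn.Module):
    """Hypothetical wrapper enforcing one fixed contract for Audacity:
    mono audio of shape (time,) in, (n_sources, time) out."""

    def __init__(self, model: torch.nn.Module, sample_rate: int):
        super().__init__()
        self.model = model
        self.sample_rate = sample_rate

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        # Add the batch and channel dimensions the underlying model expects...
        mix = mix.reshape(1, 1, -1)
        sources = self.model(mix)  # (1, n_sources, time)
        # ...then drop the batch dimension so each row maps to one track.
        return sources.squeeze(0)
```

A researcher would wrap their trained model once, trace it as usual, and the resulting file would "just work" in Audacity regardless of the model's native input/output shapes.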

On the Audacity side, I added a CMake script that adds libtorch to Audacity. However, because even CPU-only libtorch is a whopping 200 MB, source separation will likely be an optional feature, which means that libtorch needs to be an optional download that can be linked at runtime. I will be in touch with my mentor Dmitry Vedenko about figuring out a way forward from there.
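For the build-time side of this, the general shape is to gate libtorch behind an option; the sketch below is only illustrative (the option name and target are placeholders, not the actual Audacity build code, and an optional runtime download would replace the straight `find_package` linking):

```cmake
# Hypothetical sketch: make libtorch an opt-in dependency.
option(AUDACITY_SOURCESEP "Enable deep-learning source separation" OFF)

if(AUDACITY_SOURCESEP)
  # Assumes CMAKE_PREFIX_PATH points at an unpacked CPU-only libtorch;
  # a runtime-download scheme would defer this to dynamic loading instead.
  find_package(Torch REQUIRED)
  target_link_libraries(Audacity PRIVATE "${TORCH_LIBRARIES}")
  target_include_directories(Audacity PRIVATE "${TORCH_INCLUDE_DIRS}")
endif()
```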

I started writing code for the SourceSep effect in Audacity. So far, I’m able to load TorchScript models into the effect and display each model’s metadata. By next week, I’d like to finish writing the processing code, where the audio data is converted from Audacity’s WaveTrack to a Torch Tensor, processed by the TorchScript model, and converted back to a WaveTrack.
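As a Python sketch of that pipeline (the real implementation will be C++ against the WaveTrack API; the helper below and its names are illustrative only):

```python
import numpy as np
import torch

def separate(track_samples: np.ndarray, model: torch.jit.ScriptModule):
    """Hypothetical processing step: WaveTrack samples -> tensor -> model -> tracks."""
    # WaveTrack audio is float32; wrap it as a (batch=1, channel=1, time) tensor.
    mix = torch.from_numpy(track_samples).reshape(1, 1, -1)
    with torch.no_grad():
        sources = model(mix)  # (1, n_sources, time)
    # Each row becomes the contents of one output WaveTrack.
    return [s.numpy() for s in sources.squeeze(0)]

# Demo with a tiny traced stand-in model that "separates" into 2 sources.
demo = torch.jit.trace(torch.nn.Conv1d(1, 2, kernel_size=1), torch.randn(1, 1, 8))
tracks = separate(np.zeros(8, dtype=np.float32), demo)
```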

Audacity’s Effect interface lacks the capability to write the output of an effect to new WaveTracks. This behavior is desirable for source separation: a model that separates into 4 sources (Drums, Bass, Voice, and Other) would ideally create 4 new WaveTracks bound to the input track, one for each source. Analysis effects (like FindClipping) already create a new label track that gets bound to the input track. I’ll dig deeper into how this is done and see if I can extend it so a variable number of WaveTracks can be created to hold the separation output.

Goals for Next Week:

  • Finish writing the Effect processing code so each output source is appended to the input WaveTrack.
  • Start thinking about an approach to writing the separation output to new, multiple WaveTracks.