Source Separation – GSoC 2021 Week 2

Hi all! Here are this week’s updates: 

I finished writing the processing code! Right now, the Source Separation effect in Audacity is capable of loading deep learning models from disk and using them to perform source separation on a user’s selected audio. When the Source Separation effect is applied, it creates a new track for each separated source. Because source separation models tend to operate at lower sample rates (16 kHz is a common rate for deep learning models), each output track is resampled and converted to the same sample format as the original mix track. To name the new source tracks, each source label is appended to the mix track’s name. For example, if my original track is called “mix” and my sources are [“drums”, “bass”], the new tracks will be named “mix – drums” and “mix – bass”. Here’s a quick demo of my progress so far:
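As a side note on the post-processing described above, here is a minimal PyTorch-style sketch of the resample-and-rename step. It is illustrative only: the real effect lives in Audacity’s C++ code, and the names here (`separate_and_label`, `MODEL_SAMPLE_RATE`, the `model` argument) are hypothetical stand-ins.

```python
import torch
import torchaudio

MODEL_SAMPLE_RATE = 16_000  # many separation models operate at 16 kHz

def separate_and_label(model, mix: torch.Tensor, mix_rate: int,
                       mix_name: str, source_labels: list[str]):
    """Run the model on the mix and return (track name, audio) pairs."""
    # resample the mix down to the rate the model expects
    to_model = torchaudio.transforms.Resample(mix_rate, MODEL_SAMPLE_RATE)
    sources = model(to_model(mix))  # expected shape: (num_sources, samples)

    # resample each source back up to the project rate and label it
    to_project = torchaudio.transforms.Resample(MODEL_SAMPLE_RATE, mix_rate)
    return [(f"{mix_name} - {label}", to_project(source))
            for label, source in zip(source_labels, sources)]

# e.g. separate_and_label(model, mix, 44100, "mix", ["drums", "bass"])
# would yield tracks labeled "mix - drums" and "mix - bass"
```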

Goals for next week:

  • Each new separated source track should be placed below the original mix. Right now, this is not the case, as all the new tracks are written at the bottom of the tracklist. I’d like to amend this behavior so that each separated source is grouped along with its parent mix.
  • Pass multi-channel audio to deep learning models as a multi-channel tensor. Let the model author decide what to do with multiple channels of audio (e.g. downmixing to mono). Write a downmix wrapper in PyTorch (see the sketch after this list).
  • Refactor the processing code. A lot of the preprocessing and postprocessing steps could be abstracted into a base DeepLearningEffect class with shared methods for effects that use deep learning models in different applications (e.g. automatic labeling, sound generation).
  • Brainstorm ideas for a more useful and attractive UI for the Source Separation effect.
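For the downmix wrapper mentioned above, here is a minimal sketch of one way it could look, assuming a mono-only separation model; the class name and behavior are my own illustration, not settled code.

```python
import torch
import torch.nn as nn

class DownmixWrapper(nn.Module):
    """Average all input channels to mono before calling the wrapped model."""
    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (channels, samples); collapse to a (1, samples) mono mix
        mono = audio.mean(dim=0, keepdim=True)
        return self.model(mono)
```

Wrapping the model like this keeps the downmix decision on the model author’s side: the wrapped model could be exported (e.g. with `torch.jit.script`), so Audacity only ever hands it a multi-channel tensor and lets the model decide what to do with it.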

One idea for a Source Separation UI that’s been in the back of my mind is to take advantage of Audacity’s Spectral Editing Tools (see the fellow GSoC project by Edward Hui) to make a Source Separation Mask editor. The idea is to first have a deep learning model estimate a separation mask for a given spectrogram, and then let the user edit and fine-tune that spectral mask using the spectral editing tools. This would let a user improve the source separation output through Photoshop-like post-processing, or even potentially give the model hints about what to separate!
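To sketch the mask idea in code, here is a rough, hypothetical PyTorch illustration of how an estimated (and possibly user-edited) mask would be applied to the mix spectrogram before resynthesis; all names and parameters here are assumptions for illustration.

```python
import torch

def separate_with_mask(model, mix: torch.Tensor, n_fft: int = 2048):
    """Estimate a spectral mask, apply it to the mix, and resynthesize."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mix, n_fft, window=window, return_complex=True)

    # the model predicts a mask in [0, 1] over (freq bins, frames);
    # this is the mask a user could later touch up with spectral editing tools
    mask = model(spec.abs())

    masked = spec * mask  # apply the (possibly hand-edited) mask
    return torch.istft(masked, n_fft, window=window, length=mix.shape[-1])
```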