GSoC 2021 – Work Product – Source Separation and Deep Learning Tools

Hi all! Google Summer of Code has wrapped up, and my work contributing a Source Separation effect and a deep learning toolkit to Audacity is mostly complete. I still have code review feedback to address, but the code is fully functional and all of the proposed features have been implemented.

Code Changes

You can view the commit history and code reviews on the Pull Request I submitted to the main Audacity repo.

More Links

Here are links to more information on this project:

Work Product Summary

  • Deep Learning Effects
    • EffectSourceSep: A built-in effect for performing source separation in Audacity. While this effect is technically able to do more than just source separation (the internal effect functions as a generic deep learning processor that can produce a multi-track output given a single-track input), it is branded as Source Separation, as we expect the majority of model contributions to be focused on source separation. 
    • EffectDeepLearning: A base class for built-in effects that use PyTorch models. EffectDeepLearning takes care of converting between torch::Tensor and the WaveTrack/WaveClip data types. 
    • (In Progress) EffectLabeler: With the help of Aldo Aguilar, we are hoping to contribute an effect capable of performing automatic track labeling. Such an effect would enable users to perform automatic speech-to-text transcription or annotation of different target sounds within a track.
  • Deep Learning Tools: an internal toolkit for managing and using deep learning models anywhere within Audacity. 
    • DeepModelManager: A class for fetching, downloading, installing, and uninstalling deep learning models from HuggingFace repositories.
    • DeepModel and ModelCard
      • DeepModel: a wrapper class for PyTorch models. Loads an internal resampling module, which is used to resample input audio to the model’s sample rate and to resample output audio back to Audacity’s project rate. Handles exceptions raised if loading the model fails, as well as internal errors during the model’s forward pass. 
      • ModelCard: a class for holding model metadata (see the metadata sketch after this list).
  • Deep Model Manager UI: GUI elements for interacting with deep learning models hosted in HuggingFace. 
    • ManagerToolsPanel: The top panel, as seen in the image above. Contains controls for exploring models in HuggingFace and importing them into the Model Manager UI.
    • ModelCardPanel scroller: a scroller for navigating through the fetched models. Contains a short description of the model’s purpose, as well as a color-coded tag meant to inform the user of the model’s intended data domain (that is, models tagged with “music” are meant to be used with music data, while models tagged with “speech” are meant to be used with speech data). 
    • DetailedModelCardPanel: a detailed view for deep models. Contains a longer description, model sample rate, additional tags, and a button that links to the HuggingFace repo’s README file, for even more information on the model.
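
To make ModelCard a bit more concrete, here is a minimal sketch of the kind of metadata file a model repository might ship. The field names (sample_rate, domain_tags, labels, and so on) are illustrative assumptions on my part, not the exact schema used in the PR:

```python
import json

# Illustrative only: treat these field names as placeholders rather than the
# exact ModelCard schema used by Audacity.
metadata = {
    "author": "example-user",                        # hypothetical HuggingFace user
    "sample_rate": 16000,                            # rate the model expects
    "domain_tags": ["music"],                        # e.g. "music" or "speech"
    "labels": ["drums", "bass", "vocals", "other"],  # one output track per source
    "short_description": "Separates a music mixture into four stems.",
    "long_description": "Shown in the detailed ModelCard view inside Audacity.",
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```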

Future Work

  • Finish addressing code review this week
  • Extract internal deep learning utilities to lib-deeplearning
  • Open a PR that incorporates EffectLabeler for deep learning-based sound labeling and tagging within Audacity

Special thanks to Bryan Pardo, Ethan Manilow, and Aldo Aguilar from the Interactive Audio Lab, as well as Dmitry Vedenko from the Audacity team, for all the helpful discussions and support I received throughout the project. I hope my contribution to Audacity provides the groundwork for bringing a new wave of deep learning-based effects into the hands of audio users.

Source Separation – GSoC 2021 Week 9

Hi all!

GSoC is starting to wrap up, and I’ve created two project boards to track the remaining work for the Deep Learning Tools project for Audacity. The first project board covers pending bug fixes and enhancements for the internal functionality of the Deep Model Manager (see the github link). The second board covers improvements to the UI for model selection (see the github link). All of the high-priority tasks in the first project board are done, and I am planning to finish both project boards by the end of the week (with help from Aldo Aguilar in the Interactive Audio Lab). 

The manager UI will contain a new detailed view for ModelCards that offers a link for opening the model in HuggingFace, as well as a longer description of the model within Audacity. Additionally, colored domain tags should help users pick the right model more easily. 

Source Separation – GSoC 2021 Week 8

Hi all! Here are some updates for this week. 

  • I cleaned up the commit history for the deep learning implementation and opened a pull request in the official Audacity repo. 
  • Added a dialog for manually specifying a HuggingFace repo to fetch (github). 
  • Fixed a bug where ModelCards weren’t scrollable until the user manually resized the window (github).
  • Amended the download behavior so the downloaded model file is written to disk incrementally, lowering memory consumption (github). 
  • Added sorting to ModelCard panels (github).
  • Fixed several other bugs in the Model Manager and its UI (github).

To do

  • Start writing documentation for model contributors. The documentation should provide instructions on how to properly structure a HuggingFace repo for an Audacity model, write a metadata file, and export the deep model to torchscript while ensuring that it meets the input/output constraints in Audacity. 
  • Continue to fix open issues with the model manager. 
  • Make ModelCards collapsible. Right now, only 2-3 can be shown on screen at a time. It may be a good idea to offer a collapsed view of the ModelCard. 
  • Provide a hyperlink (or a more info button) that points to the model’s HuggingFace readme somewhere in the ModelCard panel, so users can view more information about the model online (e.g. datasets, benchmarks, performance, examples).

Source Separation – GSoC 2021 Week 7

Hi all! Here are some updates for this week:

  • The issue related to the download progress gauge appearing in the bottom corner has been fixed, though the size of the gauge itself still needs tweaking. 
  • In order to let the user know how large a model is prior to installing, model cards now show the model’s file size.
  • ModelCard (a class for containing model metadata) was refactored last week so that it doesn’t hold on to the JSON document, but rather serializes/deserializes only when downloading from HuggingFace or installing to disk.
  • I’ve started work on a top panel for the model manager UI, which will contain the controls for refreshing repos, searching and filtering, as well as manually adding a repo.

In other news, Aldo Aguilar from the Interactive Audio Lab has been working on a Labeler effect built using EffectDeepLearning that will be capable of creating a label track with annotations for a given audio track. Possible applications of this effect include music tagging and speech-to-text, given that we can find pretrained models for both tasks. 

To do

  • Continue work on the top panel for the model manager UI. 
  • Right now, the response content for deep models is held entirely in memory while installing, which causes unnecessary memory consumption. Instead, we want to write the response data to disk incrementally (a sketch of the idea follows this list). 
  • Dmitry pointed out that the deep model’s forward pass is blocking the UI thread, since it can process large selections of audio at a time. Though a straightforward solution is to cut up the audio into smaller chunks, some deep learning models require a longer context window and/or are non-causal. I will spend more time investigating potential solutions to this. 
  • Layout work for model manager UI. Right now, most elements look out of place. I haven’t spent as much time on this because I’d like to finish writing the core logic of the DeepModelManager before digging into the details of the UI. 
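
On the memory point above: the real DeepModelManager is C++, but the idea is the same as in this Python sketch. Stream the HTTP response and write it to disk in chunks instead of buffering the whole model file (the URL is a hypothetical placeholder):

```python
import requests

# Hypothetical model file URL; streaming avoids holding the file in memory.
url = "https://huggingface.co/example-user/example-model/resolve/main/model.pt"

with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open("model.pt", "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)
```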

Source Separation – GSoC 2021 Week 6

Hi all! 

There aren’t many updates for this week. I spent the past week cleaning up bugs in the model manager related to networking and threading. I hit a block around Wednesday, when the deep learning effect stopped showing up in the Plugin Manager entirely. It took a couple of days to figure out what went wrong, but I’m back on track now, and I’m ready to keep the ball rolling. 

To do:

  • Fix a bug where the download progress gauge appears in the bottom left corner of the ModelCardPanel, instead of on top of the install button. 
  • Refactor ModelCard, so that we serialize/deserialize the internal JSON object only when necessary. 
  • Add a top panel for the model manager UI, with the following functionality:
    • Search through model cards
    • Filter by:
      • Domain (music, speech, etc.)
      • Task (separation, enhancement)
      • Other metadata keys
    • Manually add a HuggingFace repo
  • If a model is installed and there’s a newer version available, let the user know.

Source Separation – GSoC 2021 Week 5

Hi all! Here are this week’s updates: 

I’ve made progress on the Model Manager! Right now, all HuggingFace repositories with the tag “audacity” are fetched and displayed as model cards (as seen below). If a user chooses to install a model, the model manager queries HuggingFace for the actual model file (the heavy stuff) and installs it into a local directory. This interface lets users choose from deep learning models trained by contributors around the world for a wide variety of applications.
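
For a sense of what fetching repos by tag involves, here is a rough Python sketch against the public HuggingFace model-listing API. The endpoint and response fields are my assumptions about that API; the manager itself does this in C++ inside Audacity:

```python
import requests

# List model repos carrying the "audacity" tag (endpoint and fields assumed).
response = requests.get("https://huggingface.co/api/models",
                        params={"filter": "audacity"})
response.raise_for_status()

for repo in response.json():
    # Each entry should include the repo id (e.g. "user/model-name") and its tags.
    print(repo.get("modelId") or repo.get("id"), repo.get("tags", []))
```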

To do: 

  • GUI work
  • Searching and filtering between model cards
  • Grab a music separation model!
A prettier GUI coming soon!

Source Separation – GSoC 2021 Week 4

Hi all! Here are this week’s updates: 

Though the focus of the project is on making a source separation effect, a lot of the code written for this effect has turned out to be generic enough to be used with any deep learning-based audio processor, provided that it meets certain input/output constraints. Thus, we will be providing a way for researchers and deep learning practitioners to share their source separation (and more!) models with the Audacity community. 

The “Deep Learning Effect” infrastructure can be used with any PyTorch-based model that takes a single-channel (multichannel optional) waveform and outputs an arbitrary number of audio waveforms, which are then written to output tracks.
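
As a rough illustration of that contract, a wrapped model might look something like the sketch below. The class name and shape conventions are my assumptions for illustration, not the exact interface defined in the pull request:

```python
import torch

class SeparatorForAudacity(torch.nn.Module):
    """Hypothetical wrapper showing the input/output contract: one waveform in,
    a fixed number of source waveforms out (one output track per source)."""

    def __init__(self, model: torch.nn.Module, num_sources: int):
        super().__init__()
        self.model = model
        self.num_sources = num_sources

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (channels, samples); most separation models expect mono input.
        estimates = self.model(mixture)
        # estimates: (num_sources, samples); each row becomes a new track.
        return estimates
```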

This opens up the opportunity to make available an entire suite of different processors, like speech denoisers, speech enhancers, source separators, and audio super-resolution models, with contributions from the community. People will be able to upload the models they want to contribute to HuggingFace, and we will provide an interface for users to browse and download these models from within Audacity. I will be working with nussl to provide wrappers and guidelines for making sure that the uploaded models are compatible with Audacity. 

I met with Ethan from the nussl team, as well as Jouni and Dmitry from the Audacity team. We talked about what the UX design would look like for using the Deep Learning effects in Audacity. In order to make these different models available to users, we plan on designing a package manager-style interface for installing and uninstalling deep models in Audacity. 

I made a basic wireframe of what the model manager UI would look like:

Goals for this week:

  • Work on the backend for the deep model manager in Audacity. The manager should be able to:
    • Query HuggingFace for model repos that match certain tags (e.g. “Audacity”). 
    • Keep a collection of these repos, along with their metadata.
    • Search and filter through the repos with respect to different metadata fields.
    • Be able to install and uninstall different models upon request.

Source Separation – GSoC 2021 Week 3

Hi all! Here are this week’s updates: 

I spent the past week refactoring so that there’s a generic EffectDeepLearning that we can inherit from to use deep learning models for applications outside source separation (like audio generation or labeling). I also modified the resampling behavior. Instead of resampling the whole track (which is wasteful if we’re only processing a 10-second selection in a 2-hour track), resampling is done directly on each buffer block via a torchscript module borrowed from torchaudio. Additionally, I added a CMake script for including built-in separation models in the Audacity package. 
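
To illustrate the per-block resampling, here is a minimal Python sketch using torchaudio’s Resample transform; the sample rates are just examples, and inside Audacity the scripted module is driven from C++:

```python
import torch
import torchaudio

model_sr, project_sr = 16000, 44100  # example rates

# Scripted resamplers: project rate -> model rate, and back again.
to_model = torch.jit.script(torchaudio.transforms.Resample(project_sr, model_sr))
to_project = torch.jit.script(torchaudio.transforms.Resample(model_sr, project_sr))

block = torch.randn(1, project_sr)   # one ~1 s buffer block at the project rate
resampled = to_model(block)          # what gets fed to the model
restored = to_project(resampled)     # model output back at the project rate
```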

Goals for next week

UI work

Because the separation models will be used essentially as black boxes (audio mixture in, separated sources out), I don’t think there’s much I can do for the actual effect UI beyond letting the user import pretrained models and displaying relevant metadata (sample rate, speech/music, and possibly an indicator of processing speed / separation quality). 

The biggest user interaction happens when the user chooses a deep learning model. The models could potentially be hosted in HuggingFace (https://huggingface.co/). The models can range anywhere from 10MB to upwards of 200MB in size, process audio at different sample rates, and take different amounts of time to compute. We could have a dedicated page in the Audacity website or manual that provides information on how to choose and download separation models, as well as provides links to specific models that are Audacity-ready. 

Ideally, we would like to offer models for different use cases. For example, a user wanting to quickly denoise a recorded lecture would be fine using a lower-quality speech separation model with an 8kHz sample rate. On the other hand, someone trying to upmix a song would probably be willing to sacrifice longer compute time and use a higher quality music separation model with a 48kHz sample rate. 

Build work

I’ve managed to get the Audacity + deep learning build working on Linux and macOS, but not Windows. I’ll spend some time this week looking into writing a Conan recipe for libtorch that simplifies the build process for all platforms.

Source Separation – GSoC 2021 Week 2

Hi all! Here are this week’s updates: 

I finished writing the processing code! Right now, the Source Separation effect in Audacity is capable of loading deep learning models from disk and using them to perform source separation on a user’s selected audio. When the Source Separation effect is applied, it creates a new track for each separated source. Because source separation models tend to operate at lower sample rates (a common sample rate for deep learning models is 16 kHz), each output track is resampled and converted to the same sample format as the original mix track. To name the new source tracks, each source label is appended to the mix track’s name. For example, if my original track is called “mix” and my sources are [“drums”, “bass”], the new tracks will be named “mix – drums” and “mix – bass”. Here’s a quick demo of my progress so far:

Goals for next week:

  • Each new separated source track should be placed below the original mix. Right now, this is not the case, as all the new tracks are written at the bottom of the tracklist. I’d like to amend this behavior so that each separated source is grouped along with its parent mix.
  • Pass multi-channel audio to deep learning models as a multi-channel tensor. Let the model author decide what to do with multiple channels of audio (e.g. downmix). Write a downmix wrapper in PyTorch (see the sketch after this list). 
  • Refactor processing code. A lot of the preprocessing and postprocessing steps could be abstracted away to make a base DeepLearningEffect class that contains useful methods that make use of deep learning models for different applications (e.g. automatic labeling, sound generation). 
  • Brainstorm different ideas for making a more useful and attractive UI for the Source Separation effect. 
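
For the downmix wrapper mentioned above, a minimal sketch might look like this; the class is hypothetical and simply averages channels before calling a mono separation model:

```python
import torch

class DownmixWrapper(torch.nn.Module):
    """Hypothetical wrapper: average input channels to mono before handing the
    audio to a separation model that only accepts single-channel input."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (channels, samples) -> (1, samples) by averaging the channels.
        mono = mixture.mean(dim=0, keepdim=True)
        return self.model(mono)
```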

One idea for a Source Separation UI that’s been in the back of my mind is to take advantage of Audacity’s Spectral Editing Tools (refer to the fellow GSoC project by Edward Hui) to make a Source Separation Mask editor. This means that we would first have a deep learning model estimate a separation mask for a given spectrogram, and then let a user edit and fine-tune the spectral mask using spectral editing tools. This would let a user improve the source separation output through Photoshop-like post-processing, or even potentially give the model hints about what to separate!

Source Separation – GSoC 2021 Week 1

Hi all! My first week of GSoC went great. Here are some project updates:

I started prototyping some wrappers to export pretrained PyTorch source separation models for use in Audacity. The pretrained models will most likely be grabbed from Asteroid, an open source library with lots of recipes for training state-of-the-art source separation models. Most Asteroid models are torchscript-compatible via tracing (see this pull request), and I’ve already successfully traced a couple of ConvTasNet models trained for speech separation that should be ready for Audacity. You can look at these wrapper prototypes in this Colab notebook. The idea is to ship the model with JSON-encoded metadata that we can display to the user. This way, we can inform a user of a model’s domain (e.g. speech/music), sample rate, size (larger models require more compute), and output sources (e.g. Bass, Drums, Voice, Other).
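
A trace-and-export sketch along these lines might look as follows. The repo id is a placeholder and the example input shape is an assumption; the actual wrappers live in the Colab notebook linked above:

```python
import torch
from asteroid.models import ConvTasNet

# Placeholder repo id; any torchscript-traceable Asteroid checkpoint would do.
model = ConvTasNet.from_pretrained("username/ConvTasNet_example_speech_model")
model.eval()

example = torch.randn(1, 16000)           # ~1 s of mono audio at 16 kHz
traced = torch.jit.trace(model, example)  # export to torchscript via tracing
traced.save("separator.pt")               # the file Audacity would load from disk
```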

Wrapping a model for Audacity should be straightforward for people familiar with source separation in PyTorch, and I’m planning on adding a small set of wrappers to nussl that facilitate the process, accompanied by a short tutorial. Ideally, this should encourage the research community to share their groundbreaking source separation models with the Audacity community, giving us access to the latest and greatest source separation! 🙂 

On the Audacity side, I added a CMake script that adds libtorch to Audacity. However, because CPU-only libtorch is a whopping 200MB, source separation will likely be an optional feature, which means that libtorch needs to be an optional download that can be linked at runtime. I will be in touch with my mentor Dmitry Vedenko about figuring out a way forward from there. 

I started writing code for the SourceSep effect in Audacity. So far, I’m able to load torchscript models into the effect, and display each model’s metadata.  By next week, I’d like to finish writing the processing code, where the audio data is converted from Audacity’s WaveTrack to a Torch Tensor, processed by the torchscript model, and the output is converted back to a WaveTrack.

Audacity’s Effect interface lacks the capability to write the output of an effect to new WaveTracks. This behavior is desirable for source separation, since a model that separates into 4 sources (Drums, Bass, Voice, and Other) would ideally create 4 new WaveTracks bound to the input track, one track for each source. Analysis effects (like FindClipping) already create a new label track that gets bound to the input track. I’ll dig deeper into how this is done and see if I can extend this behavior so that a variable number of WaveTracks can be created to write the separation output.

Goals for Next Week:

  • Finish writing the Effect Processing code so each output source is appended to the input WaveTrack. 
  • Start thinking about an approach to writing the separation output to new, multiple WaveTracks.