Source Separation – GSoC 2021 Week 6

Hi all! 

There aren’t many updates for this week. I spent the past week fixing bugs in the model manager related to networking and threading. I hit a block around Wednesday, when the deep learning effect stopped showing up in the Plugin Manager entirely. It took me a couple of days to figure out, but I’m back on track now and ready to keep the ball rolling. 

To do:

  • Fix a bug where the download progress gauge appears in the bottom-left corner of the ModelCardPanel instead of on top of the install button. 
  • Refactor ModelCard, so that we serialize / deserialize the internal JSON object only when necessary (see the sketch after this list). 
  • Add a top panel for the model manager UI, with the following functionality:
    • Search through model cards
    • Filter by:
      • Domain (music, speech, etc.)
      • Task (separation, enhancement)
      • Other metadata keys
    • Manually add a HuggingFace repo
  • If a model is installed and there’s a newer version available, let the user know.
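
For the ModelCard refactor, the idea is roughly the following lazy approach (a minimal Python sketch; the real class lives in the C++ model manager and the names here are made up):

    import json

    class ModelCard:
        """Sketch: keep the raw JSON string and only parse it on demand."""

        def __init__(self, raw_json: str):
            self._raw = raw_json    # cheap to store and pass around
            self._parsed = None     # filled in lazily

        @property
        def metadata(self) -> dict:
            # Deserialize only on first access, then reuse the cached dict.
            if self._parsed is None:
                self._parsed = json.loads(self._raw)
            return self._parsed

        def serialize(self) -> str:
            # If the card was never parsed or modified, reuse the original string.
            return self._raw if self._parsed is None else json.dumps(self._parsed)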

Source Separation – GSoC 2021 Week 5

Hi all! Here are this week’s updates: 

I’ve made progress on the Model Manager! Right now, all HuggingFace repositories with the tag “audacity” are downloaded and displayed as model cards (as seen below). If a user chooses to install a model, the model manager queries HuggingFace for the actual model file (the heavy stuff) and installs it into a local directory. This interface lets users choose from deep learning models trained by contributors around the world for a wide variety of applications.
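
Under the hood, the HuggingFace Hub calls look roughly like this (a Python sketch of the HTTP endpoints involved; the real manager is C++, and the model filename here is an assumption that would be read from the repo’s metadata in practice):

    import requests

    # List all model repos tagged "audacity" (returns repo ids plus metadata).
    repos = requests.get(
        "https://huggingface.co/api/models",
        params={"filter": "audacity"},
        timeout=30,
    ).json()
    for repo in repos:
        print(repo["id"], repo.get("tags", []))

    # Installing a model then means fetching the heavy model file from the chosen repo.
    repo_id, filename = repos[0]["id"], "model.pt"   # filename is an assumption
    url = f"https://huggingface.co/{repo_id}/resolve/main/{filename}"
    with open(filename, "wb") as f:
        f.write(requests.get(url, timeout=300).content)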

To do: 

  • GUI work
  • Searching and filtering between model cards
  • Grab a music separation model!
A prettier GUI coming soon!

GSOC 2021 with Audacity – Week 5

It has been an exciting week for me, after the completion of the first brush tool prototype! Currently, each windowed sample is passed via SpectrumTransformer into the newly added SpectralDataManager, which checks it against the previously selected data and zeros out all of the selected frequency bins. (Link to the commit)
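
Conceptually, the processing does something like this (a minimal NumPy/SciPy sketch rather than the actual C++ code; the boolean selection mask stands in for the data collected by the brush tool):

    import numpy as np
    from scipy.signal import stft, istft

    def zero_selected_bins(audio, sr, selection, nperseg=2048):
        """Zero the selected (frequency bin, time window) pairs and resynthesize.

        `selection` is a boolean mask with the same shape as the STFT, standing in
        for the per-window frequency selection stored by the brush tool.
        """
        _, _, spec = stft(audio, fs=sr, nperseg=nperseg)   # windowed FFT of each frame
        spec[selection] = 0.0                              # zero out the selected bins
        _, out = istft(spec, fs=sr, nperseg=nperseg)       # inverse FFT back to samples
        return out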

The brush tool demo

I chose “Killing Me Softly” by Roberta Flack (one of my all-time favorites!) and extracted a four-second snippet from the beginning. I also added a meow sound to it, since we all love cats and, more importantly, it contains pitch variation that cannot be effectively selected by the current tool (a horizontal line).

To use the brush tool, we simply drag through the meow sound and its overtones, then click the apply button; the selected frequency content is then zeroed out.

The full demo video is available here (with a before vs. after audio comparison):

https://drive.google.com/file/d/1bQJGncHWj_GqD19LOPeEp_og3j70akw8/view?usp=sharing

What’s next?

This is still a rather early stage for this new feature, and there are lots of potential improvements. For instance, we can definitely do better than zeroing out all the selected frequency bins, such as averaging samples from the non-selected windows (horizontally) or frequencies (vertically), or both!
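
For example, the horizontal variant could look something like this (a toy NumPy sketch over an STFT matrix, just to make the idea concrete):

    import numpy as np

    def fill_from_neighbors(spec, selection, radius=2):
        """Replace each selected bin with the average of nearby non-selected
        windows in the same frequency bin (the 'horizontal' variant)."""
        out = spec.copy()
        n_bins, n_frames = spec.shape
        for f, t in zip(*np.nonzero(selection)):
            lo, hi = max(0, t - radius), min(n_frames, t + radius + 1)
            neighbors = [spec[f, j] for j in range(lo, hi) if not selection[f, j]]
            if neighbors:
                out[f, t] = np.mean(neighbors)
        return out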

Moreover, I would also like to make the selection smarter. In photo editing, if we want to remove or isolate a subject from the background, we usually rely on tools like the magic wand to pick up most of the desired area for us intelligently, then fine-tune the result with a drawing tool. In the same way, I hope this tool will be able to guess and pick up the user’s intended selection (or at least most of it), so the user only needs to add or remove spectral data around the edges using the brush tool.

A step even further would be picking up the overtones automatically for the user during the “magic wand” stage. However, the overtones can be a bit tricky to calculate, since their shapes appear skewed in the linear view and we need to take the logarithmic scale as a reference when performing the computation (the user can edit in the logarithmic view, but we cannot easily “select a view” for the computation). Without introducing an advanced area-approximation algorithm, one possible approach is to slide the fundamental-frequency region across the frequency bins close to its integer multiples, and then estimate and spot the overtones by calculating their spectral energy similarity.
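
As a toy illustration of that last idea (a Python sketch on the magnitude spectrum of a single window; the thresholds and tolerances here are made up):

    import numpy as np

    def find_overtone_bins(frame_mag, f0_bin, n_harmonics=5, tol=1, thresh=0.1):
        """Look for overtones near integer multiples of the fundamental's bin.

        A candidate harmonic is kept if its local energy is at least `thresh`
        times the energy of the fundamental."""
        f0_energy = frame_mag[f0_bin] ** 2
        overtones = []
        for k in range(2, n_harmonics + 1):
            center = k * f0_bin
            if center >= len(frame_mag):
                break
            lo, hi = max(0, center - tol), min(len(frame_mag), center + tol + 1)
            peak = lo + int(np.argmax(frame_mag[lo:hi]))   # strongest bin near the multiple
            if frame_mag[peak] ** 2 >= thresh * f0_energy:
                overtones.append(peak)
        return overtones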

Source Separation – GSoC 2021 Week 4

Hi all! Here are this week’s updates: 

Though the focus of the project is on making a source separation effect, a lot of the code written for this effect has turned out to be generic enough that it can be used with any deep-learning-based audio processor, provided it meets certain input-output constraints. Thus, we will be providing a way for researchers and deep learning practitioners to share their source separation (and more!) models with the Audacity community. 

The “Deep Learning Effect” infrastructure can be used with any PyTorch-based model that takes a single-channel (multichannel optional) waveform and outputs an arbitrary number of audio waveforms, which are then written to output tracks.   
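
From the model author’s side, satisfying that contract looks roughly like this (a toy TorchScript sketch; the exact tensor shapes and packaging conventions are assumptions that will be pinned down in the nussl wrappers):

    import torch

    class MySeparator(torch.nn.Module):
        """Toy model meeting the expected contract: a waveform in, a stack of
        source waveforms out (a real model would be an actual network)."""

        def forward(self, mixture: torch.Tensor) -> torch.Tensor:
            # mixture: (channels, samples); split into two placeholder "sources".
            return torch.stack([mixture * 0.5, mixture * 0.5], dim=0)

    # Export as TorchScript so the effect can load it through libtorch.
    scripted = torch.jit.script(MySeparator())
    scripted.save("my_separator.pt")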

This opens up the opportunity to make available an entire suite of different processors, like speech denoisers, speech enhancers, source separators, audio super-resolution models, etc., with contributions from the community. People will be able to upload the models they want to contribute to HuggingFace, and we will provide an interface for users to see and download these models from within Audacity. I will be working with nussl to provide wrappers and guidelines for making sure that the uploaded models are compatible with Audacity. 

I met with Ethan from the nussl team, as well as Jouni and Dmitry from the Audacity team. We talked about what the UX design would look like for using the Deep Learning effects in Audacity. In order to make these different models available to users, we plan on designing a package manager-style interface for installing and uninstalling deep models in Audacity. 

I made a basic wireframe of what the model manager UI would look like:

Goals for this week:

  • Work on the backend for the deep model manager in Audacity. The manager should be able to:
    • Query HuggingFace for model repos that match certain tags (e.g. “Audacity”). 
    • Keep a collection of these repos, along with their metadata.
    • Search and filter through the repos with respect to different metadata fields (see the sketch after this list).
    • Be able to install and uninstall different models upon request.
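
For the search/filter part, the behavior I have in mind is roughly the following (a Python sketch; the metadata keys such as “domain” and “task” are placeholders):

    def filter_repos(repos, **criteria):
        """Keep only repos whose metadata matches every given key/value,
        e.g. filter_repos(repos, domain="music", task="separation")."""
        return [
            r for r in repos
            if all(r.get("metadata", {}).get(k) == v for k, v in criteria.items())
        ]

    def search_repos(repos, query):
        """Case-insensitive substring search over the repo id and description."""
        q = query.lower()
        return [
            r for r in repos
            if q in r.get("id", "").lower() or q in r.get("description", "").lower()
        ]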

GSOC 2021 with Audacity – Week 4

This week I performed several bug fixes and prepared for the last missing piece before the first evaluation: performing an FFT on the selected frames, editing the spectral data, and reversing it using an IFFT. The functions required have been modularized by Paul on the Noise-reduction-refactoring branch, where the SpectrumTransformer is introduced to the codebase.

Use EXPERIMENTAL flags on brush tool

The brush tool has now been refactored under the experimental flag EXPERIMENTAL_BRUSH_TOOL, a custom CMake flag set at compile time. The feature can now be safely merged into the existing codebase, and team members will be able to test it by simply flipping the flag. (Link to commit)

List of bug fixes

After applying the effect, the state is now properly appended to the history stack, making undo and redo possible. (Link to commit)

The trace of the brush stroke was distorted after undo and redo; this has now been fixed. (Link to commit)

The apply effect button is now correctly triggered, even in waveform view. (Link to commit)

Rebase current brushtool branch and prepare for first deliverable!

The development of the brush tool has now been rebased onto Noise-reduction-refactoring. Currently I am setting up a new class called SpectrumDataManager to encapsulate most of the spectral editing behind the scenes, with a worker class inheriting from TrackSpectrumTransformer.

GSOC 2021 with Audacity – Week 3

Multiple features were added to the tool this week as scheduled. I spent most of the development time understanding some of Audacity’s internal architecture.

For instance, how RegisteredFactory provides ways of binding data to a host class to avoid circular dependencies, and how UndoManager utilizes multiple layers of polymorphism to achieve a complete state backup.

The native color scheme

Following the original selection tool, which blends nicely into the spectrogram view without losing transparency, the brush tool will be modified accordingly to provide a similar user experience. (Link to commit)

The eraser tool

The basic eraser tool has been added; users can now erase the selected area. This feature is currently triggered by pressing Ctrl while dragging; the detailed design will be discussed further with the team’s designers. (Link to commit)

The apply effect button

A prototype button has been added to apply different effects to the selected area in the future (currently it simply removes the selection). It will most likely be replaced with a non-modal dialog, with optional UI features like a brush-size slider. (Link to commit)

The redo/undo feature

In the project manager, the undo manager triggers a virtual CopyTo() that is overridden by different derived classes. I have added another method right after it reaches WaveTrackView, since data from the sub-views should also be copied into the state history. Taking the whole Track as an argument to a single sub-view seems counter-intuitive, since we expect one-to-one sub-view copying. (Link to commit)

Source Separation – GSoC 2021 Week 3

Hi all! Here are this week’s updates: 

I spent the past week refactoring so that there’s a generic EffectDeepLearning that we can inherit from to use deep learning models for applications outside source separation (like audio generation or labeling). I also modified the resampling behavior. Instead of resampling the whole track (which can be wasteful if we’re only processing a 10-second selection in a 2-hour track), resampling is done directly on each buffer block via a TorchScript module borrowed from torchaudio. Additionally, I added a CMake script for including built-in separation models in the Audacity package. 
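
The per-block resampling is along these lines (a simplified Python sketch; in Audacity the scripted module is invoked from the C++ effect on each buffer it processes):

    import torch
    import torchaudio

    # Resample 44.1 kHz blocks down to a model's 16 kHz rate, one buffer at a time.
    resampler = torch.jit.script(
        torchaudio.transforms.Resample(orig_freq=44100, new_freq=16000)
    )

    block = torch.randn(1, 262144)   # one buffer block of mono audio
    resampled = resampler(block)     # only this block is resampled, not the whole track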

Goals for next week

UI work

Because the separation models will be used as essentially black boxes (audio mixture in, separated sources out), I don’t think there’s much I can do for the actual effect UI, except for allowing the user to import pretrained models and displaying relevant metadata (sample rate, speech/music, and possibly an indicator of processing speed / separation quality). 

The biggest user interaction happens when the user chooses a deep learning model. The models could potentially be hosted on HuggingFace (https://huggingface.co/). The models can range anywhere from 10MB to upwards of 200MB in size, process audio at different sample rates, and take different amounts of time to compute. We could have a dedicated page on the Audacity website or in the manual that provides information on how to choose and download separation models, as well as links to specific models that are Audacity-ready. 

Ideally, we would like to offer models for different use cases. For example, a user wanting to quickly denoise a recorded lecture would be fine using a lower-quality speech separation model with an 8kHz sample rate. On the other hand, someone trying to upmix a song would probably be willing to sacrifice longer compute time and use a higher quality music separation model with a 48kHz sample rate. 

Build work

I’ve managed to get the Audacity + deep learning build working on Linux and macOS, but not Windows. I’ll spend some time this week looking into writing a Conan recipe for libtorch that simplifies the build process for all platforms.

GSOC 2021 with Audacity – Week 2

This week I finished the basic prototype of the brush tool, and the original data structure designed for storing spectral data has been refactored. Adopting agile development, I have set up a Kanban board on GitHub Projects rather than Jira, since most of the team members are already on GitHub; the real-time progress of the project will now be traceable.

The refactoring of the structure

The first design accessed SpectrumView via a static variable; this has now been fixed so the data is local to each SpectrumView, meaning that the selection is stored separately for each stereo channel, and the same applies to different tracks.

Link to the commit

As suggested by Paul, we are sticking to the workflow of SpectrumView::DrawClipSpectrum; the structure has been modified from Frequency -> Time points to Time point -> Frequency bins.

Link to the commit
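
In Python terms, the change is roughly this (an illustrative sketch of the new layout; the real structure is C++):

    # Old layout: frequency bin -> the time points where it is selected.
    # New layout: time window   -> the set of selected frequency bins, which
    # matches how DrawClipSpectrum walks the spectrogram one time column at a time.
    selected: dict[int, set[int]] = {}

    def add_selection(window_index: int, freq_bin: int) -> None:
        selected.setdefault(window_index, set()).add(freq_bin)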

The missing cursor coordinates

The mouse events associated with UIHandle are not captured continuously (or frequently enough), meaning that if the user drags the mouse quickly, some of the coordinates will be missed. Consider the following graph, where the dragging speed increases from top to bottom:

An easy cheat would be to connect the last visited coordinate to the current one using wxDC::DrawLine; we can even customize the thickness with a single parameter. However, that only affects the selected area visually, and the intermediate coordinates are still missing from the structure. Since it is impossible to capture a pixel-perfect line from the screen, we need an algorithm to estimate the line, ideally with customizable thickness. Bresenham’s line drawing algorithm has been chosen; it will be further modified since we expect the brush to be circular, but it is good enough for prototyping.

Link to the commit
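
For reference, a plain Python version of the algorithm, just to show how the gap between two consecutive mouse samples gets filled in (the actual implementation lives in the C++ brush-tool code and adds a circular brush radius on top):

    def bresenham(x0, y0, x1, y1):
        """All integer coordinates on the line segment from (x0, y0) to (x1, y1)."""
        points = []
        dx, dy = abs(x1 - x0), -abs(y1 - y0)
        sx = 1 if x0 < x1 else -1
        sy = 1 if y0 < y1 else -1
        err = dx + dy
        while True:
            points.append((x0, y0))
            if x0 == x1 and y0 == y1:
                break
            e2 = 2 * err
            if e2 >= dy:
                err += dy
                x0 += sx
            if e2 <= dx:
                err += dx
                y0 += sy
        return points

    # Every (time bin, frequency bin) on the estimated line is added to the selection.
    print(bresenham(0, 0, 6, 3))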

To be done: the UX and undo/redo

As ordinary application users, keyboard shortcuts like Ctrl+Z and Ctrl+Y are almost muscle memory, and of course we expect similar functionality for this tool! Since the structure is new to the codebase and we cannot simply reuse ModifyState from ProjectHistory, we will need to tell the base class to copy this new structure when adding to the state history.

Source Separation – GSoC 2021 Week 2

Hi all! Here are this week’s updates: 

I finished writing the processing code! Right now, the Source Separation effect in Audacity is capable of loading deep learning models from disk and using them to perform source separation on a user’s selected audio. When the Source Separation effect is applied to the audio, the effect creates a new track for each separated source. Because source separation models tend to operate at lower sample rates (a common sample rate for deep learning models is 16 kHz), each output track is resampled and converted to the same sample format as the original mix track. To name each of the new source tracks, each source label is appended to the mix’s track name. For example, if my original track is called “mix” and my sources are [“drums”, “bass”], the new tracks will be named “mix – drums” and “mix – bass”. Here’s a quick demo of my progress so far:
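
In pseudo-Python, the flow is roughly the following (a hypothetical sketch: the model file name and the source labels are assumptions, and the real code does all of this in C++ via libtorch):

    import torch

    model = torch.jit.load("separator.pt")   # hypothetical TorchScript model file
    labels = ["drums", "bass"]               # source labels shipped with the model

    mixture = torch.randn(1, 16000 * 10)     # 10 s of mono audio at the model's rate
    sources = model(mixture)                 # one output waveform per source

    track_names = [f"mix - {label}" for label in labels]
    # -> ["mix - drums", "mix - bass"]: one new track per separated source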

Goals for next week:

  • Each new separated source track should be placed below the original mix. Right now, this is not the case, as all the new tracks are written at the bottom of the tracklist. I’d like to amend this behavior so that each separated source is grouped along with its parent mix.
  • Pass multi-channel audio to deep learning models as a multi-channel tensor. Let the model author decide what to do with multiple channels of audio (i.e. downmix). Write a downmix wrapper in PyTorch (see the sketch after this list). 
  • Refactor processing code. A lot of the preprocessing and postprocessing steps could be abstracted away to make a base DeepLearningEffect class that contains useful methods that make use of deep learning models for different applications (e.g. automatic labeling, sound generation). 
  • Brainstorm different ideas for making a more useful / attractive UI for the Source Separation effect. 
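
For the downmix wrapper mentioned above, I’m imagining something along these lines (a sketch; averaging channels is only one possible policy, and the model author is free to do something smarter):

    import torch

    class DownmixWrapper(torch.nn.Module):
        """Wrap a separation model so it always sees mono input: multi-channel
        audio is averaged down to a single channel before being passed through."""

        def __init__(self, model: torch.nn.Module):
            super().__init__()
            self.model = model

        def forward(self, mixture: torch.Tensor) -> torch.Tensor:
            # mixture: (channels, samples) -> (1, samples)
            mono = mixture.mean(dim=0, keepdim=True)
            return self.model(mono)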

One idea for a Source Separation UI that’s been in the back of my mind is to take advantage of Audacity’s Spectral Editing Tools (refer to the fellow GSoC project by Edward Hui) to make a Source Separation Mask editor. This means that we would first have a deep learning model estimate a separation mask for a given spectrogram, and then let a user edit and fine-tune the spectral mask using spectral editing tools. This would let a user improve the source separation output through Photoshop-like post-processing, or even potentially give the model hints about what to separate!

GSOC 2021 with Audacity – Week 1

This week I met with my mentor and learned more about the rendering logic and how the different methods of the inherited UIHandle work together. We set expectations for the next few weeks, and I also completed a prototype of the brush tool.

Work done this week:

  1. Created BrushHandle, inherited some basic functions and logic.
  2. Tried different approaches for displaying the brush trails in real-time, including rendering from the BrushHandle and SpectrumView respectively.
  3. Set up a data structure to store and convert the mouse events into frequency-time bins (adapted automatically to different user scaling; see the sketch after this list).
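
Conceptually, the conversion from a mouse position to spectrogram bins is something like this (a Python sketch with made-up parameter names; the real code derives these values from the current zoom and track settings, and also has to handle the logarithmic frequency view):

    def to_bins(x_pixel, y_pixel, view_width, view_height,
                t0, t1, f_min, f_max, hop_seconds, fft_size, sample_rate):
        """Map a mouse position inside the spectrogram view to
        (time window index, frequency bin index), assuming a linear frequency view."""
        seconds = t0 + (x_pixel / view_width) * (t1 - t0)         # horizontal -> time
        hz = f_max - (y_pixel / view_height) * (f_max - f_min)    # vertical -> frequency
        window_index = int(seconds / hop_seconds)
        freq_bin = int(hz * fft_size / sample_rate)
        return window_index, freq_bin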

Next week’s goal:

  1. Change the color of the selected area, to adapt to the existing color gradient scheme
  2. Refactor the data structure of the selected area
  3. Implement new UI components, to erase or apply the editing effect
  4. Append the selection to the state history