Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1210 ] Gluon Audio #13241

Closed
wants to merge 20 commits into from
Closed

Conversation

gaurav-gireesh
Copy link
Contributor

@gaurav-gireesh gaurav-gireesh commented Nov 13, 2018

Description

Phase 1

As a user, I would like to have an out of the box feature of audio data loader and some popular audio transforms in MXNet, that would allow me :

  • to be able to load audio (only .wav files supported currently) files and make a Gluon AudioDataset (NDArrays),
  • apply some popular audio transforms on the audio data( example scaling, MEL, MFCC etc.),
  • load the Dataset using Gluon's DataLoader, train a neural network ( Ex: MLP) with this transformed audio dataset,
  • perform a simple audio data related task such as sounds classification - 1 audio clip with 1 label( Multiclass sound classification problem).
  • have an end to end example for a sample Audio multi class classification task (Urban Sounds Classification)

Note: The plan is to have a working piece of this model with an example into the contrib package of MXNet before it is agreed upon to move this implementation to the gluon.data module.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness.
  • Code is well-documented
  • For user-facing API changes, API doc string has been updated.
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Example contributed

Changes

  • AudioFolderDataset(gluon.contrib.data.audio.datasets),
  • audio transforms (gluon.contrib.data.audio.transforms)

Comments

  • This PR is a WIP. I will be adding tests for the modules added along with the documentation and an example(Notebook or a python script)
  • This feature allows to take first step in audio data related tasks and needs to be extended for more generic as well as advanced use cases - (Reason for contributing in the gluon.contrib package)
  • Opened this PR for early feedback regarding:
    • librosa - which is a dependency for this for audio load and feature extraction like mfcc, scale, mel etc.. This may have to be listed as a dependency in the CI script, or the tests will fail. Suggestions about where to include this or a work around is appreciated.
    • Design here- Feedback is required - Dev list discussion here

@gaurav-gireesh gaurav-gireesh requested a review from szha as a code owner November 13, 2018 01:13
@gaurav-gireesh
Copy link
Contributor Author

@mxnet-label-bot add [pr-work-in-progress]

@marcoabreu marcoabreu added the pr-work-in-progress PR is still work in progress label Nov 13, 2018
@zachgk
Copy link
Contributor

zachgk commented Nov 13, 2018

@mxnet-label-bot add [Python, Gluon, Feature]

@gaurav-gireesh
Copy link
Contributor Author

@marcoabreu Would appreciate your feedback regarding a way to add dependencies of librosa and sklearn libraries for this PR Marco. I use these libraries to read/load audio files and sklearn's label encoding. The CI is failing because these dependencies are not satisfied. I have not written test cases still the imports are failing. Thanks!

@marcoabreu
Copy link
Contributor

Hi, you probably want to have a look at /~https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_python.sh

But you got a good point here. MXNet should not fail if we don't actually require these dependencies for the workload we have. Thus, I'd suggest that you first try to determine why we get these errors in the first place.

The last thing we want is to add a hard dependency (in this case on librosa and sklearn) although people might don't even want to use it.

@roywei
Copy link
Member

roywei commented Nov 13, 2018

@gaurav-gireesh It's probably due to some code like `import gluon.dataset.*" automatically imported your audio dataset which requires librosa and sklearn. We should make sure if users don't need to use audio dataset, they don't need to install librosa and sklearn to build and use mxnet normally.

Copy link
Member

@roywei roywei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! Great work!
I would like to remove the dependency on sklearn, we don't need to depend on it for a small feature like label encoder.
Please also add unit tests and examples/tutorials when you have them!

label_tmp.append(line.split(",")[1].strip())
data_tmp = data_tmp[1:]
label_tmp = label_tmp[1:]
le = sklearn.preprocessing.LabelEncoder()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not follow ImageFolderDataset to generate synset? why is sklearn LabelEncoder a must here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am working to find a workaround here. We needed label encoding here because we read from a csv file that contains filename(location) and the corresponding label. It is different from the use case in ImageFolderDataset that requires each class(or label) to be a folder( with images inside them), and hence Folder can be encountered once and put in the synsets(no repetition there). However, in the csv file, a label can occur multiple times. Looking into this for a better way to get around this.
Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this.

if not librosa:
raise RuntimeError("Librosa dependency is not installed! Install that and retry")

audio_tmp = np.mean(librosa.feature.mfcc(y=x.asnumpy(), sr=22050, n_mfcc=40).T, axis=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we parameterize sr and n_mfcc instead of hard coding values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, addressed this. Thanks


"""

def __init__(self, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kwargs are available? as a new user, how do I know what can be passed? can we parameterize and document them well so users don't have to go over librosa documents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a description in the docstring about the parameters that could be passed to the function.

@gaurav-gireesh
Copy link
Contributor Author

@safrooze @sandeep-krishnamurthy Could you help with the review of this Pull request?
Would appreciate your comments and suggestions.
Here is the design proposal: https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio

root : str
Path to root directory.
transform : callable, default None
A function that takes data and label and transforms them
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gluon is not going to support this argument anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I will remove the argument. I am using transform_first() function instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szha - Can you please add more details? Why dataset should not support transform and what is the alternative?


class Loader(Block):
"""
This transform opens a filepath and converts that into an NDArray using librosa to load
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not how we usually define transform functions in Gluon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the Loader transform.

…nsforms addressing PR comments, removed transform argument from datasets - AudioFolderDataset
Copy link
Contributor

@vandanavk vandanavk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also execute pylint and rst-lint on all these files and fix the errors reported

**https://mxnet.incubator.apache.org/install/ **
4. Librosa is installed. To install, follow the instructions here:
**https://librosa.github.io/librosa/install.html**

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move these instructions to a README file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a README for the example. Thanks!

except ImportError as er:
warnings.warn("Argument parsing module could not be imported and hence \
no arguments passed to the script can actually be parsed.")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more concise warning would be good. Maybe it should fail on ImportError?
Also, what happens to argparse.ArgumentParser below when this import fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails, I pass some default arguments. There is a check which returns if directories are empty.

if args.epochs:
epochs = args.epochs
else:
epochs = 35
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default number of epochs is 35 here but but 30 in train() above. this should be same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Thanks.

file_format: str, default '.wav'
The format of the audio files(.wav, .mp3)
skip_rows: int, default 0
While reading from csv file, how many rows to skip at the start of the file to aboid reading in header
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo - 'avoid'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this.

"""
if not self._has_csv:
self.synsets = []
self.items = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 lines can be common as they appear in both the if and else part

The directory that contains the audio files on which predictions are to be made
"""
if not librosa:
warnings.warn("Librosa dependency not installed! Cnnot load the audio to make predictions. Exitting.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some typos

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected the typos.

x = x.asnumpy()
specs = librosa.feature.melspectrogram(x, **self.kwargs)
return nd.array(specs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Making this consistent with the other scripts.

@gaurav-gireesh gaurav-gireesh changed the title [MXNET-1210 ][WIP] Gluon Audio [MXNET-1210 ] Gluon Audio Nov 16, 2018
@gaurav-gireesh
Copy link
Contributor Author

@mxnet-label-bot remove [pr-work-in-progress]

@marcoabreu marcoabreu removed the pr-work-in-progress PR is still work in progress label Nov 16, 2018
@gaurav-gireesh
Copy link
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Nov 16, 2018
To be able to run this example:

1. Download the dataset(train.zip, test.zip) required for this example from the location:
**https://drive.google.com/drive/folders/0By0bAi7hOBAFUHVXd1JCN3MwTEU**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this drive folder? Who owns this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is from an urban sounds challenge/contest hosted by https://datahack.analyticsvidhya.com/contest/practice-problem-urban-sound-classification/
It was a free registration contest. The data is hosted on an open Google drive link.
If needed these could be hosted on a public s3?
or
can we add steps as this:
/~https://github.com/apache/incubator-mxnet/blob/master/example/gluon/house_prices/kaggle_k_fold_cross_validation.py

3. Apache MXNet is installed on the machine. For instructions, go to the link: **https://mxnet.incubator.apache.org/install/**

4. Librosa is installed. To install, follow the instructions here:
**https://librosa.github.io/librosa/install.html**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we can provide the pip command and avoid user going through another page.

pip install librosa
For more details refer here - https://librosa.github.io/librosa/install.html

@@ -0,0 +1,16 @@
# Urban Sounds classification in MXNet

Urban Sounds Dataset:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be good to add what this dataset is and what is the problem being solved.

# under the License.

"""
This module builds a model an MLP woth a configurable output layer( number of units in the last layer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: woth -> with

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should it be configurable? Different number of classes. can you make it more clear.

with net.name_scope():
net.add(gluon.nn.Dense(256, activation="relu")) # 1st layer (256 nodes)
net.add(gluon.nn.Dense(256, activation="relu")) # 2nd hidden layer
net.add(gluon.nn.Dense(num_labels))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why out of name_scope ?

return nd.array(X1), label

else:
warnings.warn(" Dependency librosa is not installed! \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these checks are already done in init?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since, user does not mandatorily need to use the audio module, these import failures cannot be raised as an exception. In init(), it issues a warning if librosa not found and returns. However, if get_item() is called, it will not be able to librosa load the audio to numpy arrays. These do not fail but issues a warning to the user, that librosa is needed for those functionalities. The module import of librosa is to be guarded as librosa is not supposed to be a hard dependency for MXNet.
What do you recommend? Should creation of AudioFolderDataset() error out if librosa not installed?

super(MFCC, self).__init__()

def forward(self, x):
if not librosa:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is already done at the top

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Not all the transforms require librosa like Scaling and Pad/Trim. so, testing the availability of librosa in the methods that actually use it and warning the user that this particular transform cannot be applied. What do you recommend?

elif isinstance(x, nd.NDArray):
y = x.asnumpy()
else:
warnings.warn("Input object is not numpy or NDArray... Cannot apply the transform: MFCC!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Messaging can be better

from ..... import ndarray as nd
from ....block import Block

class MFCC(Block):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these transforms are too specific to librosa. It takes librosa parameters as kwargs. What if we change backend from librosa to other library or implement them as mxnet operator?

Can we conceptualize these transforms and define the standard parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It makes sense to do what you have suggested. Some other frameworks pass along all the keywords arguments so this approach was chosen.

from common import with_seed


@with_seed()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add tests for the new audio folder dataset?


def predict(pred_dir='./Test'):
"""
The function is used to run predictions on the audio files in the directory `pred_directory`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check this line for rst_lint errors


def _list_audio_files(self, root, skip_rows=0):
"""
Populates synsets - a map of index to label for the data items.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check with rst_lint for errors here

filename = self.items[idx][0]
label = self.items[idx][1]

if librosa is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just if librosa ?

Retrieve the item (data, label) stored at idx in items
"""
filename = self.items[idx][0]
label = self.items[idx][1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filename, label = self.items[idx] ?


def __len__(self):
"""
Retrieves the number of items in the dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check rst_lint here

Parameters
----------
fn : callable
A transformer function that takes the first elemtn of a sample
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo

class MFCC(Block):
"""
Extracts Mel frequency cepstrum coefficients from the audio data file
More details : https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check rst-lint here

x = x.asnumpy()
specs = librosa.feature.melspectrogram(x, **self.kwargs)
return nd.array(specs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline at the end

@with_seed()
def test_pad_trim():
"""
Function to test Pad/Trim Audio transform
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check rst-lint here

@with_seed()
def test_scale():
"""
Function to test scaling of the audio transform
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please execute pylint and rst-lint on all these files.

n_fft: int, default 2048
length of the Fast fourier transform window
n_mels: int,
number of mel bands to generate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a default value?

class MEL(Block):
"""Create MEL Spectrograms from a raw audio signal. Relatively pretty slow.

Parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rst expects class attributes to be under the "Attributes" section. "Parameters" for function params. Please double check this.

Copy link
Member

@szha szha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concerns about the current code are:

  1. not enough use cases to validate the API.
  2. API design not conforming to Gluon standards.
  3. implementation efficiency.

For now, I'd suggest putting these code as one of the Gluon examples to receive feedbacks. Also, make sure that code that isn't used in the example doesn't get checked in.

@sandeep-krishnamurthy
Copy link
Contributor

@szha
Can you please provide more details on the following points.

  1. API design not conforming to Gluon standards.
  2. Implementation efficiency.

What can be the actionable feedback we can provide here to bring it to Contrib?

@szha
Copy link
Member

szha commented Nov 19, 2018

I chatted with @gaurav-gireesh and agreed on having these classes in the examples first. Once two more models are added we can consider making it part of API. Right now there's lack of evidence that these thin wrappers of librosa won't become bottleneck for proper audio-related applications such as ASR, and the compromise to have this in examples is a way to collect more use cases. I'd suggest having DeepSpeech2 and 3 and at least one ASR model with proper WFST models as validation.

@sandeep-krishnamurthy if you know better feel free to suggest.

@szha
Copy link
Member

szha commented Nov 20, 2018

#13325

@szha szha closed this Nov 20, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants