Classify drum audio samples through the use of Artificial Intelligence / Machine Learning.
The Drum Audio Classifier uses a Convolutional Neural Network (CNN) to predict the most likely drum type of an audio file. The dataset used to create this model consists of 2,700+ audio samples from my freelance music production work.
The advantage of a CNN model for audio classification is that it tolerates spatial transformations and offsets in the input data. The model learns the shape of the audio data over time and frequency, not the exact locations. If a "Kick Drum" audio sample is higher pitched than the modelled samples, it will still register as a Kick Drum. If a "Snare Drum" sample doesn't strike until 2-3 seconds in, it will still register as a Snare Drum.
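As a rough illustration, a small Keras CNN over a (pitch × time) input might look like the sketch below. The layer counts and sizes here are illustrative assumptions, not the saved model's exact architecture:

```python
import tensorflow as tf

# Illustrative sketch only: the saved model's real architecture may differ.
# The input is a spectrogram treated as an image: (pitch bins, time frames, channels).
def build_drum_cnn(input_shape=(128, 128, 3), num_classes=5):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        # Convolutions learn local time/frequency patterns wherever they occur,
        # which is what gives the shift tolerance described above.
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```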
A Streamlit demo is available on Huggingface Spaces. You may test your own drum samples, or use the samples provided in the demo.
Dependencies: Python 3.10, matplotlib 3.7.0, pandas 1.5.3, librosa 0.10.0, scikit-learn (the deprecated "sklearn 0.0" PyPI alias), numpy 1.23.5, tensorflow 2.10.0, IPython 8.10.0
I would suggest using Jupyter Notebooks and following the Cloud Demo Steps.
- The saved model only classifies five drum types: Kick Drum, Snare Drum, Closed Hat Cymbal, Open Hat Cymbal, and Clap (see the inference sketch after this list).
- The saved model has a Validation Loss of 5.00+. This implies overfitting: the saved model is "memorizing" patterns rather than generalizing. In practice I did not see problems with classifying samples; however, your mileage may vary.
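For reference, classifying a preprocessed sample with the saved model might look like this minimal sketch. The model path and label order are assumptions, so check the notebook for the real values:

```python
import numpy as np
import tensorflow as tf

MODEL_PATH = "saved_model"  # hypothetical path -- use the repo's actual saved model
CLASS_NAMES = ["Kick Drum", "Snare Drum", "Closed Hat Cymbal",
               "Open Hat Cymbal", "Clap"]  # assumed label order

model = tf.keras.models.load_model(MODEL_PATH)

def classify(spectrogram_rgb: np.ndarray) -> str:
    """Predict the drum type of one preprocessed (pitch, time, 3) array."""
    probs = model.predict(spectrogram_rgb[np.newaxis, ...])[0]
    return CLASS_NAMES[int(np.argmax(probs))]
```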
The raw dataset comes directly from H3 Music Corp and consists of 2,746 drum audio samples, plus a CSV file associating each file with a drum type classification. These files were provided and maintained as uncompressed WAV files. The reasoning is that, during and after development, it is easier for users and developers to verify and interact with audio files than with arrays: if a developer needs to verify the type of a drum sample, they can just listen to it, removing a lot of hassle. The raw uncompressed WAV files are available in the dataset/samples folder. The CSV file associating types is in the dataset folder, titled "samples_metadata.csv". The "drum_prediction.ipynb" notebook converts the WAV files into arrays of amplitude at each pitch (0 - 128) over time.
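A minimal sketch of that conversion using librosa's mel spectrogram (128 pitch bins is librosa's default; the notebook's exact parameters may differ):

```python
import librosa
import numpy as np
import pandas as pd

# Paths follow the repo layout described above.
metadata = pd.read_csv("dataset/samples_metadata.csv")

def wav_to_array(path: str) -> np.ndarray:
    """Convert one WAV file to a (128 pitch bins x time frames) amplitude array."""
    y, sr = librosa.load(path, sr=None)  # keep the file's native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)  # log scale is friendlier to the CNN
```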
It is important to note that CNNs are designed for image data, not audio data. To remedy this, the audio amplitude data is duplicated across all 3 color channels (RGB). This makes the CNN "think" it is looking at a color image; instead of height and width, the model uses pitch and time as its dimensions.
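For example, duplicating a 2-D spectrogram into three identical channels can be done with a single NumPy call:

```python
import numpy as np

def to_rgb(mel_db: np.ndarray) -> np.ndarray:
    """Copy a (pitch, time) spectrogram into 3 identical channels -> (pitch, time, 3)."""
    return np.repeat(mel_db[..., np.newaxis], 3, axis=-1)
```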
A massive thank you to Dr. Papia Nandi for her work on CNNs for Audio Classification.