In this summary, I would like to list evaluation metrics from Music Generation papers, along with short summaries of each.
-
Distribution of Pitches
The idea comes from the Folk-RNN paper. They compare the distribution of pitches in the dataset with that of the generated outputs. I think we can easily implement this metric. (This is a unary feature.)
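A minimal sketch of what this comparison could look like, assuming each song is available as a list of integer pitch values (e.g. MIDI numbers or 53-TET step indices); the function names and the total-variation comparison are my own choices, not from the paper:
```python
from collections import Counter

def pitch_distribution(songs):
    """Normalized histogram of pitches over a corpus of pitch sequences."""
    counts = Counter(pitch for song in songs for pitch in song)
    total = sum(counts.values())
    return {pitch: n / total for pitch, n in sorted(counts.items())}

def total_variation(p, q):
    """A simple scalar distance between two pitch distributions."""
    pitches = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in pitches)

# Compare dataset vs. generated outputs, e.g.
# dist = total_variation(pitch_distribution(dataset), pitch_distribution(generated))
```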
-
Distribution of Number of Tokens
The idea comes from the Folk-RNN paper. They compare the number of tokens in a song (how many tokens there are until the token that represents the end of the sequence) between the dataset and the generated outputs.
P.S. We can add an end token to our dataset so that we know when a sequence ends.
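A small sketch of the idea, assuming songs are lists of tokens and the end-of-sequence symbol below is one we add ourselves (hypothetical, not from the paper):
```python
END_TOKEN = "<eos>"  # hypothetical end-of-sequence symbol appended to every song

def token_length(song):
    """Number of tokens before the end token (whole song if no end token)."""
    return song.index(END_TOKEN) if END_TOKEN in song else len(song)

# Toy example with two tokenized songs; compare the resulting length histograms
# of the dataset and the generated outputs.
songs = [["C4:4", "D4:4", END_TOKEN], ["E4:8", "E4:8", "F4:4", "G4:2", END_TOKEN]]
lengths = [token_length(s) for s in songs]  # [2, 4]
```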
-
Each section will end with a resolution
When the researchers inspect the output of Folk-RNN, they notice that each section ends with a resolution. We can use this kind of specific metric for our case.
-
Transition Matrix of Pitch and Duration
The idea comes from Algorithmic Composition of Melodies with Deep Recurrent Neural Networks; they compare the transition matrices of pitch and duration. We can easily implement and compare this. (This is a bi-gram feature.)
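A sketch of how the bi-gram transition matrix could be computed; `vocab` and the row normalization are my assumptions, and the same code works for pitch or duration sequences:
```python
import numpy as np

def transition_matrix(sequences, vocab):
    """Row-normalized bigram transition matrix over the symbols in `vocab`."""
    index = {symbol: i for i, symbol in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[index[a], index[b]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Compare dataset vs. generated matrices, e.g. np.linalg.norm(M_data - M_gen).
```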
-
Conservation of Metric Structure
The idea comes from Algorithmic Composition of Melodies with Deep Recurrent Neural Networks. They use Irish music as the dataset and note that, from a rhythmical point of view, even though the model had no notion of bars implemented, the metric structure was preserved in the generated continuations.
P.S. Andre Holzapfel's paper, Relation between surface rhythm and rhythmic modes in Turkish makam music, can be helpful for understanding the metric structure of Turkish Makam Music.
-
Mutual Information with Time
I saw this idea in Music Generation with Variational Recurrent Autoencoder Supported by History. The main source of this metric is Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language.
-
From the introduction of the paper: We show that in many data sequences — from texts in different languages to melodies and genomes — the mutual information between two symbols decays roughly like a power law with the number of symbols in between the two. In contrast, we prove that Markov/hidden Markov processes generically exhibit exponential decay in their mutual information, which explains why natural languages are poorly approximated by Markov processes. We present a broad class of models that naturally reproduce this critical behavior.
-
This stackoverflow question can be helpful.
-
We can use this library to compute Mutual Information.
-
P.S. Mutual information is a quantity that measures the relationship between two random variables that are sampled simultaneously. In particular, it measures how much information is communicated, on average, in one random variable about another.
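A sketch of how mutual information as a function of symbol distance could be estimated. It uses scikit-learn's `mutual_info_score` as one possible implementation (my assumption, not necessarily the library mentioned above), and reliable estimates need long sequences:
```python
from sklearn.metrics import mutual_info_score

def mutual_information_with_distance(sequence, max_distance):
    """Empirical MI between symbols d positions apart, for d = 1..max_distance."""
    result = {}
    for d in range(1, max_distance + 1):
        result[d] = mutual_info_score(sequence[:-d], sequence[d:])
    return result

# Power-law decay of MI(d) is the behaviour the paper associates with natural,
# non-Markovian sequences; Markov processes decay exponentially.
mi = mutual_information_with_distance([60, 62, 64, 62, 60, 62, 64, 65, 64, 62] * 20, 8)
```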
-
Cross Entropy
The idea comes from Music Generation with Variational Recurrent Autoencoder Supported by History. They compare the cross entropy of the architectures near the saturation point.
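A minimal sketch of a per-token cross-entropy computation, assuming the model's next-token probabilities are available for a held-out sequence (the array names are illustrative):
```python
import numpy as np

def cross_entropy(pred_probs, targets):
    """Average negative log-likelihood (in nats) of the true next tokens.

    `pred_probs`: (N, vocab_size) array of predicted next-token probabilities.
    `targets`:    length-N array of the true token indices.
    """
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(pred_probs[np.arange(len(targets)), targets] + eps))
```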
Note that the next three metrics (Scale Consistency, Tone Span, Repetitions) come from C-RNN-GAN; their implementation is available in the repo.
-
Scale Consistency
Scale consistency is computed by counting the fraction of tones that are part of a standard scale, reporting the number for the best-matching such scale.
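A sketch of the computation for 12-TET with major scales only; for our case the candidate scales would instead be makam scales expressed in 53-TET:
```python
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}  # pitch classes; other scale types can be added

def scale_consistency(pitches):
    """Fraction of tones in the best-matching major scale over all 12 roots.

    `pitches` are assumed to be MIDI note numbers.
    """
    best = 0.0
    for root in range(12):
        in_scale = sum(((p - root) % 12) in MAJOR_SCALE for p in pitches)
        best = max(best, in_scale / len(pitches))
    return best
```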
-
Tone span
- Tone span is the number of half-tone steps between the lowest and the highest tone in a sample.
- In our case, we should use Koma-53 (53-TET) intervals as the unit instead of semitones.
-
Repetitions
Repetitions of short subsequences were counted, giving a score on how much recurrence there is in a sample. This metric takes only the tones and their order into account, not their timing.
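One possible reading of this metric, counting repeated occurrences of short pitch n-grams; the subsequence length below is an assumption, not a value fixed by the paper:
```python
from collections import Counter

def repetition_score(pitches, length=4):
    """Count repeated occurrences of length-`length` pitch subsequences.

    Only the tones and their order are used, not their timing.
    """
    grams = Counter(tuple(pitches[i:i + length])
                    for i in range(len(pitches) - length + 1))
    return sum(count - 1 for count in grams.values() if count > 1)
```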
-
Qualified note rate (QN)
The idea comes from CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS WITH BINARY NEURONS FOR POLYPHONIC MUSIC GENERATION. Qualified note rate (QN) computes the ratio of the number of qualified notes (notes no shorter than three time steps, i.e., a 32nd note) to the total number of notes. A low QN implies overly fragmented music.
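A sketch, assuming note durations are already available as counts of time steps where one step equals a 32nd note:
```python
def qualified_note_rate(note_durations, min_steps=3):
    """Ratio of notes lasting at least `min_steps` time steps to all notes."""
    qualified = sum(d >= min_steps for d in note_durations)
    return qualified / len(note_durations)
```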
-
Average pitch interval (PI)
Average value of the interval between two consecutive pitches in semitones. The output is a scalar for each sample.
For our case: semitones -> 53-TET (Koma) intervals.
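A sketch; with MIDI numbers the unit is semitones, and with 53-TET step indices (our case) the unit is komas:
```python
import numpy as np

def average_pitch_interval(pitches):
    """Mean absolute interval between consecutive pitches in a sample."""
    return float(np.mean(np.abs(np.diff(pitches))))
```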
-
Note Count
The number of used notes. As opposed to the pitch count, the note count does not contain pitch information but is a rhythm-related feature. The output is a scalar for each sample.
-
Average inter-onset-interval (IOI)
To calculate the inter-onset-interval in the symbolic music domain, we find the time between two consecutive notes. The output is a scalar in seconds for each sample.
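A sketch, assuming the onsets are given as a sorted list of times in seconds:
```python
import numpy as np

def average_ioi(onset_times):
    """Mean time between consecutive note onsets, in seconds."""
    return float(np.mean(np.diff(onset_times)))
```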
-
Pitch range (PR)
The pitch range is calculated by subtracting the lowest used pitch from the highest, in semitones. The output is a scalar for each sample.
Now, let's look at some metrics from TUNING RECURRENT NEURAL NETWORKS WITH REINFORCEMENT LEARNING. These are based on music theory rules.
-
Notes Excessively Repeated
-
Notes not in scale (This is good to report visually in melody bigrams, i.e. coloring the notes not in the scale differently.)
-
Melodies starting with tonic
- In makams, starting not with the tonic (karar) but with the initial/dominant (başlangıç/güçlü) note is important. We need to elaborate on this explanation.
-
Melodies with unique min and max note
-
Notes in motif
-
Notes in repeated motif
-
Leaps Resolved
In our first meeting, we also discussed the following metrics:
-
Makam Classification
- For this one, we can use t-SNE and UMAP for unsupervised classification. We tried this for my graduation project.
-
Method: The first feature we came up with was, for each piece, a matrix of log-frequency distances between the decision note and every other note. We took the decision note of a piece to be its very last note. Since each song included a variety of notes with different bemol and diez degrees, we created a dictionary entry for each variant of a note in a specific octave with a specific bemol or diez degree. To build the resulting matrix, we calculated the Euclidean distance between each note and the decision note to form the draft of our first feature set, and afterwards transformed the per-song distance matrix into a fixed-size matrix with the same format for every song. The resulting matrix was the first feature set of our feature matrix.
As a second feature, we formed note-frequency histograms for each song, and the bins of these histograms were the second feature set of our feature matrix.
Lastly, we created a one-hot matrix for the type of the makam song. Each song in a different makam was written with a different method, and by using the information of the song we extracted the method it was written in. Afterwards, for each method, we used a different feature set to represent the song together with the previous feature sets.
After these transformations, we applied our unsupervised methods for makam classification. We also have method and form labels; however, according to experts on Turkish Makam Music, makam is the most important classifier for emotion.
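A sketch of the projection step, assuming the features described above have already been assembled into an (n_songs, n_features) matrix; it uses scikit-learn's TSNE and the umap-learn package:
```python
from sklearn.manifold import TSNE
import umap  # from the umap-learn package

def embed_2d(features):
    """Project the song feature matrix to 2-D with t-SNE and UMAP."""
    tsne_points = TSNE(n_components=2).fit_transform(features)
    umap_points = umap.UMAP(n_components=2).fit_transform(features)
    return tsne_points, umap_points

# The 2-D points can then be scatter-plotted and coloured by makam label to
# see whether pieces of the same makam cluster together.
```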
-
Usul Classification (We probably do not have much time for this)
-
User studies (We probably do not have much time for this)
-
Note Distribution of the first section, second section, etc.
TO-DO