This project was carried out by MSc Big Data Analysis students of St Joseph's University, Bangalore, as their first-semester project. The aim was to perform sentiment analysis on music lyrics and classify songs as angry, happy, or sad.

The app uses the following R packages:
- "shiny"
- "shinydashboard"
- "fmsb"
- "bslib"
- "dplyr"
- "ggplot2"
- "shinythemes"
- "geniusr"
- "DT"
- "rlang"
- "ds4psy"
- "spacyr"
- "tm"
- "parsedate"
- "caret"
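A minimal sketch for checking which of the packages listed above are not yet installed. The helper name `missing_packages` is hypothetical and not part of the project; the install call is left commented out because it depends on your configured repository.

```r
# Packages listed above (duplicates removed).
required_pkgs <- c(
  "shiny", "shinydashboard", "fmsb", "bslib", "dplyr", "ggplot2",
  "shinythemes", "geniusr", "DT", "rlang", "ds4psy", "spacyr",
  "tm", "parsedate", "caret"
)

# Hypothetical helper: return the subset of `pkgs` that is not installed
# in the current library paths.
missing_packages <- function(pkgs) {
  installed <- rownames(utils::installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required_pkgs)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install from your configured repository:
  # install.packages(to_install)
}
```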
1. Go to `RShiny/classification_module/classify.R`.
2. Copy the absolute path of the file `classify.R`.
3. Go to `RShiny/project.R`.
4. Paste the path copied in step 2 into line 13:

       source("classification_module/classify.R")

5. Install the following Python packages:

       pip install -U pip setuptools wheel
       pip install -U spacy
       python -m spacy download en_core_web_sm

6. Make the following changes in `RShiny/classification_module/classify.R`:
   - If spaCy is installed in a virtual environment, copy the path to that environment and update the `virtualenv` argument of

         spacy_initialize(model = 'en_core_web_sm', virtualenv = "path/spacyenv/")

   - Update the paths to all the corpora, which are found in `RShiny/classification_module`:

         angry_corpus = readRDS("path/angry_corpus.rda")
         sad_corpus = readRDS("path/sad_corpus.rda")
         happy_corpus = readRDS("path/happy_corpus.rda")

7. Run the app:
   - Using your terminal: navigate to the project's root directory `RShiny` and run

         R -e "shiny::runApp('project.R')"

   - Using RStudio: go to File > Open File, open the project folder, and select `project.R`. Then click Run App (top right).
The objective is to build a model, using R and R Shiny, that classifies music lyrics into one of three sentiment classes: Angry, Sad, or Happy.
To build a classifier for these sentiments, we had to collect music data, focusing on the lyrics. Because we had settled on a supervised model, we needed labelled data. Manually collecting and labelling lyrics is cumbersome, so we looked no further than one of the world's largest music databases, Spotify. On the assumption that Spotify's own classification is reasonably reliable, we created a playlist for each of the three sentiments, so that each song is labelled by the playlist it belongs to.
Using the Spotify API, we fetched all the songs in each playlist and labelled them by playlist (sentiment). Since the Spotify API does not provide lyrics, we fetched attributes such as artist name and song title, and then used them with the Genius API to fetch the lyrics of each song. The collected data was stored in three CSV files, one per sentiment/playlist.
The collected data, specifically the lyrics, contained a lot of noise (unwanted data) that our model does not need, so we had to strip it from the dataset. To do this, we performed the following operations:
- Converting lyrics to lowercase: All letters were converted to lowercase so that the text is uniform for the model.
- Removing lyrics divisions: Lyrics were divided into subsections such as chorus, pre-chorus, and intro, and some contained timestamps. These were removed as they are not useful for sentiment analysis.
- Correcting spellings: Spelling errors were corrected to ensure accuracy in the text data.
- Expanding contractions: Contractions were expanded into their full forms to standardize the text.
- Removing special characters and non-alphabetical characters: Special characters and numbers that do not contribute to sentiment analysis were removed.
- Removing extra spaces: Extra spaces, tabs, and new lines were collapsed into single spaces to clean up the text.
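The cleaning steps above can be sketched in base R as follows. This is an illustration, not the project's own code: the function name `clean_lyrics` and the tiny contraction map are hypothetical, only ASCII apostrophes are handled, and spelling correction is left out because it needs a dictionary.

```r
# Hypothetical contraction map; a real one would be far longer.
# Specific forms come first so that "can't" is not split by the "n't" rule.
contractions <- c(
  "can't" = "cannot",
  "won't" = "will not",
  "i'm"   = "i am",
  "n't"   = " not",
  "'re"   = " are",
  "'ll"   = " will",
  "'ve"   = " have"
)

clean_lyrics <- function(text) {
  text <- tolower(text)
  # Drop section markers such as [chorus] or [verse 1].
  text <- gsub("\\[[^]]*\\]", " ", text)
  # Expand contractions in the order given above.
  for (pattern in names(contractions)) {
    text <- gsub(pattern, contractions[[pattern]], text, fixed = TRUE)
  }
  # Keep letters and spaces only; this also removes timestamps and numbers.
  text <- gsub("[^a-z ]", " ", text)
  # Collapse runs of whitespace and trim both ends.
  text <- gsub("\\s+", " ", text)
  trimws(text)
}

cleaned <- clean_lyrics("[Chorus]\nI'm HAPPY,   don't stop!! 01:23")
print(cleaned)
```

The function is vectorised, so it can be applied to a whole column of lyrics in one call.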
The aim here was to extract features relevant to our model from the cleaned lyrics. We lemmatized the lyrics using part-of-speech tagging: lemmatization reduces words to their base forms (lemmas), and part-of-speech tags help determine the correct lemma. We also removed stop words, which carry little sentiment information.
Before building our model, we had to convert the dataset into a form that our model (Naive Bayes) can work with. We used a Document Term Matrix weighted by Term Frequency - Inverse Document Frequency (TF-IDF) to structure the text data.
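As a rough sketch of this step, the function below takes a token table with `doc_id`, `lemma`, and `pos` columns, which is the shape returned by `spacyr::spacy_parse()`, drops stop words and non-word tokens, and rejoins the lemmas per document. The function name, the short stop-word list, and the hand-built table are assumptions made so the example runs without a spaCy installation.

```r
# Hypothetical stop-word list; tm::stopwords("en") is a fuller alternative.
stop_words <- c("i", "she", "we", "the", "a", "be", "and", "to")

# `parsed` needs doc_id, lemma and pos columns. With spaCy set up, it could come from:
#   spacyr::spacy_initialize(model = "en_core_web_sm")
#   parsed <- spacyr::spacy_parse(texts, lemma = TRUE, pos = TRUE)
lemmatized_text <- function(parsed, stop_words) {
  keep <- !(parsed$pos %in% c("PUNCT", "SPACE", "NUM")) &
    !(tolower(parsed$lemma) %in% stop_words)
  kept <- parsed[keep, , drop = FALSE]
  # Group by document, keeping documents that lost all their tokens.
  groups <- factor(kept$doc_id, levels = unique(parsed$doc_id))
  # Rejoin the surviving lemmas into one string per document.
  vapply(split(tolower(kept$lemma), groups), paste, character(1), collapse = " ")
}

# Hand-built stand-in for parser output on "she was crying" and "we are dancing".
parsed <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2", "text2"),
  token  = c("she", "was", "crying", "we", "are", "dancing"),
  lemma  = c("she", "be", "cry", "we", "be", "dance"),
  pos    = c("PRON", "AUX", "VERB", "PRON", "AUX", "VERB"),
  stringsAsFactors = FALSE
)

lemmas <- lemmatized_text(parsed, stop_words)
print(lemmas)
```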
- Document Term Matrix: A technique for structuring text data where documents are placed in rows and terms in columns, with values representing term presence and weight.
- Term Frequency - Inverse Document Frequency (TF-IDF): Quantifies the importance of a term in a document relative to a collection of documents.
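The two definitions above can be worked through on a toy corpus in base R. The three documents are invented for illustration, and the weighting convention used here (term frequency normalised by document length, natural-log inverse document frequency) is one common choice; packages such as `tm` may use a different log base or normalisation.

```r
# Toy corpus (invented for illustration), already cleaned and lemmatized.
docs <- c(
  d1 = "hate hate fight",
  d2 = "cry alone cry",
  d3 = "dance smile fight"
)

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document Term Matrix of raw counts: one row per document, one column per term.
dtm <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

tf    <- dtm / rowSums(dtm)                 # term frequency within each document
idf   <- log(nrow(dtm) / colSums(dtm > 0))  # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)             # scale each column by its idf

print(round(tfidf, 3))
```

In this example "hate" occurs only in `d1`, so it gets a higher weight there than "fight", which is shared with `d3`.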
Finally, the processed data was fed into a Naive Bayes classifier to build a model that can classify lyrics into Angry, Sad, or Happy sentiments.
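To illustrate the idea, here is a small multinomial Naive Bayes with Laplace smoothing written in base R. It is a sketch under stated assumptions, not the classifier used in the project: the function names `train_nb` and `predict_nb`, the smoothing value, and the toy matrix are all hypothetical, and every class is assumed to have at least one training document.

```r
# x: numeric matrix (documents in rows, terms in columns); y: class labels.
train_nb <- function(x, y, alpha = 1) {
  y <- factor(y)
  # Sum the term weights of all documents in each class (classes in rows).
  class_totals <- rowsum(x, group = y)
  smoothed <- class_totals + alpha  # Laplace smoothing avoids log(0)
  list(
    log_prior = log(table(y) / length(y)),
    log_lik   = log(smoothed / rowSums(smoothed))
  )
}

predict_nb <- function(model, x) {
  # Log joint score per document and class: likelihood term plus log prior.
  scores <- x %*% t(model$log_lik)
  scores <- sweep(scores, 2, as.numeric(model$log_prior), `+`)
  # Softmax per row turns log scores into class probabilities.
  probs <- exp(scores - apply(scores, 1, max))
  probs / rowSums(probs)
}

# Toy term-count matrix (invented for illustration).
x <- rbind(
  c(hate = 3, cry = 0, dance = 0),
  c(hate = 2, cry = 1, dance = 0),
  c(hate = 0, cry = 3, dance = 0),
  c(hate = 0, cry = 2, dance = 1),
  c(hate = 0, cry = 0, dance = 3),
  c(hate = 1, cry = 0, dance = 2)
)
y <- c("Angry", "Angry", "Sad", "Sad", "Happy", "Happy")

model    <- train_nb(x, y)
new_song <- rbind(c(hate = 0, cry = 2, dance = 0))
probs    <- predict_nb(model, new_song)
print(round(probs, 3))
```

Each row of `probs` holds one probability per sentiment class and sums to 1, which is the kind of output that can be shown as a bar chart.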
After building the model, it was deployed using R Shiny. An interactive user interface was developed allowing users to select an artist and a song. The selected song’s lyrics are then passed to the model, and the probability of the lyrics being in each sentiment class is displayed in a bar chart.