Document-level sentiment analysis of book reviews scraped from the Goodreads website. Technologies used include TensorFlow, Spark, HDFS, Sqoop, Scrapy, and D3.js.

weichung96/sentiment-analysis-goodreads-reviews

 
 


Overview

This project walks through the entire data science pipeline to better understand book review data from the Goodreads website. The main objective was to examine the sentiment of user reviews and book ratings across numerous genres. The approach was to train a model on a large amount of labeled data so that it generalizes well enough to classify the relatively small, unlabeled Goodreads review data set.

This work treats these relationships as an NLP problem, namely a document-level sentiment classification problem. Reviews are classified by sentiment, and data visualization techniques are then used to gain insight into the relationship between reviews, ratings, and genres.

Three machine learning techniques were used in this project to obtain classifications. The first uses a pretrained RNN with long short-term memory (LSTM) units and pretrained GloVe word embeddings; both were pretrained by Adit Deshpande and may be found here. The word embedding matrix contains 400,000 word vectors, each with a dimensionality of 50. The RNN was trained on the IMDb movie review dataset containing 12,500 positive and 12,500 negative reviews.
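As a rough illustration of how a review becomes input for the RNN, the sketch below maps words to row indices and looks up one 50-dimensional vector per word. It is not the project's code: the toy vocabulary, the `<pad>` and `<unk>` entries, the `MAX_LEN` value, and the random vectors are all assumptions standing in for the real 400,000-word GloVe matrix.

```python
import random
import re

EMBEDDING_DIM = 50  # vector size stated above
MAX_LEN = 10        # hypothetical fixed review length for this sketch

# Toy vocabulary standing in for the 400,000-word GloVe vocabulary.
# The <pad> and <unk> entries are assumptions of this sketch.
vocab = ["<pad>", "<unk>", "the", "book", "was", "wonderful", "boring"]
word_to_id = {word: i for i, word in enumerate(vocab)}

# Random rows stand in for the pretrained vectors: len(vocab) rows of 50 floats.
rng = random.Random(0)
embedding_matrix = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)] for _ in vocab
]


def review_to_ids(review, max_len=MAX_LEN):
    """Lowercase and tokenize a review, then truncate or pad to max_len ids."""
    tokens = re.findall(r"[a-z']+", review.lower())
    ids = [word_to_id.get(t, word_to_id["<unk>"]) for t in tokens][:max_len]
    return ids + [word_to_id["<pad>"]] * (max_len - len(ids))


def embed(ids):
    """Look up one embedding row per id."""
    return [embedding_matrix[i] for i in ids]


ids = review_to_ids("The book was wonderful!")
vectors = embed(ids)
print(ids)
print(len(vectors), len(vectors[0]))
```

The resulting `MAX_LEN` x 50 sequence is the kind of input an LSTM layer consumes, one word vector per time step.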

The second classification method trains a bidirectional LSTM network using pretrained fastText embeddings from here.

The third classification method uses a Naive Bayes model trained on a feature matrix built from the TF-IDF weights of the words in each sentence. This was done with Apache Spark ML.
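To make the TF-IDF plus Naive Bayes idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier over TF-IDF weights. It does not use Spark ML and is not the project's implementation; the function names, the four training sentences, and the smoothing value `alpha` are illustrative assumptions. The smoothed IDF form log((n + 1) / (df + 1)) is chosen to match the formula documented for Spark's IDF.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency: log((n + 1) / (df + 1))."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((n + 1) / (c + 1)) for t, c in df.items()}


def tfidf(tokens, idf):
    """Term frequency times IDF; terms unseen at fit time are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes with additive smoothing on TF-IDF weights."""
    vocab = {t for v in vectors for t in v}
    class_counts = Counter(labels)
    weight_sums = defaultdict(Counter)
    for v, y in zip(vectors, labels):
        for t, w in v.items():
            weight_sums[y][t] += w
    model = {}
    for y, count in class_counts.items():
        total = sum(weight_sums[y].values()) + alpha * len(vocab)
        log_prior = math.log(count / len(labels))
        log_like = {
            t: math.log((weight_sums[y][t] + alpha) / total) for t in vocab
        }
        model[y] = (log_prior, log_like)
    return model


def predict(model, vector):
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(
            w * log_like[t] for t, w in vector.items() if t in log_like
        )
    return max(model, key=score)


# Made-up training sentences: 1 = positive, 0 = negative.
train = [
    ("a wonderful, moving book", 1),
    ("loved the characters", 1),
    ("boring and far too long", 0),
    ("a dull, boring plot", 0),
]
docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]
idf = fit_idf(docs)
model = train_nb([tfidf(d, idf) for d in docs], labels)

print(predict(model, tfidf(tokenize("wonderful characters"), idf)))
print(predict(model, tfidf(tokenize("boring plot"), idf)))
```

In Spark ML the same stages would typically be expressed as a pipeline of a tokenizer, a term-frequency transformer, an IDF estimator, and a Naive Bayes classifier, which scales the computation across a cluster.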

TODO

  • Further analysis and visualization are needed to reach conclusions.
  • Port XIA-NB classifier to run on GPU.

Latest Results

The bar chart was adapted from Brice Pierre de la Briere's article. Red bars mark average book ratings for which the LSTM network predicted more negative reviews than positive ones. The larger number of blue bars suggests that the Goodreads rating system is broadly representative of user sentiment.

[Figure: bar chart rendered with D3.js]
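The coloring rule for the bar chart can be sketched as a small aggregation. This is an illustration under assumptions, not the project's code: it assumes each review arrives as a `(title, rating, sentiment)` tuple with sentiment encoded as +1 or -1, groups by title, and colors ties blue. The function name and sample data are made up.

```python
from collections import defaultdict


def bar_colors(reviews):
    """Return {title: (average_rating, color)} for (title, rating, sentiment) rows.

    A bar is red when more reviews were predicted negative than positive,
    and blue otherwise (ties are treated as blue in this sketch).
    """
    stats = defaultdict(lambda: {"ratings": [], "pos": 0, "neg": 0})
    for title, rating, sentiment in reviews:
        entry = stats[title]
        entry["ratings"].append(rating)
        if sentiment > 0:
            entry["pos"] += 1
        else:
            entry["neg"] += 1

    result = {}
    for title, entry in stats.items():
        average = sum(entry["ratings"]) / len(entry["ratings"])
        color = "red" if entry["neg"] > entry["pos"] else "blue"
        result[title] = (average, color)
    return result


# Made-up sample rows for two placeholder titles.
sample = [
    ("Title A", 5, 1), ("Title A", 4, 1), ("Title A", 3, -1),
    ("Title B", 2, -1), ("Title B", 3, -1), ("Title B", 4, 1),
]
print(bar_colors(sample))
```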

[Figures: six plots generated with seaborn]

These graphs were generated with code adapted from Martin Chorley's article. The nodes are colored by genre, and their radii vary with the average rating of the title. Positions in the y-direction are given by the rating multiplied by the sentiment (+1 or -1).

[Figures: two graphs rendered with D3.js]
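The node encoding described for these graphs (color by genre, radius by average rating, vertical position by rating times sentiment) could be prepared on the Python side along these lines before handing the data to D3.js. The function name, the `radius_scale` factor, the fallback color, and the example values are hypothetical.

```python
def node_attributes(title, genre, avg_rating, sentiment, palette, radius_scale=4.0):
    """Compute display attributes for one title.

    sentiment must be +1 or -1, so the y value carries the sign of the
    predicted sentiment and the magnitude of the rating.
    """
    if sentiment not in (1, -1):
        raise ValueError("sentiment must be +1 or -1")
    return {
        "title": title,
        "color": palette.get(genre, "#999999"),  # fallback color is an assumption
        "radius": radius_scale * avg_rating,     # radius grows with average rating
        "y": avg_rating * sentiment,             # rating multiplied by sentiment
    }


# Made-up palette and title.
palette = {"fantasy": "#1f77b4", "romance": "#d62728"}
node = node_attributes("Example Title", "fantasy", 4.0, -1, palette)
print(node)
```

With this encoding, titles with negative predicted sentiment fall below the horizontal axis, and highly rated titles sit farther from it.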

This force-directed graph was generated with code adapted from Martin Chorley's article and Mike Bostock's here.

[Figure: force-directed graph rendered with D3.js]

Dependencies

Usage

  1. Install dependencies:
     $ python -m virtualenv goodreads
     $ source goodreads/bin/activate
     $ pip install -r requirements.txt
  2. Create a SQL table to store the Goodreads review data:
     CREATE TABLE `reviews` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `title` varchar(128) NOT NULL,
      `genre` varchar(255) NOT NULL,
      `link_url` varchar(255) NOT NULL,
      `book_url` varchar(255) NOT NULL,
      `user` varchar(32) NOT NULL,
      `reviewDate` varchar(32) NOT NULL,
      `review` text NOT NULL,
      `rating` varchar(24) NOT NULL,
      PRIMARY KEY (`id`)
     ) ENGINE=InnoDB AUTO_INCREMENT=502 DEFAULT CHARSET=latin1;
  3. Run the Scrapy web crawler:
     $ cd utils
     $ scrapy crawl goodreads

     In pipelines.py, you may add words to the words_to_filter array in the RequiredFieldsPipeline class to filter the reviews.

  4. Choose a classification algorithm to run: change to the goodreads/models directory and run one of the following.
       • LSTM network: python train_eval_pipeline.py --train
       • Spark notebook: SparkSentimentAnalysis.ipynb
  5. Visualize data:
       1. Start a PHP server in the goodreads/visualization directory: php -S localhost:8000. If you use python -m http.server instead, you will get the error "Failed to load http://localhost:8000/data.php: No 'Access-Control-Allow-Origin' header is present on the requested resource..."
       2. Open index.html in a browser.
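For the crawler step, the word filter in pipelines.py might look roughly like the sketch below. Only the names `RequiredFieldsPipeline` and `words_to_filter` come from this README; the list contents, the `required_fields` tuple, and the filtering logic are assumptions, and the repository's actual implementation may differ. `process_item(self, item, spider)` and `DropItem` are the standard Scrapy item pipeline interface; the fallback class exists only so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    # Hypothetical contents; add the words you want to filter on.
    words_to_filter = ["spoiler", "giveaway"]
    # Assumed set of fields that every scraped review must have.
    required_fields = ("title", "review", "rating")

    def process_item(self, item, spider):
        # Drop items that are missing any required field.
        for field in self.required_fields:
            if not item.get(field):
                raise DropItem(f"missing field: {field}")
        # Drop reviews that contain any filtered word (case-insensitive).
        review = item["review"].lower()
        for word in self.words_to_filter:
            if word in review:
                raise DropItem(f"filtered word: {word}")
        return item


pipeline = RequiredFieldsPipeline()
kept = pipeline.process_item(
    {"title": "Example Title", "review": "A great read.", "rating": "5"}, spider=None
)
print(kept["title"])
```

Raising `DropItem` stops the item from reaching later pipeline stages, so filtered reviews never get written to the `reviews` table.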

Acknowledgements

  1. Adit Deshpande's article on oreilly.com.

  2. The Naive Bayes Classifier by the Text Mining Group, Nanjing University of Science & Technology.
