
tgchacko/Movie-Recommender


Movie-Recommender

Table of Contents

Project Overview

Data Sources

Data Description

Tools

EDA Steps

Data Preprocessing Steps and Inspiration

Recommendation Techniques

Assumptions

Evaluation Metrics

Results

Recommendations

Limitations

Future Possibilities of the Project

References

Project Overview

This project involves creating a movie recommender system using various recommendation algorithms. The system includes simple recommenders, content-based filtering, and collaborative filtering techniques to provide movie recommendations.

Data Sources

The project utilizes two datasets:

  1. Ratings Dataset (ratings_small.csv): Contains user ratings for collaborative filtering.
     • Entries: 100,004
     • Columns: userId, movieId, rating
  2. Movie Metadata and Credits Datasets: Contain movie metadata and credits for the content-based and simple recommenders.

Data Description

Ratings Dataset

  • userId: Unique identifier for each user.
  • movieId: Unique identifier for each movie.
  • rating: Rating given by the user to the movie.

Movies Dataset

  • budget: Budget of the movie.
  • genres: List of genres associated with the movie.
  • homepage: URL of the movie's homepage.
  • id: Unique identifier for each movie.
  • keywords: Keywords related to the movie.
  • original_language: Original language of the movie.
  • original_title: Original title of the movie.
  • overview: Brief description of the movie plot.
  • popularity: Popularity score of the movie.
  • production_companies: Production companies involved in making the movie.
  • production_countries: Countries where the movie was produced.
  • release_date: Release date of the movie.
  • revenue: Revenue generated by the movie.
  • runtime: Duration of the movie.
  • spoken_languages: Languages spoken in the movie.
  • status: Release status of the movie.
  • tagline: Tagline of the movie.
  • title: Title of the movie.
  • vote_average: Average rating of the movie.
  • vote_count: Number of votes received by the movie.

Credits Dataset

  • movie_id: Unique identifier for each movie.
  • title: Title of the movie.
  • cast: List of main cast members.
  • crew: List of crew members.

Tools

Libraries

Below are the links for details and commands (if required) to install the necessary Python packages:


EDA Steps

  • Data loading and initial exploration
  • Data cleaning and manipulation
  • Checking for missing values and duplicates
  • Merging the movies and credits datasets

Data Preprocessing Steps and Inspiration

  • Handling Missing Values: Identified and handled missing values in the dataset.
  • Merging Datasets: Merged the movies and credits datasets on the movie identifier (id in the movies dataset, movie_id in the credits dataset).
  • Feature Extraction: Extracted relevant features such as cast, crew, genres, and overview for content-based filtering.
  • Creating Weighted Ratings: Calculated weighted ratings for movies using the IMDB formula.
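The IMDB weighted rating is usually written as WR = (v / (v + m)) * R + (m / (v + m)) * C, where R is a movie's average rating (vote_average), v is its number of votes (vote_count), m is the minimum vote threshold, and C is the mean rating across all movies. The sketch below shows the arithmetic in plain Python. The sample movies and the threshold of 100 votes are assumptions made for illustration and are not taken from the project.

```python
def weighted_rating(vote_average, vote_count, min_votes, mean_vote):
    """IMDB-style weighted rating: pulls a movie's average toward the
    global mean when the movie has few votes."""
    v, m = vote_count, min_votes
    return (v / (v + m)) * vote_average + (m / (v + m)) * mean_vote


# Made-up rows; the keys follow the Movies Dataset column names.
movies = [
    {"title": "A", "vote_average": 9.0, "vote_count": 10},
    {"title": "B", "vote_average": 8.0, "vote_count": 1000},
    {"title": "C", "vote_average": 6.0, "vote_count": 500},
]

mean_vote = sum(row["vote_average"] for row in movies) / len(movies)  # C
min_votes = 100  # m: assumed threshold, chosen only for this example

for row in movies:
    row["score"] = weighted_rating(
        row["vote_average"], row["vote_count"], min_votes, mean_vote
    )

# Highest score first: B, A, C. Movie A has the best raw average but
# only 10 votes, so its score is pulled toward the global mean.
ranked = sorted(movies, key=lambda row: row["score"], reverse=True)
print([row["title"] for row in ranked])
```

A common variant also drops movies with fewer than m votes before ranking; whether this project does so is not stated here.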

Recommendation Techniques

  • Simple Recommender - IMDB Weighted Rating: Uses a formula to calculate weighted ratings based on average rating, number of votes, and a minimum vote threshold.

  • Simple Recommender - Trending Movies: Recommends trending movies based on popularity.

  • Content-Based Filtering:

  1. Overview Based: Recommends movies based on plot similarity using TF-IDF and cosine similarity. TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents.
  2. Credits, Genres, and Keywords Based: Recommends movies based on similarity in cast, crew, genres, and keywords using CountVectorizer and cosine similarity. CountVectorizer converts a collection of text documents into a matrix of token counts, which supports text analysis and feature extraction.
  • Collaborative Filtering - Singular Value Decomposition (SVD): Uses matrix factorization to predict user ratings for movies based on past user ratings.
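To make the matrix factorization idea concrete, here is a minimal sketch in plain Python. It is not the project's implementation, and which library the project uses is not stated here. The sketch learns user and movie factor vectors by stochastic gradient descent on the observed ratings (the style of "SVD" popularized by the Netflix Prize, rather than a literal singular value decomposition). The function names, the hyperparameters, and the toy ratings are all assumptions.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Fit rating ~ mu + p[user] . q[movie] by SGD on (userId, movieId, rating)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(a * b for a, b in zip(p[u], q[i])))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step with L2 regularization on both factors.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mu, p, q


def predict(model, user, item):
    """Predicted rating; falls back to the global mean for unseen ids."""
    mu, p, q = model
    if user not in p or item not in q:
        return mu
    return mu + sum(a * b for a, b in zip(p[user], q[item]))


# Toy data: users 1-2 prefer movies 10 and 20, users 3-4 prefer 30 and 40.
ratings = [
    (1, 10, 5), (1, 20, 4), (1, 30, 1),
    (2, 10, 4), (2, 20, 5), (2, 40, 1),
    (3, 30, 5), (3, 40, 4), (3, 10, 1),
    (4, 30, 4), (4, 40, 5), (4, 20, 2),
]
model = train_mf(ratings)
print(round(predict(model, 1, 40), 2))  # user 1 has not rated movie 40
```

Recommending for a user then amounts to predicting a rating for every movie the user has not rated and keeping the highest predictions.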

Assumptions

  • Ratings provided by users are reliable.
  • User preferences are consistent over time.
  • Movies with higher ratings are preferred by users.

Evaluation Metrics for SVD

  • MAE (Mean Absolute Error): Measures the average magnitude of errors in a set of predictions, without considering their direction.
  • RMSE (Root Mean Squared Error): Measures the square root of the average squared differences between predicted and observed values.
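Both metrics compare predicted ratings with the ratings users actually gave. The short sketch below computes them by hand; the four rating pairs are invented for illustration and are not project results.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Invented example values.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 4.0]

print(mae(actual, predicted))   # 0.875
print(rmse(actual, predicted))  # about 1.146
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are large (here, the prediction that misses by 2.0), because squaring weights large errors more heavily.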

Results

For IMDB Dataset:

Simple Recommender - IMDB Weighted Rating

Findings: Weighted ratings were calculated using the IMDB formula, and the top 20 movies were ranked by score.

Results

Simple Recommender - Trending Movies:

Findings: Top 10 movies sorted by popularity.

Results

Content-Based Filtering - Overview Based

Findings: Recommends movies based on plot similarity using TF-IDF and cosine similarity.
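As a rough illustration of how plot similarity can be computed, the sketch below builds TF-IDF vectors and compares them with cosine similarity in plain Python so that each step is visible. A library vectorizer would normally handle tokenization, stop words, and smoothing. The three titles and overviews are invented, and the unsmoothed log(N / df) weighting is an assumption rather than the project's exact setting.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, overviews, top_n=2):
    """Titles ranked by overview similarity to `title`, most similar first."""
    titles = list(overviews)
    vectors = dict(zip(titles, tfidf_vectors(list(overviews.values()))))
    scores = {t: cosine(vectors[title], vectors[t]) for t in titles if t != title}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented titles and overviews.
overviews = {
    "Space Quest": "astronauts travel through space to save earth",
    "Star Voyage": "a crew of astronauts explore deep space",
    "Love Letters": "two strangers fall in love through letters",
}
print(recommend("Space Quest", overviews))  # "Star Voyage" ranks first
```

"Space Quest" shares the word "through" with "Love Letters", but it shares two terms with "Star Voyage", so the latter scores higher.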

Results

Content-Based Filtering - Credits, Genres, and Keywords Based

Findings: Recommends movies based on similarity in cast, crew, genres, and keywords using CountVectorizer and cosine similarity.
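The count-based variant can be sketched the same way: each movie's metadata is flattened into a list of tokens, the tokens are counted, and the count vectors are compared with cosine similarity. This plain-Python version only mirrors the idea behind CountVectorizer. The movies, the people's names, and the choice to represent the crew by a single director field are all hypothetical.

```python
import math
from collections import Counter


def make_soup(movie):
    """Flatten metadata into tokens; spaces are removed so that a full
    name such as 'Ann Lee' stays a single token ('annlee')."""
    names = movie["cast"] + [movie["director"]] + movie["genres"] + movie["keywords"]
    return [name.lower().replace(" ", "") for name in names]


def count_cosine(tokens_a, tokens_b):
    """Cosine similarity between the token-count vectors of two movies."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[term] * b[term] for term in a)  # Counter gives 0 if missing
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical movies with invented metadata.
movie_1 = {"cast": ["Ann Lee", "Bo Kim"], "director": "Cy Roe",
           "genres": ["Action", "Sci-Fi"], "keywords": ["robot"]}
movie_2 = {"cast": ["Ann Lee", "Di Fox"], "director": "Cy Roe",
           "genres": ["Action"], "keywords": ["robot", "heist"]}
movie_3 = {"cast": ["Ed Poe"], "director": "Flo Ray",
           "genres": ["Romance"], "keywords": ["paris"]}

print(count_cosine(make_soup(movie_1), make_soup(movie_2)))  # 4 shared of 6 tokens each
print(count_cosine(make_soup(movie_1), make_soup(movie_3)))  # no shared tokens
```

Plain counts suit this kind of metadata because a shared actor or director should count fully toward similarity, whereas TF-IDF would down-weight people who appear in many movies.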

Results

For Ratings Dataset:

Collaborative Filtering - SVD

Findings: Predicted user ratings for movies using SVD with evaluation metrics MAE and RMSE.

Results

Findings: Top 10 recommended movies for a given user (example: user 1).

Results

Recommendations

  • Further data collection and feature engineering could improve the recommendation accuracy.
  • Regularly updating the model with new movie data can help maintain recommendation relevance.
  • Implementing user feedback mechanisms would allow the recommendations to improve continuously.

Limitations

  • The dataset may contain biases that could affect the recommendations.
  • The recommendation performance is limited by the quality and quantity of the available data.

Future Possibilities of the Project

  • Exploring additional recommendation algorithms and ensemble methods.
  • Implementing deep learning models for better performance.
  • Developing real-time recommendation systems based on user interactions.

References