- Project Overview
- Data Information
- Models Evaluated
- Benchmark Results
- Analysis
- Conclusions
- Getting Started
- Future Work
- Contributing
- License
## Project Overview

This project implements a comprehensive benchmark for fine-tuning various BERT-based pre-trained models on a subset of the IMDB movie review dataset for sentiment classification. We evaluate and compare the performance of these fine-tuned models in terms of accuracy, precision, recall, F1 score, training time, and model size.
## Data Information

We use a subset of the IMDB dataset for fine-tuning and evaluation:
- Dataset: Subset of IMDB Movie Reviews
- Task: Binary Sentiment Classification (Positive/Negative)
- Original Size: 50,000 reviews (25,000 training, 25,000 testing)
- Subset Size: 2,000 reviews (1,000 for training, 1,000 for testing)
- Source: IMDB Dataset
- Features: Text reviews and binary sentiment labels
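For concreteness, here is a minimal sketch of drawing such a subset with the Hugging Face `datasets` library. The repository's actual sampling code may differ; `seed=42` is an assumption for reproducibility.

```python
from datasets import load_dataset

# Load the full IMDB dataset (25,000 training / 25,000 test reviews).
imdb = load_dataset("imdb")

# Draw a random 1,000-review subset from each split; seed=42 is assumed.
train_subset = imdb["train"].shuffle(seed=42).select(range(1000))
test_subset = imdb["test"].shuffle(seed=42).select(range(1000))

print(train_subset)  # features: 'text' (review) and 'label' (0 = negative, 1 = positive)
```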
## Models Evaluated

We fine-tune the following BERT-based pre-trained models:

- BERT (`bert-base-cased`)
- ALBERT (`albert-base-v2`)
- RoBERTa (`roberta-base`)
- DistilBERT (`distilbert-base-uncased`)
Each model is fine-tuned on our IMDB subset, adapting its pre-trained knowledge to the specific task of movie review sentiment classification.
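As an illustration of the fine-tuning setup, the sketch below uses the Hugging Face `transformers` Trainer. The hyperparameters here (3 epochs, batch size 16, learning rate 2e-5, max length 256) are assumptions for the sketch, not the repository's actual settings, which are read from `config/config.yaml`.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # any of the four benchmarked checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 1,000-review subsets, as in the dataset sketch above.
imdb = load_dataset("imdb")
train_subset = imdb["train"].shuffle(seed=42).select(range(1000))
test_subset = imdb["test"].shuffle(seed=42).select(range(1000))

def tokenize(batch):
    # Truncate long reviews; max_length=256 is an assumed setting.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

train_data = train_subset.map(tokenize, batched=True)
test_data = test_subset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,              # assumed
    per_device_train_batch_size=16,  # assumed
    learning_rate=2e-5,              # assumed
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_data, eval_dataset=test_data)
trainer.train()  # fine-tunes the checkpoint on the 1,000-review subset
```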
## Benchmark Results

Results after fine-tuning on the IMDB subset:

| Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Parameters |
|---|---|---|---|---|---|---|
| `bert-base-cased` | 0.847 | 0.827 | 0.869 | 0.847 | 262.45 | 108,311,810 |
| `albert-base-v2` | 0.869 | 0.936 | 0.785 | 0.854 | 224.26 | 11,685,122 |
| `roberta-base` | 0.875 | 0.962 | 0.775 | 0.858 | 221.61 | 124,647,170 |
| `distilbert-base-uncased` | 0.842 | 0.915 | 0.746 | 0.822 | 113.65 | 66,955,010 |
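The columns follow their standard definitions. Continuing the fine-tuning sketch above, they could be computed roughly as follows; using scikit-learn for the metrics is an assumption, and the repository's own evaluation code may differ.

```python
import time

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# `trainer`, `model`, and `test_data` come from the fine-tuning sketch above.
start = time.time()
trainer.train()
training_time = time.time() - start  # the "Training Time (s)" column

preds = trainer.predict(test_data)          # logits for the 1,000 test reviews
y_pred = preds.predictions.argmax(axis=-1)
y_true = preds.label_ids

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")

# Total parameter count, as in the "Parameters" column.
n_params = sum(p.numel() for p in model.parameters())
print(accuracy, precision, recall, f1, training_time, n_params)
```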
## Analysis

- **Accuracy**:
  - Fine-tuned RoBERTa performs best (87.5%), followed closely by ALBERT (86.9%).
  - Fine-tuned BERT and DistilBERT show slightly lower accuracy (84.7% and 84.2%, respectively).
- **Precision and Recall**:
  - Fine-tuned RoBERTa has the highest precision (96.2%) but a lower recall (77.5%) than BERT.
  - Fine-tuned BERT shows the highest recall (86.9%), indicating it recovers the most positive samples.
- **F1 Score**:
  - Fine-tuned RoBERTa leads with an F1 score of 0.858, closely followed by ALBERT (0.854). A harmonic-mean sanity check of these scores appears after this list.
  - Fine-tuned DistilBERT has the lowest F1 score (0.822), suggesting a less balanced trade-off between precision and recall.
- **Training Time**:
  - Fine-tuning DistilBERT is significantly faster (113.65 s), roughly half the time of the other models.
  - Fine-tuning BERT takes the longest (262.45 s), while RoBERTa and ALBERT have similar fine-tuning times.
- **Model Size**:
  - ALBERT is by far the smallest model (11.7M parameters), making it efficient for deployment.
  - RoBERTa is the largest (124.6M parameters), followed by BERT (108.3M parameters).
- **Efficiency vs. Performance**:
  - Fine-tuned ALBERT offers the best balance of performance and efficiency, with high accuracy and the smallest model size.
  - Fine-tuned RoBERTa provides the highest accuracy, but at the cost of the largest model size.
  - Fine-tuned DistilBERT, while less accurate, trains in about half the time of BERT with a considerably smaller model.
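Because F1 is the harmonic mean of precision and recall, the table can be sanity-checked directly: for RoBERTa, 2 × (0.962 × 0.775) / (0.962 + 0.775) ≈ 0.858, matching the reported score. The snippet below runs the same check for all four models.

```python
# Recompute F1 = 2PR / (P + R) from the reported precision and recall.
reported = {
    "bert-base-cased":         (0.827, 0.869, 0.847),
    "albert-base-v2":          (0.936, 0.785, 0.854),
    "roberta-base":            (0.962, 0.775, 0.858),
    "distilbert-base-uncased": (0.915, 0.746, 0.822),
}

for name, (p, r, f1) in reported.items():
    f1_check = 2 * p * r / (p + r)
    print(f"{name}: reported F1 {f1:.3f}, recomputed {f1_check:.3f}")
```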
## Conclusions

After fine-tuning on the IMDB subset:

- For the highest accuracy: choose RoBERTa
- For the best balance of performance and efficiency: choose ALBERT
- For the fastest fine-tuning (and, typically, the fastest inference): choose DistilBERT
The choice of model depends on the specific requirements of the application, balancing factors such as accuracy, speed, and resource constraints. These results demonstrate the effectiveness of fine-tuning pre-trained BERT-based models even on a small subset of task-specific data.
## Getting Started

- Clone the repository:

  ```bash
  git clone /~https://github.com/MahtabRanjbar/BERT-Model-Benchmark-for-IMDB-Sentiment-Classification.git
  cd BERT-Model-Benchmark-for-IMDB-Sentiment-Classification
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure the fine-tuning and benchmark parameters in `config/config.yaml` (a hypothetical example is sketched after this list).

- Run the benchmark:

  ```bash
  python src/main.py
  ```

- Results will be saved in `benchmark_results.csv` and `benchmark_results.png`.
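The exact keys in `config/config.yaml` are specific to this repository; purely as a hypothetical illustration, a configuration of this kind might look like the following (every key and value below is an assumption, not the file's real contents).

```yaml
# Hypothetical example only -- see config/config.yaml in the repo for the real keys.
models:
  - bert-base-cased
  - albert-base-v2
  - roberta-base
  - distilbert-base-uncased
dataset:
  name: imdb
  train_size: 1000
  test_size: 1000
training:
  epochs: 3            # assumed
  batch_size: 16       # assumed
  learning_rate: 2e-5  # assumed
```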
## Future Work

- Experiment with larger variants of these models (e.g., BERT-large, RoBERTa-large)
- Increase the subset size to investigate performance on larger datasets
- Implement cross-validation for more robust results (a k-fold sketch appears after this list)
- Explore advanced fine-tuning strategies to improve performance
- Benchmark on other sentiment analysis datasets for comparison
- Investigate the impact of different preprocessing techniques
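For the cross-validation item, one possible starting point is a k-fold split of the full 2,000-review subset with scikit-learn; the fold count and seed below are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(2000)  # the combined 2,000-review subset
kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # 5 folds, assumed

for fold, (train_idx, val_idx) in enumerate(kfold.split(indices)):
    # Re-initialize and fine-tune the model on train_idx, evaluate on val_idx,
    # then average the metrics across folds for a more robust estimate.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation reviews")
```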