The streamlit app is developed upon a classification model that is trained on a Portuguese banking institution data to make predictions whether a bank customer would subscribe to a term deposit. For model training, full data set is deployed, with approximately 41.4k rows and 21 columns. Then, the trained model is prepared for cloud deployment. The full workflow can be found down below.
The model is uploaded to streamlit for public use to demonstrate applicability and performance of the trained model. The model is consisted of 6 features, which are selected by RFECV algorithm as the most significant variables in terms of variance explanability. Trained model is capable of making prediction with approximately 95% accuracy with following features;
- Occupation
- Last contact day of the week
- Number of contacts performed during this campaign and for this client
- Outcome of the previous marketing campaign
- Consumer price index - monthly indicator
- Euribor 3 month rate - daily indicator
Responses are converted to binary variables, which were labeled as "Yes" and "No" originally, for model development purposes. Approximately, 89% of the labels are "No", or 0, and the remaining are "Yes", or 1. Thereon, two different approaches were followed to develop the best model in terms of higher prediction capability.
First, due to the imbalanced nature of the target variable, the under-sampled target variable "yes" is synthetically re-populated using SMOTE. Thereon, the main performance metric is selected as the accuracy score of prediction. Next, under-sampled target variable, which is not be re-populated, and thus, the main performance metric is Precision - Recall Curve and scores.
Moreoever, feature selection is conducted using Recursive Feature Elimination (RFE). RFE allows model to assign importance for each feature deployed. Later, the weighted features are ranked in an accordance with their corresponding importance score. The least significant features, in terms of ranking, are pruned from the model.
Two different approaches yielded almost identical performances in terms of making predictions.
Model | Performance Metric | Score |
---|---|---|
Imbalanced XGBClassifier | Precision + Recall | 0.95 |
Balanced XGBClassifier | Accuracy | 0.943 |
However, upon completion of feature selection, RFECV algorithm concluded significantly less features as the most important ones, with balanced data. On the contrary, Imbalanced data required more features to be able to achieve this accuracy score.
Conclusion, balanced data performed superior with less data requirement. Therefore, the streamlit app will be built upon using the features selected by RFECV as the most important with balanced data.