This project aims to predict credit risk using statistical and machine learning models. We analyze a dataset containing customer demographic and financial information, applying techniques like Logistic Regression, Random Forest, and XGBoost. Advanced visualizations help us explore patterns and assess model performance.
The dataset consists of the following variables:
Variable | Description |
---|---|
Age | Age of the applicant |
Sex | Gender of the applicant |
Job | Job classification |
Housing | Type of housing (Own, Rent, Free) |
Saving.accounts | Savings account balance category |
Checking.account | Checking account balance category |
Credit.amount | Loan/Credit amount requested |
Duration | Loan duration in months |
Purpose | Purpose of the credit (car, education, business) |
Credit.risk | Target variable (1 = High Risk, 0 = Low Risk) |
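A minimal sketch of loading the file and checking that these columns are present; the file name `credit_data.csv` matches the setup section below, and the column names follow the table above:

```r
# Load the dataset and confirm the expected columns
data <- read.csv("credit_data.csv", stringsAsFactors = TRUE)

str(data)       # type of each variable
summary(data)   # quick distribution overview, including NA counts

expected <- c("Age", "Sex", "Job", "Housing", "Saving.accounts",
              "Checking.account", "Credit.amount", "Duration",
              "Purpose", "Credit.risk")
setdiff(expected, names(data))   # should return character(0)
```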
Exploratory Data Analysis (EDA)
- Distribution plots, correlation heatmaps, and boxplots.
- Feature engineering and handling missing values (a short R sketch follows below).
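A minimal sketch of these EDA steps, continuing from the `data` frame loaded above. Treating missing account categories as an explicit "unknown" level is an illustrative choice, not the project's required strategy:

```r
library(corrplot)

# Count missing values per column
colSums(is.na(data))

# Illustrative handling of missing values: keep the rows and mark missing
# account categories as an explicit "unknown" level
data$Saving.accounts  <- factor(ifelse(is.na(data$Saving.accounts),
                                       "unknown", as.character(data$Saving.accounts)))
data$Checking.account <- factor(ifelse(is.na(data$Checking.account),
                                       "unknown", as.character(data$Checking.account)))

# Correlation heatmap of the numeric variables
num_vars <- data[sapply(data, is.numeric)]
corrplot(cor(num_vars, use = "complete.obs"), method = "color", type = "upper")
```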
Predictive Modeling
- Logistic Regression: Interpretable linear model for credit risk classification.
- Random Forest: Ensemble learning technique for improving prediction accuracy.
- XGBoost: Gradient boosting algorithm for optimizing model performance (a training sketch follows below).
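A sketch of how the three models could be fitted, continuing from the cleaned `data` frame above. The 70/30 split, `ntree = 500`, and `nrounds = 200` are illustrative choices, not tuned values:

```r
library(randomForest)
library(xgboost)

set.seed(42)
data$Credit.risk <- factor(data$Credit.risk)   # 1 = high risk, 0 = low risk
idx   <- sample(nrow(data), 0.7 * nrow(data))
train <- data[idx, ]
test  <- data[-idx, ]

# Logistic regression: interpretable linear baseline
logit_fit <- glm(Credit.risk ~ ., data = train, family = binomial)

# Random forest: tree ensemble, usually stronger out of the box
rf_fit <- randomForest(Credit.risk ~ ., data = train, ntree = 500)

# XGBoost: gradient boosting on a numeric design matrix
x_train <- model.matrix(Credit.risk ~ . - 1, data = train)
x_test  <- model.matrix(Credit.risk ~ . - 1, data = test)
y_train <- as.numeric(train$Credit.risk) - 1   # 0/1 labels
xgb_fit <- xgb.train(params = list(objective = "binary:logistic"),
                     data = xgb.DMatrix(x_train, label = y_train),
                     nrounds = 200)
```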
Model Evaluation Metrics
- Confusion Matrix
- ROC Curve & AUC
- Feature Importance
- Calibration & Lift Curves (an evaluation sketch follows below)
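A sketch of the evaluation step with caret and pROC, assuming the fitted models and the `test` split from the training sketch above; the 0.5 classification cutoff is illustrative:

```r
library(caret)
library(pROC)

# Predicted probabilities of high risk on the held-out test set
p_logit <- predict(logit_fit, newdata = test, type = "response")
p_rf    <- predict(rf_fit,    newdata = test, type = "prob")[, "1"]
p_xgb   <- predict(xgb_fit,   x_test)

# Confusion matrix for the random forest at an illustrative 0.5 cutoff
rf_class <- factor(ifelse(p_rf > 0.5, "1", "0"), levels = levels(test$Credit.risk))
confusionMatrix(rf_class, test$Credit.risk, positive = "1")

# ROC curves and AUC for each model
roc_logit <- roc(test$Credit.risk, p_logit)
roc_rf    <- roc(test$Credit.risk, p_rf)
roc_xgb   <- roc(test$Credit.risk, p_xgb)
plot(roc_logit); lines(roc_rf, col = "blue"); lines(roc_xgb, col = "red")
auc(roc_logit); auc(roc_rf); auc(roc_xgb)

# Feature importance from the tree-based models
varImpPlot(rf_fit)
xgb.plot.importance(xgb.importance(model = xgb_fit))

# Calibration and lift curves can be drawn with caret::calibration() and caret::lift()
```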
The following visualizations support the analysis (a ggplot2 sketch of the credit amount plots follows the list):
- A histogram showing how credit amounts are distributed across applicants.
- Plots of how credit amounts differ by housing status, by purpose, and by savings account category.
- A correlation heatmap highlighting relationships between the numerical variables.
- Plots evaluating classification performance for the different models.
- A feature importance plot showing the variables that most influence credit risk prediction.
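A possible way to produce the credit amount plots described above with ggplot2, assuming the cleaned `data` frame; binning and aesthetics are illustrative:

```r
library(ggplot2)

# Histogram of credit amounts across applicants
ggplot(data, aes(x = Credit.amount)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(title = "Distribution of credit amount", x = "Credit amount", y = "Count")

# Credit amount by housing status
ggplot(data, aes(x = Housing, y = Credit.amount, fill = Housing)) +
  geom_boxplot() +
  labs(title = "Credit amount by housing status")

# Credit amount by purpose
ggplot(data, aes(x = Purpose, y = Credit.amount, fill = Purpose)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Credit amount by purpose")

# Credit amount by savings account category
ggplot(data, aes(x = Saving.accounts, y = Credit.amount, fill = Saving.accounts)) +
  geom_boxplot() +
  labs(title = "Credit amount by savings account category")
```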
- Age and Duration have a strong impact on credit risk assessment.
- Housing and Checking Account Balance are key determinants in predicting loan defaults.
- Random Forest and XGBoost outperform Logistic Regression in predictive accuracy.
- AUC Score Comparison:
  - Logistic Regression: 0.76
  - Random Forest: 0.84
  - XGBoost: 0.87
```r
# Install the required packages
# Note: there is no CRAN package named "shap"; SHAPforxgboost is used here
# (an assumption) to provide SHAP-based feature importance for xgboost models
install.packages(c("ggplot2", "corrplot", "pROC", "randomForest",
                   "xgboost", "caret", "SHAPforxgboost"))

library(ggplot2)         # visualizations
library(corrplot)        # correlation heatmaps
library(pROC)            # ROC curves and AUC
library(randomForest)    # random forest models
library(xgboost)         # gradient boosting
library(caret)           # resampling, tuning, confusion matrices
library(SHAPforxgboost)  # SHAP values for xgboost

# Load the data and run the full analysis script
data <- read.csv("credit_data.csv")
source("credit_risk_analysis.R")
```
This project successfully predicts credit risk using advanced statistical and machine learning techniques. The results show that ensemble models like Random Forest and XGBoost significantly improve predictive accuracy.
- Hyperparameter tuning to further improve model performance.
- Deploying the model as an API for real-time risk prediction.
- Exploring deep learning approaches for credit scoring.
- Joel Mande
- LinkedIn: https://linkedin.com/in/wajoel
- GitHub: /~https://github.com/wajoel