This project contains a dynamic risk assessment system in which a customer churn model is monitored after a simulated deployment.
Customer churn refers to clients that have a high probability of ceasing to use the services provided by a company, a.k.a. attrition risk. This is a common business problem faced by all companies; a key principle behind it is that it's easier to keep an existing customer than to acquire a new one. Thus, the goal is to predict churn and to prevent it.
I took the starter code for this project from the Udacity Machine Learning DevOps Engineer Nanodegree and modified it to the present form, which deviates significantly from the original version.
The focus of this project doesn't lie so much on the data processing or modeling, but on the techniques and technologies used for model/pipeline monitoring after deployment; in fact, dummy datasets are used instead of realistic ones. Monitoring is achieved by implementing methods that enable these functionalities:
- Data ingestion: new data can be checked ad-hoc.
- Re-training, Re-scoring and Re-deploying: with new data, data and model drift can be computed; if there is drift, the monitoring system is able to re-train and re-deploy a model/pipeline.
- Online diagnostics via an API: any deployed model/pipeline can be diagnosed by stakeholders using a REST API.
- Status reporting: beyond API diagnoses, complete model status reports can be generated.
- Process automation with cron jobs: the complete monitoring process can be automated, from data check to re-deployment and reporting; additionally, the execution can run with the desired frequency.
- SQL database: monitoring records are persisted to a SQLite database using SQLAlchemy.
The dataset is composed of 5 CSV files, with 5 columns each, distributed as follows:
data
├── ingested/ # Ingested data folder (populated when run)
│ └── ...
├── development # Data used during development
│ ├── dataset1.csv # Shape: (17, 5)
│ └── dataset2.csv # Shape: (19, 5)
├── production # Production data, for re-training
│ ├── dataset3.csv # Shape: (11, 5)
│ └── dataset4.csv # Shape: (15, 5)
└── test # Data for model testing
└── test_data.csv # Shape: (5, 5)
The files contain fabricated information about hypothetical corporations and, as shown, they consist of fewer than 20 entries/rows each. The 5 common columns are the following:
- `corporation`: fictional name of the customer company (hashed name)
- `lastmonth_activity`: number of services/goods provided last month
- `lastyear_activity`: number of services/goods provided last year
- `number_of_employees`: number of employees at the customer company
- `exited`: target, whether the customer company ceased to buy services/goods.
In summary, the dataset consists of 3 numerical features and a binary target; the task is a binary classification which predicts customer company churn.
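For a quick orientation, this is how the features and target could be separated with pandas (column names as listed above; the actual preprocessing in the project scripts may differ):

```python
import pandas as pd

# Load the small test dataset (shape: (5, 5))
df = pd.read_csv("data/test/test_data.csv")

# The 3 numerical features and the binary target
X = df[["lastmonth_activity", "lastyear_activity", "number_of_employees"]]
y = df["exited"]

print(X.shape)
print(y.value_counts())
```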
The directory of the project consists of the following files:
.
├── Instructions.md # Summary of instructions by Udacity
├── README.md # This file
├── api_calls.py # Calls to the API
├── app.py # API endpoints
├── assets/ # Images and additional files
│ └── ...
├── config.json # Configuration parameters for scripts
├── cronjob.txt # Cron job
├── data/ # Dataset(s), dev and prod
│ └── ...
├── db # SQLite database with monitoring records
│ └── ...
├── db_setup.py # Monitoring database definition
├── deployment.py # Deploys a trained model
├── diagnostics.py # Model and data diagnostics
├── full_process.py # Checks whether re-deployment is needed
├── ingestion.py # Ingests new data
├── models/ # Training artifacts (dev and prod)
│ └── ...
├── reporting.py # Reports about model metrics
├── requirements.txt # Deployment dependencies
├── scoring.py # Model scoring
├── training.py # Model training, artifacts generated
└── wsgi.py # API deployment
Once we have created an environment and installed the required dependencies, we can run the scripts separately as follows:
# Ingest data
python ingestion.py
# Train model/pipeline
python training.py
# Deploy model/pipeline
python deployment.py
# Score model/pipeline
python scoring.py
# Report
python reporting.py
# Run diagnostics
python diagnostics.py
Alternatively, we can run the full process as follows:
python full_process.py
If we want to start the diagnostics API, we need to run:
# Shell 1: Start the API on the web server: http://localhost:8000
python app.py
# Shell 2: API access calls
python api_calls.py
The section Monitoring Implementation explains in more detail what happens in each of the steps/scripts.
To set everything up, we need to create a custom environment and install the required dependencies. A quick recipe that does this with conda is the following:
# Create an environment
conda create --name churn pip
conda activate churn
# Install pip dependencies
pip install -r requirements.txt
As mentioned in the introduction, this mini-project focuses on monitoring techniques. Monitoring is essential in production after having deployed a machine learning model, because it helps address inevitable issues that will appear in our system, such as:
- Data drift: sooner or later, the distribution of data features that arrive to the model will change as compared to the original training dataset; we need to detect that to re-train and re-deploy the inference pipeline.
- Data integrity: some features might be missing or corrupt; we should detect and process them.
- Model accuracy might decrease with time, e.g., because the business context changes (more customers leave than usual because of the global economic situation); we should detect that to re-train the model.
- New component versions might destabilize the system; we should detect and fix those dependency inconsistencies.
- etc.
To address those issues, monitoring is applied in 5 aspects:
- Data Ingestion
- Training, Scoring, Deploying
- Diagnostics
- Reporting
- Process Automation
Additionally, a SQL database is created to persist the records that are produced during the monitoring.
Note that the file paths in the following subsections refer to the `development` stage; in a `production` stage:

- The data is ingested from `data/production`.
- The training, scoring and diagnostics artifacts are output to `models/production`.
The distinction between `development` and `production` can be controlled by manually updating `config.json` as follows:

- In `development`: `"input_folder_path": "data/development"`, `"output_model_path": "models/development"`
- In `production`: `"input_folder_path": "data/production"`, `"output_model_path": "models/production"`
Additionally, in a real production environment, the `flask_secret_key` field should be removed from `config.json` in the repository; instead, we should either (i) use a secrets manager, (ii) use environment variables, or (iii) avoid committing the production `config.json` altogether.
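For reference, here is a minimal sketch of how a script might read these fields; only the two fields shown above are assumed, and the real `config.json` contains more parameters:

```python
import json

# Load the central configuration shared by all scripts
with open("config.json") as f:
    config = json.load(f)

input_folder_path = config["input_folder_path"]   # e.g., "data/development"
output_model_path = config["output_model_path"]   # e.g., "models/development"
```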
The script `ingestion.py` is responsible for merging data from different sources. Additionally, a record of source information is stored in order to backtrace the origin of the values.

Like the rest of the scripts, `ingestion.py` relies on `config.json`, which defines all the parameters (i.e., filenames, paths/URLs, etc.).
Produced outputs (a minimal sketch of the merge step follows the list):

- `data/ingested/final_data.csv`: merged dataset.
- `data/ingested/ingested_files.csv`: dataset origin info related to the merge (path, entries, timestamp).
- Records in `db/monitoring.sqlite`: the contents from `ingested_files.csv` are appended to the `Ingestions` table.
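A minimal sketch of the merge step, assuming the paths listed above; the append to the `Ingestions` table is omitted here and deduplication is simplified:

```python
import os
from datetime import datetime

import pandas as pd

input_folder = "data/development"  # taken from config.json in the real script
os.makedirs("data/ingested", exist_ok=True)

frames, records = [], []
for name in sorted(os.listdir(input_folder)):
    if name.endswith(".csv"):
        path = os.path.join(input_folder, name)
        df = pd.read_csv(path)
        frames.append(df)
        # Origin info: path, number of entries, timestamp
        records.append({"path": path, "entries": len(df),
                        "timestamp": datetime.now().isoformat()})

# Merge all sources and drop exact duplicates
final_data = pd.concat(frames, ignore_index=True).drop_duplicates()
final_data.to_csv("data/ingested/final_data.csv", index=False)
pd.DataFrame(records).to_csv("data/ingested/ingested_files.csv", index=False)
```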
After loading all the necessary parameters from `config.json`, the following three files train the inference pipeline (model), evaluate its performance (i.e., score it on a test dataset) and deploy it to the production location (sub-tasks listed; a condensed sketch follows the list):

- `training.py`:
  - Read the merged dataset: `data/ingested/final_data.csv`
  - Define and train a logistic regression model
  - Save the model pickle: `models/development/trained_model.pkl`
- `scoring.py`:
  - Load the saved model pickle: `models/development/trained_model.pkl`
  - Load the test dataset: `data/test/test_data.csv`
  - Compute the F1 score of the model on the test dataset
  - Persist the score records to file: `models/development/latest_score.csv`
  - Persist the score records to the overall database: `db/monitoring.sqlite`
- `deployment.py`:
  - Copy the following files from the development/practice folders to the deployment folder `deployment`:
    - The trained model: `models/development/trained_model.pkl`
    - The records of the ingested data files used for training: `data/ingested/ingested_files.csv`
    - The records of the model scores: `models/development/latest_score.csv`
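A condensed sketch of the training and scoring steps described above; the hyperparameters and the exact persistence format of the score records are assumptions:

```python
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

features = ["lastmonth_activity", "lastyear_activity", "number_of_employees"]

# training.py: fit a logistic regression on the merged dataset and pickle it
train_df = pd.read_csv("data/ingested/final_data.csv")
model = LogisticRegression()
model.fit(train_df[features], train_df["exited"])
with open("models/development/trained_model.pkl", "wb") as f:
    pickle.dump(model, f)

# scoring.py: compute the F1 score on the test dataset
test_df = pd.read_csv("data/test/test_data.csv")
preds = model.predict(test_df[features])
print("F1 score:", f1_score(test_df["exited"], preds))
```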
The script `diagnostics.py` is responsible for tracking dataset properties (to detect data drift) and model predictions (to detect model drift). Additionally, it measures operational aspects (timings, dependencies) to detect anomalies. All in all, it (see the sketch after the list):

- Performs model predictions with a test dataset: `data/test/test_data.csv`.
- Provides statistics of the training dataset, i.e., column means, medians, std. devs., NAs (count and percentage).
- Computes the execution time of the `ingestion.py` and `training.py` scripts.
- Provides information on dependencies: expected vs. actual version per package; the `requirements.txt` file is used.
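A minimal sketch of two of these diagnostics, i.e., summary statistics and script timing; the function names and the approach of timing the scripts as subprocesses are assumptions:

```python
import subprocess
import timeit

import pandas as pd

def dataset_statistics(csv_path="data/ingested/final_data.csv"):
    """Column means, medians, std. devs. and NAs (count and percentage)."""
    numeric = pd.read_csv(csv_path).select_dtypes("number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        "na_count": numeric.isna().sum(),
        "na_pct": numeric.isna().mean() * 100,
    })

def script_timing(script):
    """Wall-clock time of running a script as a subprocess."""
    start = timeit.default_timer()
    subprocess.run(["python", script], check=True)
    return timeit.default_timer() - start

print(dataset_statistics())
print({s: script_timing(s) for s in ["ingestion.py", "training.py"]})
```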
Reporting is accomplished with three scripts:
The file `reporting.py` uses the `model_prediction()` function from `diagnostics.py` to predict the classes from `data/test/test_data.csv` and generate a confusion matrix, which is saved to `models/development/confusion_matrix.png`.
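As an illustration, the confusion-matrix generation could look roughly as follows; here the prediction step is inlined instead of calling `model_prediction()` from `diagnostics.py`:

```python
import pickle

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay

features = ["lastmonth_activity", "lastyear_activity", "number_of_employees"]
test_df = pd.read_csv("data/test/test_data.csv")

# Load the trained model
with open("models/development/trained_model.pkl", "rb") as f:
    model = pickle.load(f)

# Plot and save the confusion matrix
ConfusionMatrixDisplay.from_predictions(
    test_df["exited"], model.predict(test_df[features])
)
plt.savefig("models/development/confusion_matrix.png")
```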
If we run the file `app.py`

python app.py

it creates and serves an API based on Flask with the following endpoints, which can be used from another terminal session or with the browser:
# Predict in batch given a path to a dataset, e.g., data/test/test_data.csv
curl "http://localhost:8000/prediction?filename=data/test/test_data.csv"
# Predict in batch given a path to a dataset and compute F1 score;
# if no filename passed, data/test/test_data.csv is used.
curl "http://localhost:8000/scoring"
# Given a dataset, compute its summary stats, i.e.,
# for each column/feature: mean, median, std. dev., NAs;
# if no filename passed, data/ingested/final_data.csv is used.
# Note: HTML table is returned.
curl "http://localhost:8000/summarystats"
# Check the time necessary for ingestion and training.
curl "http://localhost:8000/diagnostics/timing"
# Check the dependencies.
# Note: HTML table is returned.
curl "http://localhost:8000/diagnostics/dependencies"
# Redirect to '/diagnostics/timing'.
curl "http://localhost:8000/diagnostics"
In `app.py`, the functions from `diagnostics.py` are used to compute the responses.
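For illustration, a minimal sketch of how the `/prediction` endpoint could be implemented with Flask; the call to `diagnostics.model_prediction()` and its signature are assumptions based on the description above:

```python
import pandas as pd
from flask import Flask, jsonify, request

import diagnostics  # assumed to expose model_prediction(df)

app = Flask(__name__)

@app.route("/prediction", methods=["GET"])
def prediction():
    # Default to the test dataset when no filename is passed
    filename = request.args.get("filename", "data/test/test_data.csv")
    df = pd.read_csv(filename)
    preds = diagnostics.model_prediction(df)
    return jsonify([int(p) for p in preds])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```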
Finally, the file `api_calls.py` uses all those API endpoints and writes their responses to the file `models/development/api_returns.txt`.

As always, any necessary parameters (i.e., paths, filenames, etc.) are taken from `config.json`.
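A minimal sketch of what `api_calls.py` might do, using the `requests` library and the endpoints listed above:

```python
import requests

BASE = "http://localhost:8000"
endpoints = [
    "/prediction?filename=data/test/test_data.csv",
    "/scoring",
    "/summarystats",
    "/diagnostics/timing",
    "/diagnostics/dependencies",
]

# Call each endpoint and persist the responses
with open("models/development/api_returns.txt", "w") as f:
    for endpoint in endpoints:
        response = requests.get(BASE + endpoint, timeout=10)
        f.write(f"{endpoint}\n{response.text}\n\n")
```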
It is possible to run all the scripts mentioned so far in sequence, as done by `full_process.py`; the following image shows the complete monitoring workflow implemented in that file:

As we can see, `full_process.py` performs these actions (a minimal sketch of the checks is shown after the list):
- Check if there is new data; if so, ingest it and continue
- Check if there is model or data drift; if so:
- Re-train
- Re-deploy
- Run reporting for the re-deployed model
- Compute new score for the re-deployed model
- Run diagnostics for the re-deployed model
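A minimal sketch of the new-data and drift checks; the `f1` column name in the score file and the exact file locations are assumptions:

```python
import os

import pandas as pd

def new_data_available(input_folder="data/production",
                       ingested_record="data/ingested/ingested_files.csv"):
    """True if the input folder contains CSV files not ingested yet."""
    ingested = set()
    if os.path.exists(ingested_record):
        ingested = set(pd.read_csv(ingested_record)["path"])
    available = {os.path.join(input_folder, f)
                 for f in os.listdir(input_folder) if f.endswith(".csv")}
    return not available.issubset(ingested)

def model_drift(new_f1, deployed_score_file="deployment/latest_score.csv"):
    """Naive drift check: the new F1 score is worse than the deployed one."""
    deployed_f1 = pd.read_csv(deployed_score_file)["f1"].iloc[-1]
    return new_f1 < deployed_f1
```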
We can also launch the API app in the same execution, but that requires killing any previously running API instance. The monitoring process can be further automated by creating a cron job which executes `full_process.py` every 10 minutes, as defined in `cronjob.txt`. To create a cron job on a Unix machine, we can follow this recipe:
# Start the cron service, if not done already
service cron start
# Edit the crontab and add the job (run every 10 minutes):
# */10 * * * * python /home/mikel/git_repositories/churn_model_monitoring/full_process.py
crontab -e
# In vim: i (insert), add the line above, ESC, :wq (save & quit)
# Load the crontab and display jobs
crontab -l
The current implementation produces the monitoring records of each run as CSV files and also as entries in a SQLite database. The CSV files are plain-text files that can be moved along with other artifacts; as such, they complement the inference pipeline. On the other hand, the SQLite database stores the records of all runs together; it is created in `db/monitoring.sqlite` when `ingestion.py` or `scoring.py` are executed. Its implementation is in `db_setup.py`.
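A minimal sketch of how the two tables could be declared with SQLAlchemy in `db_setup.py`; the actual column names and types in the project are assumptions here:

```python
from sqlalchemy import Column, DateTime, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Ingestion(Base):
    __tablename__ = "Ingestions"
    id = Column(Integer, primary_key=True)
    path = Column(String)        # source CSV path
    entries = Column(Integer)    # number of rows ingested
    timestamp = Column(DateTime)

class Score(Base):
    __tablename__ = "Scores"
    id = Column(Integer, primary_key=True)
    f1 = Column(Float)           # F1 score on the test dataset
    timestamp = Column(DateTime)

# Create the database file and tables if they do not exist (db/ must exist)
engine = create_engine("sqlite:///db/monitoring.sqlite")
Base.metadata.create_all(engine)
```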
The current database `db/monitoring.sqlite` has two tables: `Ingestions` and `Scores`. As with any SQLite database, we can use the CLI tool to query its contents:
# Install SQLite - on a Mac:
brew install sqlite
# Go to database folder
cd db
sqlite3
.open monitoring.sqlite
.tables
# Ingestions Scores
# Show content of the tables
SELECT * FROM Ingestions;
# ...
SELECT * FROM Scores;
# ...
# Exit
.quit
For more information on SQL and related tools, check my sql_guide.
This is a basic monitoring project in which a very simple model/pipeline is created. However, it defines a valid, general framework that uses a minimum of 3rd-party tools; moreover, we can easily scale the complexity of the inference pipeline, e.g., by defining it in a separate library which is accessed by the scripts in this project. As such, this repository can be used as a blueprint for basic monitoring. Without resorting to dedicated 3rd-party monitoring tools, one could improve the project as follows:
- Store datasets and records in SQL databases, e.g., with MySQL Connector/Python, SQLite and SQLAlchemy.
- Generate PDF reports which aggregate all outcomes (plots, summary statistics, etc.); check: reportlab.
- Store time trends: timestamp the reported results and store them (e.g., NAs, latency, etc.).
- Add tests.
My notes and guides:
- My personal notes on the Udacity MLOps Nanodegree: mlops_udacity.
- My personal guide on SQL and related tools, such as SQLAlchemy: sql_guide.
- My boilerplate for reproducible ML pipelines using MLflow and Weights & Biases: music_genre_classification.
- Notes on how to transform research code into production-level packages: customer_churn_production.
- My summary of data processing and modeling techniques: eda_fe_summary.
- My notes on the Udemy course Deployment of Machine Learning Models by Soledad Galli & Christopher Samiullah: deploying-machine-learning-models.
Other links:
- MLOps: What It Is, Why It Matters, and How to Implement It
- MLOps Core
- 10 Amazing MLOps Learning Resources
- MLOps-Reducing the technical debt of Machine Learning
Mikel Sagardia, 2022.
No guarantees.
If you find this repository useful, you're free to use it, but please link back to the original source.