Objective: Establish a robust data engineering pipeline that leverages data quality checks to deliver timely, accurate, and reliable data-driven insights for healthcare revenue cycle management (RCM). This pipeline will generate fact and dimension tables to support reporting teams in generating critical KPIs.
Healthcare providers face challenges in managing their revenue cycle. Inefficient processes, data quality issues, and a lack of insights can lead to delayed payments, increased costs, and reduced operational efficiency.
RCM is a complex process with multiple stages, including:
- Patient Visit: Patient information and insurance details are collected.
- Service Delivery: Healthcare services are provided by doctors and hospitals.
- Billing: Bills are generated and sent to insurance providers or patients.
- Claims Processing: Insurance companies review claims and make payment decisions.
- Payment and Follow-up: Payments are collected and follow-up actions are taken for outstanding balances.
- Performance Tracking: Key performance indicators (KPIs) are monitored for improvement areas.
RCM ensures the hospital can provide quality care while maintaining financial health.
- Accounts Receivable (AR): Money the hospital must collect from patients and insurance companies for services rendered.
- Accounts Payable (AP): Money the hospital owes to stakeholders like doctors, staff, and suppliers.
Key Risk Factors and Challenges:
- Patient Responsibility: A significant risk arises when patients are responsible for a portion or the entire cost of their care.
Objectives for Accounts Receivable:
- Maximize Patient Payments: Encourage timely and complete payments from patients.
- Minimize Collection Period: Reduce the time it takes to collect outstanding balances.
The project employs a Medallion architecture for the data flow:
Landing (flat file) -> Bronze (Parquet files) -> Silver (Delta table) -> Gold (Delta table)
Data Sources:
- Electronic Medical Records (EMR) Data (Azure SQL Database)
- Claims Data (Flat files uploaded monthly provided by insurance companies to ADLS Gen2)
- NPI Data (Sourced from a public API)
- ICD Codes (Sourced from a public API)
- CPT Codes (Flat files uploaded monthly to ADLS Gen2)
Key Components:
- Data Ingestion: Azure Data Factory (ADF) for data ingestion from various sources.
- Data Transformation: Data cleaning, validation, normalization, and enrichment using Azure Databricks.
- Data Storage: Azure Data Lake Storage Gen2 for storing raw and processed data.
- Data Warehouse: Delta Lake tables for storing the refined data.
- Data Quality Monitoring: Regular checks to ensure data accuracy and completeness.
- Data Security: Azure Key Vault for storing passwords and other credential keys.
Fact Table: fact_transactions
This table stores the core transactional data related to healthcare services provided. It captures the revenue-generating events that are the primary focus of RCM analysis. Each row represents a single transaction (e.g., a specific medical procedure performed on a patient).
Dimension Tables:
- dim_patient: Stores information about patients.
- dim_provider: Stores information about healthcare providers (doctors, nurses, etc.).
- dim_department: Stores information about hospital departments (e.g., Cardiology, Oncology).
- dim_icd_code: Stores information about ICD codes (International Classification of Diseases).
- dim_npi: Stores NPI (National Provider Identifier) data.
- dim_cpt_code: Stores information about CPT codes (Current Procedural Terminology).
- Common Data Model (CDM): Standardization ensures alignment with industry standards and best practices. Mapping rules reconcile schema and data elements between hospitals.
- Data Quality Checks: Data cleansing techniques remove or correct errors, inconsistencies, and outliers. Validation checks ensure data accuracy and completeness. Consistency across sources and transformations is verified (a minimal sketch follows this list).
- Slowly Changing Dimensions (SCD2): Captures changes in dimension tables over time. Historical data is maintained for trend analysis and reporting. Effective date and end date columns track changes and validity periods.
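A minimal sketch of the quality-check idea described above, as it might look in a Databricks notebook. The Bronze paths and the column names (ProviderID, FirstName) are assumptions; rows that fail the checks are flagged with is_quarantined rather than dropped.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed Bronze locations for the providers data from both hospitals.
hos_a = spark.read.parquet("/mnt/bronze/hosa/providers")
hos_b = spark.read.parquet("/mnt/bronze/hosb/providers")
providers = hos_a.unionByName(hos_b)

# Flag rows with null key fields or duplicate Provider IDs as quarantined.
dup_window = Window.partitionBy("ProviderID")
providers = providers.withColumn(
    "is_quarantined",
    F.when(
        F.col("ProviderID").isNull()
        | F.col("FirstName").isNull()
        | (F.count("*").over(dup_window) > 1),
        True,
    ).otherwise(False),
)
```

Quarantining instead of deleting keeps the rejected rows visible for later investigation while the Gold layer only consumes non-quarantined records.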
Before proceeding with the main data pipeline, a preliminary step involves loading the EMR data from CSV files into Azure SQL Database tables for both hospitals (Hospital A and Hospital B). This ensures the SQL databases are populated with the necessary data for subsequent pipeline stages. This is achieved through an Azure Data Factory (ADF) pipeline named pipe_to_insert_data_to_sql_table_preprocessing.
Data Loading Logic: The pipeline copies data from the CSV files into the corresponding tables within the respective SQL databases.
Outcome: Upon successful execution of the pipeline, the SQL databases are populated with the EMR data from the CSV files, making the data available for the subsequent stages of the main data pipeline.
Data Sources:
- CSV files representing EMR data for each hospital. These files are uploaded to a dedicated container (raw-data-for-sql-database) within the provided ADLS Gen2 storage account (adlshealthcareprojectdev). The files are organized within the folders HospitalA and HospitalB for clarity.
- A lookup file (JSON format) is also uploaded to the same container within a folder called Lookup. This file provides metadata and configuration for the data loading process.
Data Sinks:
- Azure SQL Databases: sqldb-hospital-a and sqldb-hospital-b. These databases are assumed to be pre-existing.
Pipeline Creation Steps:
- Creation of Linked Services:
  - For ADLS Gen2 (source): the linked service ls_adlsgen2 is used.
  - For Azure SQL DB (sink): the existing linked service ls_azuresqldb is used.
- Creation of Datasets: parameterized datasets for the CSV source files and the SQL sink tables.
Pipeline Configuration Steps:
- Add a Lookup Activity: reads the lookup file (JSON). The "First row only" option is set to false so that all rows are read.
- Add a ForEach Activity: iterates over the output of the Lookup activity. The Items property is set to @activity('Lk_file_name').output.value.
- Configure the ForEach Activity: inside the ForEach activity, a Copy activity copies each CSV file into its corresponding SQL table (an assumed layout of the lookup rows is sketched after this list).
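A hedged illustration of what each row of the JSON lookup file is assumed to contain: enough metadata for the ForEach/Copy step to map a source folder and file to a target SQL table. The field names here are assumptions, not the actual file layout.

```python
# Hypothetical lookup rows; @activity('Lk_file_name').output.value iterates
# over rows shaped like these, one Copy activity run per row.
lookup_rows = [
    {"sourcefolder": "HospitalA", "sourcefile": "patients.csv",
     "targetschema": "dbo", "targettable": "patients"},
    {"sourcefolder": "HospitalB", "sourcefile": "providers.csv",
     "targetschema": "dbo", "targettable": "providers"},
]

for row in lookup_rows:
    print(row["sourcefolder"], "->", row["targetschema"] + "." + row["targettable"])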
Securing the Pipeline with Key Vault and Mount Points: This section details how Key Vault and mount points are used to optimize data pipeline security and organization.
- 0.1 Securing the Pipeline with Key Vault and Mount Points: Mount points (landing, bronze, silver, gold, configs) improve data organization, enhance security by storing sensitive information in Azure Key Vault, simplify configuration, and enable flexible scaling.
- 0.2 Key Vault Setup:
  - Create Key Vault (kv-healthcare-dbs-azure).
  - Add Secrets:
    - sqldb-password (SQL Database access key)
    - adls-access-key (ADLS Gen2 access key)
    - databricks-access-token (Databricks workspace access token)
  - Create a Databricks Secret Scope (e.g., databricks-kv) linked to the Key Vault.
  - Implement granular access control (permissions) by registering applications in Azure AD and assigning specific Key Vault permissions.
- 0.3 Mounting Storage with Secure Access: (code snippet to be included in the project repository; a minimal, hedged sketch follows).
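A minimal sketch of the mounting step, assuming the container names listed in 0.1 and the secret names from 0.2; the exact mount configuration in the repository may differ.

```python
# Mount the ADLS Gen2 containers using the access key stored in the
# Key Vault-backed secret scope "databricks-kv".
storage_account = "adlshealthcareprojectdev"  # storage account from the project setup
adls_key = dbutils.secrets.get(scope="databricks-kv", key="adls-access-key")

for container in ["landing", "bronze", "silver", "gold", "configs"]:
    mount_point = f"/mnt/{container}"
    # Skip containers that are already mounted.
    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
            mount_point=mount_point,
            extra_configs={
                f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": adls_key
            },
        )
```

Storing the access key in Key Vault and reading it through the secret scope keeps credentials out of notebook code and source control.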
Data Ingestion Strategy Rationale: The initial data ingestion from Azure SQL databases to the Bronze layer in ADLS Gen2 is handled by Azure Data Factory (ADF).
- Due to the potential for large datasets in tables like Transactions, Encounters, and Patients, an incremental load strategy is implemented within the ADF pipelines for these specific tables. Only new or updated records since the last successful load are queried from the SQL databases, minimizing data transfer and processing time. This is typically achieved using a timestamp or sequence number column (the watermark column) in the source tables (a minimal sketch of the watermark pattern follows this list).
- For smaller dimension tables like Providers and Departments, a full load strategy is employed. Since these tables are not expected to change as frequently or contain as much data, a full load is simpler to implement and does not introduce significant performance overhead.
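A hedged illustration of the watermark pattern. The actual ingestion is performed by ADF with a parameterized source query; this PySpark JDBC read only mirrors the filter, and the server name, user, and ModifiedDate watermark column are assumptions.

```python
# The last successful watermark would normally come from the audit/load log,
# not be hard-coded as it is here.
last_watermark = "2024-01-31 23:59:59"

incremental_query = f"""
    SELECT *
    FROM dbo.Transactions
    WHERE ModifiedDate > '{last_watermark}'
"""

df = (
    spark.read.format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=sqldb-hospital-a")
    .option("query", incremental_query)
    .option("user", "<sql-user>")
    .option("password", dbutils.secrets.get(scope="databricks-kv", key="sqldb-password"))
    .load()
)
```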
- 1.1 ADF pipeline for ingesting EMR data from Azure SQL to Bronze:
  - Uses parameterized queries and metadata-driven configuration (load patterns stored in the ttadlsdev ADLS Gen2 storage at emr/load_config.csv).
  - Loads tables based on the load_config file and inserts load activity results into an audit.load_logs Delta table (a hedged sketch of the assumed config layout and audit logging follows this section).
  - Uses Linked Services for connections to:
    - Azure SQL DB (parameterized for both hospitals)
    - ADLS Gen2
    - Delta Lake (for the audit table)
    - Key Vault
    - Databricks
  - Uses Datasets with parameters for database name, schema name, table name, config file path, target data path, and audit table details.
  - Pipeline Activities:
    - Lookup Activity: reads the config file (configs/emr/load_config.csv).
    - ForEach Activity: iterates over each entity in the config file and runs the ingestion pipelines for the tables in parallel.
      - Get Metadata Activity: checks whether the file already exists in the Bronze folder.
      - If Condition (File Exists): moves existing files to an archive folder.
      - If Condition (is_active Flag): checks the is_active flag in the config file and executes the data-loading pipeline if is_active is 1.
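A hedged sketch of an assumed layout for emr/load_config.csv and the kind of entry the pipeline might append to the audit.load_logs Delta table. The column names used here (database, tablename, loadtype, watermark, is_active, targetpath, and the audit fields) are assumptions, not the exact schema.

```python
from pyspark.sql import Row

# Assumed shape of the metadata-driven config that the Lookup activity reads.
config_rows = [
    Row(database="sqldb-hospital-a", tablename="dbo.transactions",
        loadtype="Incremental", watermark="ModifiedDate", is_active="1",
        targetpath="hosa"),
    Row(database="sqldb-hospital-b", tablename="dbo.providers",
        loadtype="Full", watermark="", is_active="1",
        targetpath="hosb"),
]
spark.createDataFrame(config_rows).show(truncate=False)

# Append one load-activity result to the audit table (assumed columns).
audit_entry = [Row(data_source="hos-a", tablename="dbo.transactions",
                   numberofrowscopied=12345, watermarkcolumnname="ModifiedDate",
                   loaddate="2024-02-01")]
spark.createDataFrame(audit_entry).write.format("delta") \
    .mode("append").saveAsTable("audit.load_logs")
```

The audit table is what allows the incremental loads to resume from the last successful watermark and gives operators a run history per table.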
- 1.2 Ingesting Claim and CPT code data from Landing to Bronze:
  - Uses Databricks notebooks to ingest data from the landing folders (/claims and /cptcodes) into the Bronze folder in Parquet format with overwrite mode (a minimal sketch follows this section).
  - Performs minor transformations (column name modifications, addition of a data source column).
  - Creates temporary views for data quality checks.
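A minimal sketch of the claims ingestion, assuming mounted landing and bronze paths and a simple lower-casing of column names; the actual notebook may normalize names differently.

```python
from pyspark.sql import functions as F

# Read the monthly claims flat files from the landing mount (assumed path).
claims_df = spark.read.csv("/mnt/landing/claims/*.csv", header=True, inferSchema=True)

# Minor transformation: lower-case the column names.
for col_name in claims_df.columns:
    claims_df = claims_df.withColumnRenamed(col_name, col_name.strip().lower())

# Tag each record with its data source before landing it in Bronze.
claims_df = claims_df.withColumn("datasource", F.lit("claims_flat_file"))

# Write to the Bronze layer in Parquet with overwrite mode.
claims_df.write.mode("overwrite").parquet("/mnt/bronze/claims")

# Temporary view used by the downstream data quality checks.
claims_df.createOrReplaceTempView("claims_bronze")
```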
- 1.3 Ingesting NPI and ICD data from public API to Bronze: (code snippet to be included in the project repository; a hedged sketch follows this section).
- Extracts data from public APIs.
- Defines schemas for each column.
- Creates DataFrames with defined schemas and writes them in Parquet format (append mode for ICD, overwrite mode for NPI).
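A hedged sketch of the ICD ingestion path: pull data from a public API, build a DataFrame with an explicit schema, and write it to Bronze in append mode (NPI would follow the same pattern with overwrite mode). The endpoint URL, response shape, and field names are assumptions.

```python
import requests
from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema for the ICD DataFrame (assumed columns).
icd_schema = StructType([
    StructField("icd_code", StringType(), True),
    StructField("icd_code_type", StringType(), True),
    StructField("code_description", StringType(), True),
])

# Hypothetical endpoint; the real API, pagination, and response shape differ.
response = requests.get("https://example.org/icd/api/codes")
records = [
    (item.get("code"), "ICD-10", item.get("description"))
    for item in response.json()
]

icd_df = spark.createDataFrame(records, schema=icd_schema)

# Append mode for ICD, as described above; NPI uses overwrite instead.
icd_df.write.mode("append").parquet("/mnt/bronze/icd_codes")
```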
Data in Bronze Layer:
Transformation Logic - Using Databricks notebooks:
- Providers: full load, overwrite using Delta tables without SCD. Unions data from both hospitals, creates a schema with an is_quarantined column for data quality checks, and performs data quality checks for null values and duplicate Provider IDs.
- Departments: full load, overwrite using Delta tables without SCD. Unions data from both hospitals with transformations for mismatched department IDs, creates a schema with is_quarantined, and performs data quality checks for null values.
- Patients: SCD2 and incremental load using Delta tables. Unions data from both hospitals with schema adjustments, performs data quality checks for null values, loads data into the Silver layer, and implements SCD2 using MERGE INTO with inserted_date, modified_date, and is_current columns (a minimal MERGE sketch follows this list).
- Transactions: SCD2 and incremental load using Delta table format.
- Encounters: SCD2 and incremental load using Delta table format.
- Claims & CPT: SCD2 and incremental load using Delta table format.
- NPI & ICD codes: SCD2 and incremental load using Delta table format.
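A minimal SCD2 MERGE sketch for the patients dimension, following the standard Delta Lake pattern. The table name (silver.patients), key (patient_key), compared attribute (address), and the "updates" temp view holding the cleansed incoming batch are assumptions; inserted_date, modified_date, and is_current match the columns described above.

```python
spark.sql("""
    MERGE INTO silver.patients AS target
    USING (
        -- New versions of changed patients get a NULL merge_key so they fall
        -- through to the INSERT branch; everything else matches normally.
        SELECT u.patient_key AS merge_key, u.* FROM updates u
        UNION ALL
        SELECT NULL AS merge_key, u.*
        FROM updates u
        JOIN silver.patients t
          ON u.patient_key = t.patient_key
        WHERE t.is_current = true AND u.address <> t.address
    ) AS staged
    ON target.patient_key = staged.merge_key
    WHEN MATCHED AND target.is_current = true AND target.address <> staged.address THEN
      UPDATE SET is_current = false,
                 modified_date = current_timestamp()
    WHEN NOT MATCHED THEN
      INSERT (patient_key, address, inserted_date, modified_date, is_current)
      VALUES (staged.patient_key, staged.address,
              current_timestamp(), current_timestamp(), true)
""")
```

The single MERGE closes out the previous version of a changed patient and inserts the new version as the current row, preserving history for trend analysis.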
Data in the Gold layer (fact and dimension tables) consists of the latest, non-quarantined records (is_quarantined = False).
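A minimal sketch of how a Gold dimension could be derived from Silver under that rule; the table names are assumptions.

```python
# Keep only the latest (is_current) and non-quarantined records for Gold.
gold_dim_patient = spark.sql("""
    SELECT *
    FROM silver.patients
    WHERE is_current = true AND is_quarantined = false
""")

gold_dim_patient.write.format("delta").mode("overwrite") \
    .saveAsTable("gold.dim_patient")
```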
- Implement more comprehensive data quality rules and monitoring.
- Implement data lineage tracking.
- Enhance CI/CD pipeline with automated testing and deployment.
- Implement more sophisticated performance tuning and optimization strategies.