The Tuva Provider project combines and transforms messy public provider datasets into usable data. This project contains the transformations we use to create the clean datasets for users of the Tuva Project. We have made this project public to share our methodology and code.
You can easily load the cleaned provider data into your data warehouse by using the terminology seeds from The Tuva Project package.
Source data dependencies:
| Data Set | Updated by Source | Source |
|---|---|---|
| NPPES Data Dissemination | Monthly | https://download.cms.gov/nppes/NPI_Files.html |
| NUCC Health Care Provider Taxonomy | Semi-annually (January and July) | https://nucc.org/index.php/code-sets-mainmenu-41/provider-taxonomy-mainmenu-40/csv-mainmenu-57 |
| CMS Medicare Provider and Supplier Taxonomy Crosswalk | Annually | https://data.cms.gov/provider-characteristics/medicare-provider-supplier-enrollment/medicare-provider-and-supplier-taxonomy-crosswalk |
- This project is designed to run on Snowflake.
- You have dbt installed and configured (i.e., connected to your data warehouse). If you have not installed dbt, see the dbt documentation for installation instructions.
- You have created a database in your data warehouse for the output of this project.
- You have downloaded the source data and loaded it into staging tables in your data warehouse:
- NPPES NPI Data (Note: the source download comes zipped with many files; only the "npidata_pfile....csv" file is required.)
- NUCC Health Care Provider Taxonomy
- CMS Medicare Provider and Supplier Taxonomy Crosswalk
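As a sketch, staging the NPPES file in Snowflake might look like the following. The stage name (`@nppes_stage`), file name, and column list are illustrative placeholders, not part of this project; the real npidata file has several hundred columns.

```sql
-- Illustrative sketch only: stage name, file name, and columns are placeholders.
create schema if not exists nppes.raw_data;

create or replace table nppes.raw_data.nppes (
    npi varchar,
    entity_type_code varchar
    -- ... remaining NPPES columns ...
);

copy into nppes.raw_data.nppes
  from @nppes_stage/npidata_pfile.csv
  file_format = (type = csv skip_header = 1 field_optionally_enclosed_by = '"');
```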
Complete the following steps to configure the project to run in your environment.
- Clone this repo to your local machine or environment.
- Update the `dbt_project.yml` file:
  - Add the dbt profile connected to your data warehouse.
  - Update the variable `provider_database` to use the new database you created for this project (default is "nppes").
- Update the `models/_sources.yml` file:
  - Update the database where your source data has been loaded (default is "nppes").
  - Update the schema where your source data has been loaded (default is "raw_data").
  - If the source tables are named differently, add the table `identifier` property.
- Run `dbt build`.
- For Tuva Terminology seeds, we export this data as CSV and then load it to the Tuva Public Resources bucket in Amazon S3.
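As a sketch, the relevant `dbt_project.yml` and `models/_sources.yml` entries might look like the following. The profile name, source name, and table identifier are placeholders; the database and schema values shown are the defaults described above.

```yaml
# dbt_project.yml (excerpt) -- profile name is a placeholder
profile: your_profile_name
vars:
  provider_database: nppes   # database you created for this project's output
```

```yaml
# models/_sources.yml (excerpt) -- source/table names are placeholders
sources:
  - name: nppes
    database: nppes      # database where your source data was loaded
    schema: raw_data     # schema where your source data was loaded
    tables:
      - name: nppes
        identifier: my_npidata_table   # only needed if your table is named differently
```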
Here are some SQL examples for exporting the data from Snowflake:
- Standard Provider seed export:

```sql
copy into --YOUR_S3_URL.../provider.csv
  from NPPES.CLAIMS_DATA_MODEL.PROVIDER
  file_format = (type = csv field_optionally_enclosed_by = '"')
  storage_integration = --YOUR_INTEGRATION
  overwrite = true;
```

- Compressed Provider seed export:

```sql
copy into --YOUR_S3_URL.../provider_compressed.csv.gz
  from NPPES.CLAIMS_DATA_MODEL.PROVIDER
  file_format = (type = csv field_optionally_enclosed_by = '"' compression = gzip)
  header = true
  max_file_size = 4900000000
  overwrite = true
  single = true
  storage_integration = --YOUR_INTEGRATION;
```

- Standard Other Provider Taxonomy seed export:

```sql
copy into --YOUR_S3_URL.../other_provider_taxonomy.csv
  from NPPES.CLAIMS_DATA_MODEL.OTHER_PROVIDER_TAXONOMY
  file_format = (type = csv field_optionally_enclosed_by = '"')
  storage_integration = --YOUR_INTEGRATION
  overwrite = true;
```
The Tuva Project team maintains only the latest version of this project. We highly recommend you stay current with the latest version.
Have an opinion on the mappings? Notice any bugs when installing and running the project? If so, we highly encourage and welcome feedback! While we work on a formal process in GitHub, you can easily reach us on our Slack community.
Join our growing community of healthcare data practitioners on Slack!