This repository provides a testing ground for Dataform features, including CLI, Core, BigQuery integration, and SQLFluff integration. It requires a Google Cloud project with several APIs enabled (BigQuery, Dataform, etc.) and access to the public Stack Overflow dataset. The development environment setup involves Python, Node.js, the Google Cloud SDK, and the Dataform CLI. Detailed instructions for authentication, project setup, and compilation are included. The README also provides a comprehensive list of references to Dataform and Google Cloud documentation.
- Navigate to the Google Cloud Console
- Create a Google Cloud Project
- Enable billing for your project
- Enable the following APIs (they should be enabled by default, but double-check; a `gcloud` alternative is sketched after the list):
  ```
  analyticshub.googleapis.com
  bigquery.googleapis.com
  bigqueryconnection.googleapis.com
  bigquerydatapolicy.googleapis.com
  bigquerydatatransfer.googleapis.com
  bigquerymigration.googleapis.com
  bigqueryreservation.googleapis.com
  bigquerystorage.googleapis.com
  dataform.googleapis.com
  dataplex.googleapis.com
  storage-component.googleapis.com
  storage-api.googleapis.com
  storage.googleapis.com
  ```
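If you prefer the command line, the same APIs can be enabled in one pass with `gcloud` (a sketch; assumes the Google Cloud SDK is installed and authenticated against your project):

```bash
# Enable all APIs required by this repository (run once per project)
gcloud services enable \
  analyticshub.googleapis.com \
  bigquery.googleapis.com \
  bigqueryconnection.googleapis.com \
  bigquerydatapolicy.googleapis.com \
  bigquerydatatransfer.googleapis.com \
  bigquerymigration.googleapis.com \
  bigqueryreservation.googleapis.com \
  bigquerystorage.googleapis.com \
  dataform.googleapis.com \
  dataplex.googleapis.com \
  storage-component.googleapis.com \
  storage-api.googleapis.com \
  storage.googleapis.com
```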
- Install Python (v3.10 or higher)
- Install Node.js (v20 or higher)
- Install Google Cloud SDK
- Install Dataform CLI (v3.0.8 or higher; installing it globally is recommended, as sketched below)
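A minimal sketch of the global CLI install via npm (assumes Node.js v20+ is already installed; pin the version to suit):

```bash
# Install the Dataform CLI globally
npm install -g @dataform/cli

# Confirm the installed version (should be 3.0.8 or higher)
npm ls -g @dataform/cli
```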
- Authenticate with Google Cloud SDK using the following commands:
  ```bash
  # Opens a browser window to authenticate with your Google Account
  gcloud auth login

  # Point gcloud at the project you created earlier (replace <PROJECT_ID>)
  gcloud config set project <PROJECT_ID>

  # Set up Application Default Credentials (ADC) for your project
  gcloud auth application-default login

  # Set the quota project used by ADC
  gcloud auth application-default set-quota-project <PROJECT_ID>
  ```
- Clone your dataform repository
- Navigate to your dataform repository:

  ```bash
  cd <REPOSITORY_PATH>
  ```
- Set up Dataform authentication:

  ```bash
  dataform init-creds
  ```
- You will then be prompted to select your region: 1 for US, 2 for EU, or 3 for Other; select the appropriate region
- If you select 3, you will be prompted to enter a region, e.g. `australia-southeast1`
- You will then be prompted for ADC (default) or JSON Key (Service Account Keyfile); select the appropriate option
- If you select JSON Key, you will be prompted to enter the path to the JSON Keyfile
- If you select ADC, you will be prompted to enter your Google Cloud Billing Project ID
- OR create a `.df-credentials.json` file in the root of the directory with the following content for ADC:

  ```json
  {
    "projectId": "<PROJECT_ID>",
    "location": "<REGION>"
  }
  ```
- Update your `workflow_settings.yaml` file with the appropriate settings for your project (mandatory changes; a sketch follows this list):
  - Update the `defaultProject` variable with your created Google Cloud Project ID
  - Update the `defaultLocation` variable with your region (e.g. `US`)
  - Update the `INPUT_BUCKET_1` variable with your Google Cloud Storage Bucket name (e.g. `gs://<BUCKET_NAME>`)
  - Update the `INPUT_BUCKET_2` variable with your Google Cloud Storage Bucket name (e.g. `gs://<BUCKET_NAME>`) (required for seeding sample data)
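For orientation, a minimal sketch of the relevant settings (the `defaultDataset` value and the placement of the `INPUT_BUCKET_*` variables under `vars:` are assumptions; match them to the file shipped in this repository):

```yaml
defaultProject: <PROJECT_ID>  # your Google Cloud Project ID
defaultLocation: US           # region; US is needed for the public Stack Overflow dataset
defaultDataset: dataform      # assumed dataset name; keep whatever this repository uses
vars:
  INPUT_BUCKET_1: gs://<BUCKET_NAME>  # input bucket
  INPUT_BUCKET_2: gs://<BUCKET_NAME>  # bucket used when seeding sample data
```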
- Install dataform dependencies:

  ```bash
  dataform install
  ```

  (This will install the necessary dependencies for your dataform project)
- Compile your dataform project:

  ```bash
  dataform compile
  ```

  (This will compile your dataform project; if there are no errors, you are good to go)
- Create a Google Cloud Storage Bucket (this is where the sample data will be stored)
- Navigate to the root of the repository
- Execute `./scripts/seed_sample_data.sh` to seed sample data into the Google Cloud Storage Bucket you created
- Provide your Google Cloud Storage Bucket name when prompted (i.e. `<BUCKET_NAME>`; no need to include `gs://`)
- The script will seed sample data into your Google Cloud Storage Bucket (a quick verification sketch follows this list)
- Update the `workflow_settings.yaml` file, setting the `INPUT_BUCKET_2` variable to the Google Cloud Storage Bucket name you created
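Once the script has run, a quick way to confirm the seed landed (a sketch; assumes `gsutil` from the Cloud SDK and that the script writes objects to the bucket root):

```bash
# List the seeded objects in your bucket
gsutil ls gs://<BUCKET_NAME>/
```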
- The public dataset `bigquery-public-data.stackoverflow` is used in this repository. You will need to set your region to `US` to access this dataset in both the `workflow_settings.yaml` and the `.df-credentials.json` file; a quick access check is sketched below.
- If your default region is set to `US`, please ensure your Google Cloud Storage Bucket reflects this
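To sanity-check access to the public dataset, a hedged sketch using the `bq` CLI (`posts_questions` is one of the tables in the public Stack Overflow dataset):

```bash
# Count rows in one of the public Stack Overflow tables (must run in the US region)
bq --location=US query --use_legacy_sql=false \
  'SELECT COUNT(*) AS n FROM `bigquery-public-data.stackoverflow.posts_questions`'
```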
- Documentation
- Best Practices
- Troubleshooting
- Core Github
- API Reference
- Core Reference
- CLI Reference
- Stackoverflow Dataform Reference
- Dataform Core - VSCode Extension
```text
.
├── .vscode // VSCode window settings (optional)
├── .github // Github Actions workflows
├── .vscode-dataform-tools // VSCode Dataform extension tools
│   └── .sqlfluff // SQLFluff configuration for Dataform
├── definitions // Dataform definitions
│   ├── actions.yaml // Action definitions
│   ├── 0_sources // Data sources (raw data)
│   ├── 1_intermediate // Intermediate/staging ("silver") tables
│   ├── 2_outputs // Output/final ("gold") tables
│   ├── 3_assertions // Data quality assertions
│   ├── 4_tests // Tests (unit, etc.)
│   ├── 5_extras // Operations, functions, scripts, etc.
│   └── 6_schemas // BigQuery JSON schema files (optional)
├── includes // Dataform includes (reusable JS code)
├── scripts // Shell scripts (e.g., seeding sample data)
├── .gitignore // Files and directories ignored by Git
├── LICENSE // Project license
├── README.md // Project description and documentation
└── workflow_settings.yaml // Dataform workflow settings
```
- Dataform Tooling - A collection of tools and utilities for integrating Dataform into your new or existing environments
- Dataform Terraform - An example Terraform Module for deploying Dataform projects to Google Cloud Platform
The following tools are recommended for improved code quality and debugging, and provide additional features for an enhanced development experience:
- Dataform Tools - VSCode Extension - Provides syntax highlighting, code completion, and linting for Dataform files
- Error Lens - Highlights inline errors and warnings in your code
- SQLFluff - Python Package - Linting and formatting for SQL files
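A minimal sketch of installing and running SQLFluff (assumes linting plain SQL with the BigQuery dialect; `.sqlx` files need extra configuration, such as the `.sqlfluff` file under `.vscode-dataform-tools`, to lint cleanly):

```bash
# Install SQLFluff
pip install sqlfluff

# Lint SQL files using the BigQuery dialect
sqlfluff lint definitions/ --dialect bigquery
```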
This repository is licensed under the MIT License - see the LICENSE file for details.
Below is a list of items that I plan to implement in the future to provide a more comprehensive testing ground for Dataform features:
- Get the `notebook` action in `actions.yaml` to actually create the notebook
- Add more BigQuery config examples (e.g. `partitionBy`, `clusterBy`, `expirationTime`, `labels`, `tags`, `policyTags`, etc.)
- Add examples of the `JS` implementation of `.sqlx` action types (a sketch of what this could look like follows the list):
  - `declare`
  - `view`
  - `table`
  - `incremental`
  - `operations`
  - `tests`
  - `assertions`
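For a taste of the planned `JS` examples, a hedged sketch using the Dataform Core JS API (table and column names are hypothetical; this is not code from this repository):

```js
// definitions/example_js_actions.js

// declare: register an existing BigQuery table as a source
declare({ schema: "raw", name: "users" });

// view (publish also supports the "table" and "incremental" types)
publish("stg_users", { type: "view" }).query(
  (ctx) => `SELECT id AS user_id, display_name FROM ${ctx.ref("users")}`
);

// operations: run arbitrary SQL statements
operate("example_operation").queries([
  "SELECT CURRENT_TIMESTAMP() AS run_at",
]);

// assertion: fail if any user_id is NULL
assert("stg_users_user_id_not_null").query(
  (ctx) => `SELECT user_id FROM ${ctx.ref("stg_users")} WHERE user_id IS NULL`
);
```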