An airline company called Spark Airways receives JSON flight data daily in its GCS bucket. Each record contains fields such as id, flight_date, airline_code, flight_num, source_airport, destination_airport, departure_time, departure_delay, arrival_time, arrival_delay, airtime, and distance. Design an ETL pipeline that ingests this flight data daily and produces average flight delays by flight number and by distance as tables in BigQuery.
- Spark SQL
- PySpark
- Airflow (Cloud Composer)
- GCS (Google Cloud Storage)
- BigQuery
- Shell
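
For reference, the daily JSON records described above could be read with an explicit schema rather than relying on Spark's schema inference. The field names come from the problem statement; the types and units in this sketch are assumptions.

```python
# Illustrative PySpark schema for the daily flight JSON.
# Field names come from the problem statement; the types are assumptions.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

flights_schema = StructType([
    StructField("id", StringType()),
    StructField("flight_date", StringType()),
    StructField("airline_code", StringType()),
    StructField("flight_num", StringType()),
    StructField("source_airport", StringType()),
    StructField("destination_airport", StringType()),
    StructField("departure_time", StringType()),
    StructField("departure_delay", DoubleType()),   # assumed minutes
    StructField("arrival_time", StringType()),
    StructField("arrival_delay", DoubleType()),      # assumed minutes
    StructField("airtime", DoubleType()),            # assumed minutes
    StructField("distance", DoubleType()),           # assumed miles
])
```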
- Create the BigQuery tables avg_delays_by_flight_nums and avg_delays_by_distance with a shell script that uses the bq command-line tool (a table-creation sketch follows this list).
- Create an ephemeral Dataproc cluster.
- Load the daily JSON file as a DataFrame in a PySpark script (flights-etl.py) and apply Spark SQL transformations to the flight delay data to produce the output tables (see the ETL sketch below).
- Save the transformed data to the GCS output bucket.
- Load the transformed data from the GCS output bucket into the BigQuery partitioned tables.
- Delete the ephemeral Dataproc cluster and the transformed output data from the GCS output bucket.
- Use Apache Airflow (Cloud Composer) to create DAGs that automate the Spark ETL batch-processing job end to end (a DAG sketch orchestrating the cluster, job, load, and cleanup steps is shown last below).
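
The first step calls for a bq command-line shell script to create the two target tables. Since the rest of the pipeline code is Python, the sketch below shows an equivalent, illustrative version using the google-cloud-bigquery client instead; the project and dataset names, column types, and the choice of flight_date as the partitioning column are all assumptions.

```python
# create_tables.py -- illustrative alternative to the bq shell script.
# Project/dataset names, schemas, and the partition column are assumptions.
from google.cloud import bigquery

PROJECT = "spark-airways-demo"   # hypothetical project id
DATASET = "flights"              # hypothetical dataset

TABLE_SCHEMAS = {
    "avg_delays_by_flight_nums": [
        bigquery.SchemaField("flight_date", "DATE"),
        bigquery.SchemaField("airline_code", "STRING"),
        bigquery.SchemaField("flight_num", "STRING"),
        bigquery.SchemaField("avg_departure_delay", "FLOAT"),
        bigquery.SchemaField("avg_arrival_delay", "FLOAT"),
    ],
    "avg_delays_by_distance": [
        bigquery.SchemaField("flight_date", "DATE"),
        bigquery.SchemaField("distance", "FLOAT"),
        bigquery.SchemaField("avg_departure_delay", "FLOAT"),
        bigquery.SchemaField("avg_arrival_delay", "FLOAT"),
    ],
}

def create_partitioned_tables():
    client = bigquery.Client(project=PROJECT)
    for name, schema in TABLE_SCHEMAS.items():
        table = bigquery.Table(f"{PROJECT}.{DATASET}.{name}", schema=schema)
        # Partition each table on flight_date so every daily load lands in its own partition.
        table.time_partitioning = bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY, field="flight_date"
        )
        client.create_table(table, exists_ok=True)

if __name__ == "__main__":
    create_partitioned_tables()
```

Whether done with bq mk or the client library, this setup only needs to run once before the first DAG run; the daily loads then append into the date partitions.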
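
A minimal sketch of what flights-etl.py might look like, assuming the input and output GCS paths are passed as arguments, the aggregations keep flight_date so results land in the right partition, and the staged output is written as newline-delimited JSON; these choices are assumptions, not requirements from the brief.

```python
# flights-etl.py -- minimal sketch of the daily PySpark job.
# Bucket paths, grouping columns, and the output format are illustrative assumptions.
import sys
from pyspark.sql import SparkSession

def main(input_path, output_path):
    spark = SparkSession.builder.appName("flights-etl").getOrCreate()

    # Load the daily JSON drop from the GCS input bucket.
    flights = spark.read.json(input_path)
    flights.createOrReplaceTempView("flights")

    # Average departure/arrival delays per flight number.
    avg_by_flight_num = spark.sql("""
        SELECT flight_date,
               airline_code,
               flight_num,
               AVG(departure_delay) AS avg_departure_delay,
               AVG(arrival_delay)   AS avg_arrival_delay
        FROM flights
        GROUP BY flight_date, airline_code, flight_num
    """)

    # Average delays per flight distance.
    avg_by_distance = spark.sql("""
        SELECT flight_date,
               distance,
               AVG(departure_delay) AS avg_departure_delay,
               AVG(arrival_delay)   AS avg_arrival_delay
        FROM flights
        GROUP BY flight_date, distance
    """)

    # Stage the results in the GCS output bucket for the BigQuery load step.
    avg_by_flight_num.write.mode("overwrite").json(output_path + "/avg_delays_by_flight_nums")
    avg_by_distance.write.mode("overwrite").json(output_path + "/avg_delays_by_distance")

    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```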
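
Finally, a compact Cloud Composer DAG sketch that wires the steps together: create the ephemeral Dataproc cluster, submit the PySpark job, load the staged output into the partitioned BigQuery tables, then delete the cluster and clean up the staging prefix. Project, bucket, and cluster names, machine sizes, and the staging layout are assumptions.

```python
# flights_etl_dag.py -- illustrative Cloud Composer DAG; names and paths are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

PROJECT = "spark-airways-demo"        # hypothetical project id
REGION = "us-central1"
CLUSTER = "flights-etl-ephemeral"
INPUT_BUCKET = "spark-airways-input"  # hypothetical buckets
OUTPUT_BUCKET = "spark-airways-output"
DATASET = "flights"

with DAG(
    dag_id="flights_etl",
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id=PROJECT,
        region=REGION,
        cluster_name=CLUSTER,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    run_etl = DataprocSubmitJobOperator(
        task_id="run_flights_etl",
        project_id=PROJECT,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {
                "main_python_file_uri": f"gs://{OUTPUT_BUCKET}/scripts/flights-etl.py",
                "args": [
                    f"gs://{INPUT_BUCKET}/flights/{{{{ ds }}}}/*.json",
                    f"gs://{OUTPUT_BUCKET}/staging/{{{{ ds }}}}",
                ],
            },
        },
    )

    load_by_flight_num = GCSToBigQueryOperator(
        task_id="load_avg_delays_by_flight_nums",
        bucket=OUTPUT_BUCKET,
        source_objects=["staging/{{ ds }}/avg_delays_by_flight_nums/*.json"],
        destination_project_dataset_table=f"{PROJECT}.{DATASET}.avg_delays_by_flight_nums",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
    )

    load_by_distance = GCSToBigQueryOperator(
        task_id="load_avg_delays_by_distance",
        bucket=OUTPUT_BUCKET,
        source_objects=["staging/{{ ds }}/avg_delays_by_distance/*.json"],
        destination_project_dataset_table=f"{PROJECT}.{DATASET}.avg_delays_by_distance",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id=PROJECT,
        region=REGION,
        cluster_name=CLUSTER,
        trigger_rule="all_done",  # tear the cluster down even if a load fails
    )

    cleanup_staging = GCSDeleteObjectsOperator(
        task_id="cleanup_staging_output",
        bucket_name=OUTPUT_BUCKET,
        prefix="staging/{{ ds }}/",
    )

    create_cluster >> run_etl >> [load_by_flight_num, load_by_distance]
    [load_by_flight_num, load_by_distance] >> delete_cluster >> cleanup_staging
```

The delete-cluster task uses trigger_rule="all_done" so the ephemeral cluster is removed even when an upstream task fails, which keeps the daily runs from leaking Dataproc resources.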