This project is part of the Udacity Data Engineering Nanodegree curriculum. Sparkify is a music streaming startup that has grown its user base and song database and wants to move its processes and data onto the AWS cloud for better performance. The dataset resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
As data engineers, we built an ETL pipeline that extracts the data from S3, stages it in AWS Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.
We used the dataset provided by Udacity, which resides in AWS S3 at the following links:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
From these datasets, we created a star schema with the following fact and dimension tables (an example analytics query follows the list):
- songplays - records in event data associated with song plays, i.e. records with page NextSong
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
- user_id, first_name, last_name, gender, level
- songs - songs in music database
- song_id, title, artist_id, year, duration
- artists - artists in music database
- artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
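For illustration only, the kind of question this star schema is designed to answer might look like the sketch below; the table and column names come from the schema above, but the query itself is our example and not part of the project code.

```python
# Hypothetical analytics query against the star schema: the ten most-played
# songs. Illustrative only; not part of the project scripts.
top_songs_query = """
    SELECT s.title, a.name AS artist, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs   s ON sp.song_id = s.song_id
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY s.title, a.name
    ORDER BY plays DESC
    LIMIT 10;
"""
```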
In this repository, the workspace contains the following files (a sketch of how they fit together follows the list):
- create_table.py - This is the Python script that creates the fact and dimension tables for the star schema in AWS Redshift
- etl.py - This is the Python script that loads the JSON files from S3 into the staging tables on AWS Redshift and then processes that data into the analytics tables
- sql_queries.py - This is where we define the SQL statements that are imported into the two scripts above
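As a rough sketch (not the project's exact code), create_table.py could wire these pieces together as follows; the dwh.cfg section and key names used here are assumptions.

```python
# Hypothetical sketch of create_table.py: read dwh.cfg, connect to the
# Redshift cluster, and run the DROP/CREATE statements from sql_queries.py.
# The [CLUSTER] section and its key names are assumptions.
import configparser

import psycopg2

from sql_queries import create_table_queries, drop_table_queries


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    cluster = config["CLUSTER"]

    conn = psycopg2.connect(
        host=cluster["host"],
        dbname=cluster["dbname"],
        user=cluster["user"],
        password=cluster["password"],
        port=cluster["port"],
    )
    cur = conn.cursor()

    # Drop any existing tables first so the script can be re-run cleanly
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()
```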
To build the warehouse, we completed the following steps:
- Designed the star schema for the fact and dimension tables
- Wrote the SQL queries for CREATE and DROP in the sql_queries.py file
- Used the create_table.py file to connect to the database and create the tables
- Launched the Redshift cluster and created the IAM role with read access to S3
- Added the Redshift database and IAM role info to the dwh.cfg file
- Ran the scripts
- Implemented the logic in etl.py to load the dataset from S3 into the staging tables on Redshift
- Implemented the logic in etl.py to load the data from the staging tables into the analytics tables on Redshift (see the sketch of the loading queries after this list)
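A hedged sketch of what those two loading steps could look like in sql_queries.py is below; the staging table layout, the 'auto' JSON mapping, the region, and the dwh.cfg key names are assumptions, while the bucket path comes from the dataset links above.

```python
# Hypothetical excerpts of sql_queries.py for the two loading steps.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

# Step 1: S3 -> staging table (executed by etl.py).
# The 'auto' JSON mapping and the region are assumptions; the project may
# use a JSONPaths file and a different region instead.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{}'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
""".format(config.get("IAM_ROLE", "ARN"))

# Step 2: staging table -> analytics table (executed by etl.py).
# The staging column names are assumptions based on the log JSON keys.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE page = 'NextSong' AND userId IS NOT NULL;
"""
```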
When we load data into a table, AWS Redshift distributes the rows of the table across the compute nodes according to the table's distribution style. When we run a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations.
When we create a table we can designate one of four distribution styles: AUTO, EVEN, KEY, or ALL. In the CREATE statements of the final tables we used DISTKEY and SORTKEY. SORTKEY specifies that a column is the sort key of the table; when data is loaded into the table, it is sorted by the one or more columns designated as the sort key.
Similarly, DISTKEY specifies the column to be used as the distribution key for the table. We can declare the DISTKEY either after a column name or as part of the table definition, as shown in the sketch below.
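For example, here is a sketch of the songplays fact table showing both forms; the choice of start_time as the sort key and song_id as the distribution key is an assumption for illustration, not necessarily the keys used in the final project.

```python
# Illustrative only: SORTKEY and DISTKEY declared at the column level.
# The specific key columns chosen here are assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP SORTKEY NOT NULL,  -- sort key on the column
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,             -- distribution key on the column
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

# Equivalent table-level form:
#   CREATE TABLE songplays ( ... columns ... )
#   DISTKEY (song_id)
#   SORTKEY (start_time);
```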