Sparkify Music Data Warehouse in AWS Redshift

This project is part of the Udacity Data Engineering Nanodegree curriculum. Sparkify is a music streaming startup that has grown its user base and song database and wants to move its processes and data onto the AWS cloud for better performance. The dataset resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.

As data engineers, we built an ETL pipeline that extracts the data from S3, stages it in AWS Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.

Project Dataset

We used the datasets provided by Udacity, which reside in AWS S3 at the following locations:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
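For illustration, a single song file might contain one JSON record along these lines (the field names match the schema below; the values here are invented, not taken from the dataset):

    {
        "num_songs": 1,
        "artist_id": "ARABC1187B9A1234",
        "artist_latitude": 37.77,
        "artist_longitude": -122.42,
        "artist_location": "San Francisco, CA",
        "artist_name": "Example Artist",
        "song_id": "SOABCD12A8C130001",
        "title": "Example Song",
        "duration": 215.5,
        "year": 2004
    }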

Schema for the Song Play Analysis

Fact Table

  1. songplays - records in event data associated with song plays, i.e. records with page set to NextSong (see the sketch after this list)
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  1. users - users in the app
    • user_id, first_name, last_name, gender, level
  2. songs - songs in music database
    • song_id, title, artist_id, year, duration
  3. artists - artists in music database
    • artist_id, name, location, latitude, longitude
  4. time - timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
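As a sketch of how the fact table might be declared in sql_queries.py, assuming Redshift column types that match the fields above (the exact types and constraints in the repository may differ):

    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id INT IDENTITY(0,1) PRIMARY KEY,
            start_time  TIMESTAMP NOT NULL SORTKEY,
            user_id     INT NOT NULL,
            level       VARCHAR,
            song_id     VARCHAR DISTKEY,
            artist_id   VARCHAR,
            session_id  INT,
            location    VARCHAR,
            user_agent  VARCHAR
        );
    """

Distributing on song_id and sorting on start_time anticipates joins with the songs dimension and time-range queries; the distribution-style section below explains both keywords.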

Project Template

In this repository, the workspace contains the following files:

  • create_table.py: This Python script creates the fact and dimension tables for the star schema in AWS Redshift (a simplified sketch follows this list)
  • etl.py: This Python script loads the JSON files from S3 into the staging tables on AWS Redshift and then processes that data into the analytics tables on AWS Redshift
  • sql_queries.py: This is where we define the SQL statements, which are imported into the two scripts above
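As a rough sketch of how these pieces fit together, create_table.py might look like the following (simplified; the configuration keys and query-list names are assumptions based on the description above):

    import configparser
    import psycopg2
    from sql_queries import create_table_queries, drop_table_queries

    def main():
        # Read the Redshift endpoint and credentials from dwh.cfg
        config = configparser.ConfigParser()
        config.read('dwh.cfg')

        conn = psycopg2.connect(
            "host={} dbname={} user={} password={} port={}".format(
                *config['CLUSTER'].values()))
        cur = conn.cursor()

        # Drop any existing tables, then create them fresh
        for query in drop_table_queries + create_table_queries:
            cur.execute(query)
            conn.commit()

        conn.close()

    if __name__ == "__main__":
        main()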

Steps followed for the Data Warehouse Setup:

Create table Schema:

  • Designed the star schema for the fact and dimension tables
  • Wrote the SQL queries for CREATE and DROP in the sql_queries.py file
  • Used the create_table.py file to connect to the database and create the tables
  • Launched the Redshift cluster and created an IAM role with read access to S3
  • Added the Redshift database and IAM role info to the dwh.cfg file (a sample layout follows this list)
  • Ran the scripts
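A dwh.cfg along these lines is assumed (section and key names are illustrative; keep real credentials out of version control):

    [CLUSTER]
    HOST=<redshift-cluster-endpoint>
    DB_NAME=dwh
    DB_USER=<user>
    DB_PASSWORD=<password>
    DB_PORT=5439

    [IAM_ROLE]
    ARN=<iam-role-arn>

    [S3]
    LOG_DATA=s3://udacity-dend/log_data
    SONG_DATA=s3://udacity-dend/song_data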

Build ETL Pipeline in Python:

  • Implemented the logic in the etl.py file to load the dataset from S3 into the staging tables on Redshift
  • Implemented the logic in the etl.py file to load the data from the staging tables into the analytics tables on Redshift (both kinds of queries are sketched below)
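A minimal sketch of the two kinds of statements involved, as they might appear in sql_queries.py (the staging table name, the region, and the <IAM_ROLE_ARN> placeholder are assumptions):

    # COPY loads the raw JSON from S3 into a staging table in parallel.
    staging_songs_copy = """
        COPY staging_songs FROM 's3://udacity-dend/song_data'
        CREDENTIALS 'aws_iam_role=<IAM_ROLE_ARN>'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto';
    """

    # INSERT ... SELECT transforms staged rows into an analytics table.
    song_table_insert = """
        INSERT INTO songs (song_id, title, artist_id, year, duration)
        SELECT DISTINCT song_id, title, artist_id, year, duration
        FROM staging_songs
        WHERE song_id IS NOT NULL;
    """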

Optimization of the Redshift Tables' Distribution Style

When we load data into a table, AWS Redshift distributes the rows of the table across the compute nodes according to the table's distribution style. When we run a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations.

When we create a table we can designate one of four distribution styles: AUTO, EVEN, KEY, or ALL. In the CREATE statements for the final tables we used DISTKEY and SORTKEY. SORTKEY specifies that a column is the sort key of the table; when data is loaded into the table, it is sorted by the one or more columns designated as the sort key.

Similarly, DISTKEY specifies the column to be used as the distribution key for the table. We can place DISTKEY after a column name or declare it at the table level in the table definition, as in the sketch below.
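Both placements are shown here on an illustrative songs table (Redshift accepts either form):

    # Column-level: mark the distribution key inline, next to the column.
    song_table_create = """
        CREATE TABLE songs (
            song_id   VARCHAR DISTKEY,
            title     VARCHAR,
            artist_id VARCHAR,
            year      INT,
            duration  FLOAT
        );
    """

    # Table-level: declare the style and key after the column list.
    song_table_create_alt = """
        CREATE TABLE songs (
            song_id   VARCHAR,
            title     VARCHAR,
            artist_id VARCHAR,
            year      INT,
            duration  FLOAT
        )
        DISTSTYLE KEY
        DISTKEY (song_id)
        SORTKEY (year);
    """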
