The goal of this project is to build a PostgreSQL database utilizing the data on users activity and songs metadata. Building the database helps us do complex analytics regarding users activity as well as song play analysis.
- Songs metadata: collection of JSON files that describes the songs such as title, artist name, year, etc.
- Logs data: collection of JSON files where each file covers the users activities over a given day.
We’ll build the database by optimizing the tables around efficient reads for complex queries. To do that, Star schema will be used utilizing dimensional modeling as follows:
- Fact table: songplays.
- Dimensions tables: songs, artist, users, time.
The three most important advantages of using Star schema are:
- Denormalized tables.
- Simplified queries.
- Fast aggregation.
The source code is available in three separate Python scripts. Below is a brief description of the main files:
sql_queries.pyhas all the queries needed to both create/drop tables for the database as well as a SQL query to get song_id and artist_id from other tables since they are not provided in logs dataset.
create_tables.pycreates the database, establish the connection and creates/drops all the tables required using sql_queries module.
etl.pybuild the pipeline that extracts the data from JSON files, does some transformation (such as adding different time attributes from timestamp) and then insert all the data into the corresponding tables.
Therefore, we first run
etl.py to create the database, create tables, and then insert the data using the ETL pipeline.
%load_ext sql %sql postgresql://student:firstname.lastname@example.org/sparkifydb %%sql SELECT COUNT(*) from songplays;
1 rows affected.
Out: count 6820
Check out the project on github.