In this project, the Big Data processing framework Apache Spark is used to answer three queries on a COVID-19 dataset.
The queries' output is stored in MongoDB, a well-known NoSQL document datastore.
- PySpark
- PyMongo
```
pip3 install -r requirements.txt
```
```
python3 QueryTool.py -q [queryId] -m [month] -d [date] -c [country] -S [show dataset details] -h [help]
```
- -q: the query identifier.
- -m: the month to analyze.
- -d: the date to analyze.
- -c: the country to analyze.
- -S: show dataset details.
- -h: print the help menu.
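The options above can be parsed with the standard library's `argparse`. The sketch below mirrors the usage line; the long option names, types, and help strings are illustrative assumptions, not necessarily what QueryTool.py actually does.

```python
import argparse

def build_parser():
    # Flags mirror the usage line above; long names and defaults are
    # illustrative, not necessarily those of the real QueryTool.py.
    parser = argparse.ArgumentParser(description="Run a COVID-19 query with Spark")
    parser.add_argument("-q", "--query", type=int, required=True,
                        help="query identifier (1, 2 or 3)")
    parser.add_argument("-m", "--month", type=int, help="month to analyze (1-12)")
    parser.add_argument("-d", "--date", help="upper-bound date, e.g. 2020-12-31")
    parser.add_argument("-c", "--country", help="country to analyze")
    parser.add_argument("-S", "--show", action="store_true",
                        help="print dataset details")
    # -h/--help is added automatically by argparse
    return parser

args = build_parser().parse_args(["-q", "1", "-m", "3", "-S"])
print(args.query, args.month, args.show)
```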
For a given month (user input), compute, for each country, the average number of new daily cases in that month.
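The per-country average can be sketched in plain Python on a toy dataset; in the project the same aggregation would be a PySpark `groupBy`/`avg` over the real COVID-19 data, so the record layout here is an assumption.

```python
from collections import defaultdict

# Toy daily records: (country, month, new_cases). The real project reads
# these from the COVID-19 dataset with PySpark.
records = [
    ("Italy",  3, 100), ("Italy",  3, 140), ("Italy",  4, 80),
    ("France", 3,  90), ("France", 3, 110),
]

def avg_new_cases(records, month):
    """Average daily new cases per country for the given month."""
    totals = defaultdict(lambda: [0, 0])  # country -> [sum, count]
    for country, m, cases in records:
        if m == month:
            totals[country][0] += cases
            totals[country][1] += 1
    return {c: s / n for c, (s, n) in totals.items()}

print(avg_new_cases(records, 3))  # Italy: 120.0, France: 100.0
```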
This query, for a given month (user input), computes, for each country and for each day:
- The fraction of new deaths in the following week and in the following two weeks.
- The number of new cases on that day.
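One possible reading of this query (an assumption: "fraction" means deaths in the following window divided by that day's new cases) can be sketched on a toy per-day series for a single country:

```python
def death_fractions(new_cases, new_deaths, day):
    """new_cases / new_deaths: per-day lists for one country.

    Returns that day's new cases and the fraction of deaths in the
    following week and the following two weeks, relative to those cases.
    """
    cases = new_cases[day]
    next_week = sum(new_deaths[day + 1: day + 8])      # next 7 days
    next_two_weeks = sum(new_deaths[day + 1: day + 15])  # next 14 days
    return cases, next_week / cases, next_two_weeks / cases

cases = [100] * 20
deaths = [2] * 20
print(death_fractions(cases, deaths, 0))  # (100, 0.14, 0.28)
```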
This query, for a given country (user input), computes the ratio of cumulative recovered cases to cumulative new cases up to a given date (user input).
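The cumulative ratio can be sketched as a running sum over one country's sorted daily records; the tuple layout and ISO date strings are assumptions standing in for the project's PySpark pipeline.

```python
def recovered_over_cases(daily, until):
    """daily: list of (date, new_cases, new_recovered) for one country,
    sorted by date. Returns cumulative recovered / cumulative new cases
    up to and including `until` (ISO dates compare correctly as strings)."""
    cases = recovered = 0
    for date, c, r in daily:
        if date > until:
            break
        cases += c
        recovered += r
    return recovered / cases

daily = [
    ("2020-03-01", 100, 10),
    ("2020-03-02", 50, 40),
    ("2020-03-03", 30, 30),
]
print(recovered_over_cases(daily, "2020-03-02"))  # 50 / 150
```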