Climate crawler is a project made for the Databases II (IC4302) course at the Costa Rica Institute of Technology.
In order to build and run this project, you need the following software:
- The prerequisites of pydoop specified in their documentation: http://crs4.github.io/pydoop/_pydoop1/installation.html#prerequisites
- Python virtual environment (venv)
- Hadoop installed, with its home path exported in `.bashrc` (the Hadoop bash directory contains the needed configuration)
- Sqoop 1.4.7 and mysql-connector-java
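As a quick sanity check that these prerequisites are wired up, a minimal Python sketch (assuming `HADOOP_HOME` is the variable exported in `.bashrc`; the repository's actual setup may differ):

```python
import os
import shutil

# Sanity-check the prerequisites listed above; the variable names are assumptions.
assert os.environ.get("HADOOP_HOME"), "HADOOP_HOME is not exported in .bashrc"
assert shutil.which("sqoop"), "sqoop was not found on the PATH"
import pydoop  # raises ImportError if pydoop or its prerequisites are missing
print("Environment looks OK; Hadoop at", os.environ["HADOOP_HOME"])
```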
The web crawler retrieves its data from https://en.tutiempo.net/climate/
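As a minimal sketch of the fetching step (assuming `requests` and `BeautifulSoup` are used, and that country pages are linked from the index; the CSS selector is a guess at the page layout, not the project's actual code):

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://en.tutiempo.net/climate/"

def fetch_country_links():
    """Download the climate index page and collect links to country pages."""
    response = requests.get(BASE_URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Hypothetical selector: the real page structure may differ.
    return [a["href"] for a in soup.select("a[href*='/climate/']")]

if __name__ == "__main__":
    for link in fetch_country_links():
        print(link)
```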
Hadoop is used to store and process the web crawler's output in HDFS. The MapReduce jobs are written with Pydoop, a high-level Python API for Hadoop (a minimal sketch follows the list below). The MapReduce jobs, run for every climate variable, are:
- The 10 countries with the highest overall averages
- The 10 countries with the lowest overall averages
- For each country, the year in which each variable reached its maximum
- For each country, the year in which each variable reached its minimum
- Average temperature for each continent, grouped in 10-year periods
- Per country, the station with the maximum values
- Per country, the station with the minimum values
- Per continent, the countries with the maximum values
- Per continent, the countries with the minimum values
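As a minimal sketch of how one such job looks in Pydoop (the record layout `country,year,temperature` and the class names are assumptions for illustration, not the project's actual code):

```python
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes

class AvgTempMapper(api.Mapper):
    def map(self, context):
        # Assumed record layout: country,year,temperature
        country, _year, temp = context.value.split(",")
        context.emit(country, float(temp))

class AvgTempReducer(api.Reducer):
    def reduce(self, context):
        temps = list(context.values)
        # Emit the overall average temperature for the country
        context.emit(context.key, sum(temps) / len(temps))

def main():
    pipes.run_task(pipes.Factory(AvgTempMapper, reducer_class=AvgTempReducer))

if __name__ == "__main__":
    main()
```

A job like this would be launched with Pydoop's `pydoop submit` command, pointing at the crawler's output directory in HDFS.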
The database of choice is MySQL (Sqoop and mysql-connector-java, listed above, handle the transfer between HDFS and MySQL).
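Once exported, the aggregated results can be queried directly from MySQL. An illustrative sketch using mysql-connector-python (the connection parameters and the `highest_averages` table name are hypothetical):

```python
import mysql.connector  # pip install mysql-connector-python

# Connection parameters are placeholders; adjust to your setup.
conn = mysql.connector.connect(
    host="localhost", user="climate", password="secret", database="climate_crawler"
)
cursor = conn.cursor()
# `highest_averages` is a hypothetical results table exported from HDFS.
cursor.execute("SELECT country, avg_value FROM highest_averages LIMIT 10")
for country, avg_value in cursor.fetchall():
    print(country, avg_value)
cursor.close()
conn.close()
```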
The web interface is built with Node.js and React.