As a Data Scientist, we encounter few major challenges while dealing with large volume of data:
- Popular libraries like Numpy, Pandas are not designed to scale beyond a single core/processor.
- Numpy, Pandas, Scikit-Learn are not designed to scale beyond a single machine (Scikit-Learn can utilize multiple cores).
- For a laptop or workstation, RAM is often limited to 16 or 32 GB. For Numpy, Pandas, Scikit-Learn, data needs to be loaded into RAM. So, if size of the data exceeds size of the main memory, these libraries can't be used.
In this presentation, I discuss how these challenges can be addressed using open source, parallel computation library Dask.
This talk was delivered as a part of Bangalore Python User Group, BangPyper's Dec' 2020 meetup.
- Recording of the session: here
- Slides: https://speakerdeck.com/arnabbiswas1/scale-up-your-data-science-work-flow-using-dask
conda env create -f environment.yml
conda activate dask-tutorial
jupyter labextension install dask-labextension
jupyter serverextension enable dask_labextension
Data used in this project is from the Kaggle Competition "INGV - Volcanic Eruption Prediction"(https://www.kaggle.com/c/predict-volcanic-eruptions-ingv-oe). Please download the data from Kaggle and place into the data
directory. Once done, the content of the data
directory should have the following files & directories:
sample_submission.csv
test/
train/
train.csv