This repository provides two ways to set up your environment: 1) Docker, or 2) local environment configuration.
We recommend Docker because we keep our latest environments up-to-date within the Dockerfiles.
Dockerfiles are found in ./docker:
(CPU) milesg-cpu-Dockerfile
(GPU) milesg-Dockerfile
If you are building the image for the first time, cd into the main directory of this repo and execute either of the following in your terminal:
NOTE: The images are available on DockerHub, so you can run:
docker pull milesg/kaggle-icc:latest
docker pull milesg/kaggle-icc-cpu:latest
...to avoid a long build process locally.
(CPU)
docker-compose --file docker-compose-cpu.yml up
(GPU)
docker-compose up
Note: Containers using nvidia-docker can be used with docker-compose by setting
/etc/docker/daemon.json to the following:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
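After editing /etc/docker/daemon.json, the Docker daemon must be restarted for the new default runtime to take effect; on a systemd-based Linux system that is:
sudo systemctl restart docker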
The easiest way to launch our docker container is:
(CPU)
docker run --rm -v $(pwd):/code -p 8888:8888 milesg/kaggle-icc-cpu jupyter notebook --allow-root --ip 0.0.0.0 --port 8888 --no-browser
(GPU)
docker run --runtime=nvidia --rm -v $(pwd):/code -p 8888:8888 milesg/kaggle-icc jupyter notebook --allow-root --ip 0.0.0.0 --port 8888 --no-browser
If you prefer not to use Docker, there are two Anaconda environment files you can use to create the same conda environments used in the Docker images:
conda-env-cpu.yml & conda-env-gpu.yml
for CPU and GPU environments, respectively.
Create an environment with:
conda env create --file conda-env-cpu.yml --name icc-cpu
Update an environment with:
conda env update --file conda-env-cpu.yml --name icc-cpu
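Then activate the environment before running any of the code below:
conda activate icc-cpu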
This assumes the data has been loaded into the data directory from the competition page. If the files are compressed in .7z format and your system has 7z installed, they will be decompressed automatically.
from icc.data_loader import DataLoader
X, y = DataLoader.load_train() # type: pd.DataFrame, pd.Series
X = DataLoader.load_test() # type: pd.DataFrame
sample = DataLoader.load_sample_submission() # type: pd.DataFrame
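As a quick sanity check (assuming, per the type comments above, that X is a pandas DataFrame and y is a binary pandas Series):

print(X.shape)           # (n_samples, n_features)
print(y.value_counts())  # class balance of the binary target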
This is the EXACT same data that will be passed to your model(s) within the StackedClassifier.
See icc.models.example_model.py for a full example.
Your model MUST inherit from sklearn.base.BaseEstimator and implement the following (a minimal sketch follows this list):
fit(X, y) -> fit your model and return self
predict(X) -> return a 1d array of predicted classes
predict_proba(X) -> return an array of shape [n_samples, 2] (probabilities of class 0 and class 1, e.g. [[0.4, 0.6], ...])
get_params(deep=True) -> return a dict of the parameters specified in your model's __init__
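For illustration, here is a minimal sketch of a conforming model; the class name and the wrapped LogisticRegression are our own choices for the example, not part of this repo, and it assumes X holds numeric features. Note that inheriting from BaseEstimator gives you get_params() for free, derived from your __init__ signature.

from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

class LogisticBaseline(BaseEstimator):
    """Hypothetical minimal model satisfying the interface above."""

    def __init__(self, C=1.0):
        self.C = C  # stored on self so BaseEstimator.get_params() can report it

    def fit(self, X, y):
        self._clf = LogisticRegression(C=self.C)
        self._clf.fit(X, y)
        return self  # fit must return self

    def predict(self, X):
        return self._clf.predict(X)  # 1d array of predicted classes

    def predict_proba(self, X):
        return self._clf.predict_proba(X)  # shape [n_samples, 2]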
If you can't get Docker or the conda environment to work, you can develop your model in your own environment, so long as the libraries you use are the ones we use and your model runs in the following code:
from icc.models import YourSweetModel
from icc.data_loader import DataLoader
from sklearn.model_selection import cross_val_score
# Load training data and do cross val scoring
X, y = DataLoader.load_train()
model = YourSweetModel() # Your model should take no args
cross_val_score(model, X, y, scoring='neg_log_loss')
# Fit and predict on actual test data
model.fit(X, y)
xTest = DataLoader.load_test()
model.predict(xTest)
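cross_val_score returns one score per fold; to summarize performance you might do the following (scikit-learn's neg_log_loss is the negated log loss, so we flip the sign; lower is better):

scores = cross_val_score(model, X, y, scoring='neg_log_loss')
print('mean log loss: %.4f' % -scores.mean())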
If your model makes it through this gauntlet, then it has a VERY good chance of working in the overall model! Submit a Pull Request today! :)
Example:
from icc.ml_stack import StackedClassifier
from sklearn.base import BaseEstimator

@StackedClassifier.register
class MySweetModel(BaseEstimator):
def __init__(self):
...
def fit(self, X, y):
...
def predict(self, X):
...
def predict_proba(self, X):
...
def get_params(self, deep=True):
...
Then run the stack:
from icc.ml_stack import run_stack
run_stack()
Or from the command line:
python icc/ml_stack.py
This outputs a data/submission.csv file ready for Kaggle.