Entry for the Titanic: Machine Learning from Disaster competition on Kaggle.
If you like KaggleTitanicModels, give it a star, or fork it and contribute!
- R version 3.2.0 or higher
- The caret package plus dependencies and suggestions
- The rpart package for feature engineering
- The doParallel package for parallelising training
To install the required libraries in an R session:
install.packages("caret", dependencies = c("Depends", "Suggests"))
install.packages("rpart") # rpart should be installed with above command
install.packages("doParallel")
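
Once the packages are installed, a parallel backend is usually registered before training so that caret can spread resamples across workers. The following is a minimal sketch rather than the code in this repository; the worker count of 2 and the variable names are assumptions chosen for illustration.

```r
library(doParallel)  # also attaches foreach and parallel

# Two workers is an arbitrary choice for illustration;
# parallel::detectCores() reports what the machine offers.
n_workers <- 2
cl <- makePSOCKcluster(n_workers)
registerDoParallel(cl)

# caret::train() picks up the registered backend on its own,
# because trainControl(allowParallel = TRUE) is the default.
workers_seen <- getDoParWorkers()

# Release the workers once training has finished.
stopCluster(cl)
registerDoSEQ()
```

Stopping the cluster and falling back to the sequential backend avoids leaving idle R processes behind after a long training run.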
Feature engineering is based on Trevor Stephens' tutorial.
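
As a rough illustration of the kind of features that tutorial derives, the sketch below extracts a passenger title from the Name column and computes a family size. It assumes the Kaggle column names (Name, SibSp, Parch) and uses a few illustrative rows; the actual 3-feature-engineering.R may do this differently and builds further features.

```r
# Toy rows in the shape of Kaggle's train.csv (values are illustrative).
passengers <- data.frame(
  Name  = c("Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina"),
  SibSp = c(1, 1, 0),
  Parch = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Title: the text between the comma and the first period that follows it.
passengers$Title <- sub("^[^,]*, *([^.]+)\\..*$", "\\1", passengers$Name)

# FamilySize: siblings/spouses + parents/children + the passenger.
passengers$FamilySize <- passengers$SibSp + passengers$Parch + 1
```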
Predictive models are built for most of the caret classification methods.
Ten-fold cross-validation is used with a wide variety of classification methods, including trees, rules, boosting, bagging, neural networks, linear modeling, discriminant analysis, generalised additive modeling, support vector machines, random forests and clustering.
Currently 97 classification methods run successfully. A number of slow and problematic methods were excluded.
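
The general pattern of running many caret methods under the same 10-fold cross-validation, while skipping any that fail, might look like the sketch below. It is not the code from 4-build-models.R: the synthetic data, the short method list, the deliberately invalid `notAMethod` entry and the `fit_one` helper are all assumptions made for illustration.

```r
library(caret)

set.seed(42)
# Synthetic two-class data standing in for the engineered Titanic features.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Survived <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# A short, arbitrary list; "notAMethod" is deliberately invalid.
methods <- c("rpart", "lda", "notAMethod")

# Hypothetical helper: fit one method and return a one-row summary,
# or NULL when the method fails.
fit_one <- function(m) {
  tryCatch({
    fit <- train(Survived ~ ., data = toy, method = m, trControl = ctrl)
    best <- which.max(fit$results$Accuracy)
    data.frame(method   = m,
               accuracy = fit$results$Accuracy[best],
               kappa    = fit$results$Kappa[best],
               runtime  = unname(fit$times$everything["elapsed"]),
               stringsAsFactors = FALSE)
  }, error = function(e) NULL)
}

# rbind() drops the NULL entries left by failed methods.
results <- do.call(rbind, lapply(methods, fit_one))
results <- results[order(-results$accuracy), ]
```

Wrapping each fit in `tryCatch` keeps one failing method from aborting the whole run, which matters when the list grows to a hundred or so methods.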
One of the most accurate caret classification methods is avNNet, which is one of the neural network methods built on the venerable nnet package.
The Survived classes are reasonably balanced, so accuracy is an acceptable performance metric. It is also the metric used on the Kaggle leaderboard.
Confusion matrix for avNNet method on 10-fold cross-validated training data:

    Cross-Validated (10 fold) Confusion Matrix

    (entries are percentual average cell counts across resamples)

              Reference
    Prediction    0    1
             0 55.4  9.3
             1  6.2 29.1

     Accuracy (average) : 0.8452

Confusion matrix for avNNet method on Kaggle leaderboard data:

          0   1
      0 215  45
      1  52 106

    Accuracy : 0.7679

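
The reported accuracies can be checked directly from the two matrices above. The sketch below recomputes them in base R, together with Cohen's kappa and the share of the majority class (the accuracy obtained by always predicting that class). The helper names `accuracy` and `kappa_stat` are made up for this example; the small difference from the printed 0.8452 comes from the matrix cells being rounded to one decimal place.

```r
# Cross-validated matrix (percentages); rows = prediction, columns = reference.
cv <- matrix(c(55.4,  9.3,
                6.2, 29.1),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Leaderboard matrix (counts), as printed above.
lb <- matrix(c(215,  45,
                52, 106),
             nrow = 2, byrow = TRUE)

# Accuracy: share of cases on the diagonal.
accuracy <- function(m) sum(diag(m)) / sum(m)

# Cohen's kappa: agreement beyond what the margins would give by chance.
kappa_stat <- function(m) {
  p <- m / sum(m)
  observed <- sum(diag(p))
  expected <- sum(rowSums(p) * colSums(p))
  (observed - expected) / (1 - expected)
}

cv_accuracy    <- accuracy(cv)                # 0.845
lb_accuracy    <- accuracy(lb)                # about 0.7679
cv_kappa       <- kappa_stat(cv)              # about 0.667
majority_share <- max(colSums(cv)) / sum(cv)  # 0.616
```

A majority share of roughly 0.62 is what makes plain accuracy a reasonable metric here: a model has to do noticeably better than 62% to be useful, and the cross-validated results clear that bar comfortably.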
The 20 caret classification methods with highest 10-fold cross-validation accuracies for the Titanic competition are included in the table below:
| method name | accuracy | kappa | runtime (secs) |
|-------------------|----------|--------|----------------|
| xgbDART | 0.8452 | 0.6667 | 598.786 |
| avNNet | 0.8384 | 0.6475 | 71.991 |
| wsrf | 0.8384 | 0.6498 | 123.800 |
| C5.0 | 0.8373 | 0.6496 | 15.838 |
| C5.0Cost | 0.8373 | 0.6496 | 25.714 |
| deepboost | 0.8363 | 0.6431 | 208.503 |
| svmLinear2 | 0.8363 | 0.6475 | 92.594 |
| svmLinearWeights | 0.8363 | 0.6475 | 196.161 |
| svmLinearWeights2 | 0.8362 | 0.6504 | 126.733 |
| svmPoly | 0.8351 | 0.6451 | 685.023 |
| pda | 0.8340 | 0.6442 | 3.151 |
| sda | 0.8340 | 0.6442 | 3.721 |
| svmLinear | 0.8340 | 0.6425 | 43.733 |
| cforest | 0.8329 | 0.6339 | 158.293 |
| bagFDAGCV | 0.8306 | 0.6362 | 144.936 |
| gbm | 0.8306 | 0.6354 | 9.717 |
| nnet | 0.8306 | 0.6317 | 13.312 |
| glmnet | 0.8295 | 0.6352 | 8.829 |
| regLogistic | 0.8295 | 0.6356 | 172.626 |
| glmboost | 0.8284 | 0.6333 | 10.090 |
Note: The xgbDART method has surprisingly bad performance on the Kaggle leaderboard.
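
The table above is sorted by accuracy, but it can also be read as a trade-off between accuracy and run time. The sketch below takes six rows copied from the table and keeps only the methods for which no faster method is at least as accurate (a Pareto front). This is one possible way to weigh the two columns, not something the repository does.

```r
# Six rows copied from the table above.
results <- data.frame(
  method   = c("xgbDART", "avNNet", "C5.0", "pda", "gbm", "glmnet"),
  accuracy = c(0.8452, 0.8384, 0.8373, 0.8340, 0.8306, 0.8295),
  runtime  = c(598.786, 71.991, 15.838, 3.151, 9.717, 8.829),
  stringsAsFactors = FALSE
)

# Plain orderings by each column.
by_accuracy <- results[order(-results$accuracy), ]
by_runtime  <- results[order(results$runtime), ]

# Pareto front: walking from fastest to slowest, keep a method only if it
# beats the best accuracy seen among all faster methods.
prev_best <- c(-Inf, head(cummax(by_runtime$accuracy), -1))
pareto <- by_runtime[by_runtime$accuracy > prev_best, ]
```

On these six rows the front consists of pda, C5.0, avNNet and xgbDART; gbm and glmnet drop out because pda is both faster and more accurate.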
These files demonstrate how to build models for most of the supported caret classification methods:
- 1-load.R
  - Literally just loads the data
- 2-clean.R
  - No cleaning this time!
  - There are quite a few missing values, but some imputation is attempted in the feature engineering section
- 3-feature-engineering.R
  - Based on Trevor Stephens' tutorial
- 4-build-models.R
  - Uses 10-fold cross-validation with a wide variety of caret classification methods
  - Some problematic and slower methods are excluded
- 5-submission.R
  - Prepares the CSV file for Kaggle submission
- KaggleTitanicModels.RData
  - An R session image containing 97 successfully built classification methods
  - Large file by GitHub standards (84 MB)
The R files can be run in sequence, or the R session image can be loaded instead.
Clone repository:
git clone https://github.com/makeyourownmaker/KaggleTitanicModels
cd KaggleTitanicModels
Either run files in sequence in an R session:
setwd("KaggleTitanicModels")
source("1-load.R", echo = TRUE)
source("2-clean.R", echo = TRUE)
source("3-feature-engineering.R", echo = TRUE)
source("4-build-models.R", echo = TRUE)
source("5-kaggle-submission.R", echo = TRUE)
Or load R session image in an R session:
setwd("KaggleTitanicModels")
load("KaggleTitanicModels.RData")
- Fix some of the failing methods
  - Except any methods that depend on rJava
  - Except any methods not on CRAN, which includes mxnet
- Improve caret hyperparameter tuning
  - Caret supports grid search and random search but not Bayesian optimisation
  - Try adaptive resampling to tune hyperparameters in a way that concentrates on values that are close to the optimal settings
- Improve feature engineering
  - Neural networks and other methods would benefit from scaling and centering
  - Others have looked at adding a Cabin deck variable based on the Cabin column
  - Consider adding interaction terms
  - Additional passenger information is available from the Encyclopedia Titanica
- Add more detailed diagnostics for best performing methods
  - Resampling boxplots
  - ROC plots
- Re-order classification methods
  - By accuracy
  - By run time
  - Or some compromise between the accuracy and run time
- Caret method limitations
  - Some of the caret methods only expose a subset of the tuning parameters from the underlying libraries
  - Other caret methods are somewhat limited in the feature interactions they support
- I'm not going to build ensembles of models
  - Diminishing returns set in quickly (time would be better spent on feature engineering)
  - caretEnsemble is a great library if you're interested in that sort of thing
- Kaggle Titanic repositories on GitHub
- Kaggle Titanic kernels on Kaggle
- Titanic passenger list from Encyclopedia Titanica
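
Two of the feature engineering ideas listed above, a Cabin deck variable and centering and scaling, can be sketched in base R as follows. The toy rows, the `"U"` placeholder for unknown decks and the choice of Age and Fare as the columns to scale are assumptions for illustration only. Within caret, passing `preProcess = c("center", "scale")` to `train()` is an alternative that applies the same transformation inside each resample.

```r
# Toy rows; Cabin is an empty string when unknown (values are illustrative).
passengers <- data.frame(
  Cabin = c("C85", "", "E46", "B57 B59 B63 B66"),
  Age   = c(38, 26, 54, 23),
  Fare  = c(71.28, 7.93, 51.86, 262.38),
  stringsAsFactors = FALSE
)

# Deck: first character of Cabin; "U" is a made-up label for unknown cabins.
passengers$Deck <- ifelse(passengers$Cabin == "", "U",
                          substr(passengers$Cabin, 1, 1))

# Centre and scale the numeric columns to mean 0 and standard deviation 1.
numeric_cols <- c("Age", "Fare")
passengers[numeric_cols] <- as.data.frame(scale(passengers[numeric_cols]))
```

When this is applied to real data, the means and standard deviations should come from the training set and then be reused on the test set, so that both are transformed identically.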
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.