AmimerNabil/Random-Forest-Implementation

Random Forest Classifier Implementation 🌲

Description

Hello everyone! This project is a simple implementation of a random forest from scratch in Python. The goal is to build it from scratch in order to understand the inner workings of this machine learning tool.

I got interested in this subject right after my first class of probability and statistics (fall 2021) and I am still fresh to the subject.

Random Forests

Random forests are a machine learning tool used for both classification and regression problems. They are often among the most accurate off-the-shelf learning algorithms. Single decision trees lack flexibility and are prone to overfitting. When designing a decision tree, we have to define a stopping criterion and the amount of pruning to be done. With too much pruning, the model will not fit the training set we feed it (underfitting). With too little pruning, the model will overfit the data, which increases its variance and hurts its accuracy on new data.

Random forests are a way to fix this issue: by growing many large trees (with no pruning) and combining their predictions, we greatly improve the performance of the prediction model. On the downside, a collection of many large trees is very hard to interpret.
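As a rough illustration of what "combining" the trees means for classification, here is a minimal sketch of a majority vote. The function name `majority_vote` and the toy votes are made up for this example and are not taken from the repository's code.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most trees.

    When several labels are tied, the one that was seen first wins.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for a single sample.
votes = ["sick", "healthy", "sick", "sick", "healthy"]
forest_prediction = majority_vote(votes)
```

Each tree may be wrong on its own, but the vote only fails when most trees are wrong on the same sample.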

For more information and details on random forests: Random Forest Documentation

Details of implementation

The random forest implemented here follows the CART approach to building its trees. For each tree, we first create a bootstrap dataset by randomly selecting rows from the dataset with replacement, and then grow the branches of the tree by selecting the best split at each node.
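The bootstrap step can be sketched with pandas as follows. This is only an illustration: `bootstrap_sample` and the toy dataframe are hypothetical, and the repository may draw its sample differently.

```python
import pandas as pd


def bootstrap_sample(frame, seed=None):
    """Draw len(frame) rows from frame, with replacement."""
    return frame.sample(n=len(frame), replace=True, random_state=seed)


# Toy data, made up for the example.
toy = pd.DataFrame({"age": [63, 41, 57, 50, 45], "target": [1, 0, 1, 0, 0]})
boot = bootstrap_sample(toy, seed=0)
```

Because rows are drawn with replacement, some rows usually appear several times in `boot` while others are left out, which is what makes each tree see a slightly different dataset.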

The best split is determined using:

  • for continuous variables:
    • the minimum squared error
  • for categorical variables:
    • the Gini index

When creating the branches of a tree, missing values are handled in a simplified way, which can be found in the "createBranches" method in DecisionTree.py.

There is still a lot of room for improvement (which I plan to work on), but the general purpose has been accomplished.
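For reference, the two criteria named above can be written in a few lines. This is a sketch of the textbook definitions only. The helper names are hypothetical, and how "createBranches" actually applies them is not shown here.

```python
from collections import Counter


def gini(labels):
    """Gini index of a group of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Gini index of a split: the size-weighted average of both sides."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def squared_error(values):
    """Sum of squared deviations from the mean of a group of numbers."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


# A split that separates the classes perfectly scores 0.0,
# while an evenly mixed two-class group scores 0.5.
perfect_split = weighted_gini(["a", "a"], ["b", "b"])
mixed_group = gini(["a", "b"])
```

In both cases the best split is the candidate with the lowest score.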

Structure of the project

-> The DataSet folder: it contains the data sets that are used for testing the random forest.

-> [RandomForest.py, DecisionTree.py, Nodes.py]: files used to create an instance of a random forest.

How to use it:

The random forest now works and can be used to create random forest classifiers.


To use the random forest:

  0. import the necessary modules:

# -*- coding: utf-8 -*-
import random  # used further down to pick a random test row

import pandas as pd
import RandomForest as RF

  1. create the pandas DataFrame:

# creation of the pandas DataFrame
dataFrame = pd.read_csv("DataSets/processed.cleveland.data")

  2. create a dataTypeClassifier:
# A dictionary that records, for each column (covariate) of the dataframe,
# whether it contains categorical or continuous variables.
# 0 : categorical || 1 : continuous
dataTypeClassifier = {}

# fills dataTypeClassifier in place
RF.defineDataTypeClassifier(dataFrame, dataTypeClassifier)

Doing this step before creating the random forest lets you change the type assigned to any variable in the dataTypeClassifier.

This implementation has a specific way of determining whether a covariate is continuous or categorical (it can be found in defineDataTypeClassifier in DecisionTree.py).

You can change the type of a covariate in this manner ->

dataTypeClassifier["name of attribute (covariate)"] = 0
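To make the idea concrete, here is one possible heuristic for guessing the type of each column. It is an assumption made for illustration, not the rule used by defineDataTypeClassifier: the function `guess_data_types`, its `max_categories` threshold and the toy data are all hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def guess_data_types(frame, max_categories=10):
    """Map each column name to 0 (categorical) or 1 (continuous).

    Assumed rule: a column is categorical when it is not numeric or
    when it has at most max_categories distinct values.
    """
    types = {}
    for column in frame.columns:
        series = frame[column]
        few_values = series.nunique(dropna=True) <= max_categories
        is_categorical = few_values or not is_numeric_dtype(series)
        types[column] = 0 if is_categorical else 1
    return types


# Toy data, made up for the example.
toy = pd.DataFrame({
    "sex": [0, 1, 1, 0, 1, 0],
    "chol": [233.0, 286.0, 229.0, 250.0, 204.0, 236.0],
    "thal": ["3.0", "?", "7.0", "3.0", "7.0", "3.0"],
})
types = guess_data_types(toy, max_categories=3)
```

Any automatic rule of this kind will misjudge some columns (for example an integer-coded category with many levels), which is why being able to override an entry by hand is useful.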
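  3. define a training and a test set for the random forest:

# splitting of the dataframe into a training and a testing set.
# The first 80% of the rows are used for training here; a fixed cut-off
# such as 4000 leaves the test set empty when the data set has fewer rows.
splitIndex = int(len(dataFrame) * 0.8)
dataFrameTrain = dataFrame.iloc[:splitIndex].copy()
dataFrameTest = dataFrame.iloc[splitIndex:].copy()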
  4. create the random forest using the training set:

# adjust "rings" (apparently the name of the column to predict)
# to a column that exists in your own data set
rf = RF.RandomForest(dataFrameTrain, dataTypeClassifier, 50, "rings")

# you can then test it on a randomly chosen row of dataFrameTest in this manner
randomIndex = random.choice(dataFrameTest.index)
member = dataFrameTest.loc[randomIndex]

rf.predict(member)
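Predicting a single row gives little indication of how well the forest does overall. A minimal sketch of an accuracy check over a whole test set could look like the code below. The helper `accuracy` is hypothetical, and it assumes that the prediction function takes one row and returns a label that can be compared with the target column. A stand-in predictor is used so that the example runs without the repository's modules.

```python
import pandas as pd


def accuracy(predict, test_frame, target):
    """Fraction of rows of test_frame whose predicted label equals row[target]."""
    if len(test_frame) == 0:
        raise ValueError("the test set is empty")
    hits = sum(predict(row) == row[target] for _, row in test_frame.iterrows())
    return hits / len(test_frame)


# Toy test set and a stand-in for rf.predict, both made up for the example.
toy_test = pd.DataFrame({"x": [1, 2, 3, 4], "label": [0, 0, 1, 1]})


def stand_in_predict(row):
    return int(row["x"] > 2)


score = accuracy(stand_in_predict, toy_test, "label")
```

With the forest from step 4, the call would presumably be `accuracy(rf.predict, dataFrameTest, "rings")`, under the same assumption about what `rf.predict` returns.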

What is left to do?

  • The current implementation only supports classification. I would like to implement its counterpart for regression problems.
  • Implement concrete evaluation tools for the random forest.
  • Implement visual tools to help interpret single decision trees.

Licence :

MIT License

Copyright (c) [2022] [Nabil Amimer]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
