The initial software comes from the amazing tutorial "How to Implement the Backpropagation Algorithm From Scratch In Python" by Jason Brownlee:
https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/
You should read this tutorial, which outlines the following steps:
- Initialize Network
- Forward Propagation
- Backpropagation
- Train Network
- Predict
- Seeds Dataset Case Study
I created this git repository to sum up what I've learned and to add some features proposed by Jason Brownlee in the "Extensions" part of his tutorial.
To understand backpropagation calculations through a concrete example, take a look at "A Step by Step Backpropagation Example" by Matt Mazur:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
*backpropagation.py* implements a **multilayer feed forward neural network**.
Input layer
There is no activation function because we want to keep the characteristics of the raw input vector.
Hidden layer
Five neurons are defined.
Note: more hidden layers can be added with the custom network initialization function (cf. initialize_network_custom(tab)).
The sigmoid and tanh activation functions are available; the choice is a parameter of the evaluate_algorithm() function.
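For reference, here is a minimal sketch of such an initialization, following the tutorial's initialize_network() function (the repo's initialize_network_custom(tab) presumably generalizes it to an arbitrary list of layer sizes):

```python
from random import random, seed

def initialize_network(n_inputs, n_hidden, n_outputs):
    # One hidden layer and one output layer; each neuron stores its weights,
    # with the bias kept as the last weight.
    network = []
    hidden_layer = [{'weights': [random() for _ in range(n_inputs + 1)]}
                    for _ in range(n_hidden)]
    network.append(hidden_layer)
    output_layer = [{'weights': [random() for _ in range(n_hidden + 1)]}
                    for _ in range(n_outputs)]
    network.append(output_layer)
    return network

seed(1)
# Seeds dataset: 7 input features, 5 hidden neurons, 3 output classes.
network = initialize_network(7, 5, 3)
```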
Output layer
There are three neurons. The number of output neurons is defined by the number of classes found in the dataset outputs. (Here, we are solving a classification problem.)
In classification problems, best results are achieved when the network has one neuron in the output layer for each class value.
The output values are translated into one-hot encoding to match the network outputs.
Our output layer uses the same activation function as the hidden layer (sigmoid or tanh).
To predict the class with the largest probability for a given input vector, we use the arg max function.
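A minimal sketch of that prediction step (forward_propagate() is sketched in the "Forward propagation" entry below):

```python
def predict(network, row):
    # The network outputs one value per class, e.g. [0.1, 0.7, 0.2].
    outputs = forward_propagate(network, row)
    # Arg max: the index of the largest output is the predicted class (here, 1).
    return outputs.index(max(outputs))
```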
The training process uses the Stochastic Gradient Descent (SGD) optimization algorithm, also known as an online machine learning algorithm.
Note: optimization is the mechanism that adjusts the weights to increase the accuracy of the predictions.
This network is trained and tested using k-fold cross-validation on the seeds_dataset.csv dataset.
As k = 5, five models are fitted and evaluated on five different hold-out sets. Each model is trained for 500 epochs.
The sum of squared errors between the expected outputs and the network outputs is accumulated at each epoch.
The dataset describes wheat seeds. The inputs are normalized to the range (0, 1).
Feed forward neural network
A neural network without cycles between neurons (e.g. no connection between layer N and layer N-2).
Forward propagation
Computes the output of the neural network by propagating the input signals through its layers.
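A minimal sketch of this step, following the tutorial's forward_propagate() (shown with the sigmoid transfer; tanh works the same way):

```python
from math import exp

def activate(weights, inputs):
    # Weighted sum of the inputs; the last weight acts as the bias.
    return weights[-1] + sum(w * x for w, x in zip(weights[:-1], inputs))

def transfer(activation):
    # Sigmoid activation; math.tanh(activation) is the alternative.
    return 1.0 / (1.0 + exp(-activation))

def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            # Each neuron stores its output ('output'), reused by backpropagation.
            neuron['output'] = transfer(activate(neuron['weights'], inputs))
            new_inputs.append(neuron['output'])
        inputs = new_inputs  # this layer's outputs feed the next layer
    return inputs
```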
Gradient
The gradient (∇f) of a scalar-valued multivariable function f(x, y, …) gathers all its partial derivatives (∂f/∂x, ∂f/∂y, …) into a vector. For example, for f(x, y) = x² + y², ∇f = (2x, 2y).
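To see the definition in action, here is a small finite-difference approximation of a gradient (an illustrative helper, not part of backpropagation.py):

```python
def numerical_gradient(f, point, h=1e-6):
    # Approximate each partial derivative of f at `point` with central differences.
    grad = []
    for i in range(len(point)):
        forward, backward = list(point), list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

# f(x, y) = x^2 + y^2, whose exact gradient is (2x, 2y).
f = lambda p: p[0] ** 2 + p[1] ** 2
print(numerical_gradient(f, [1.0, 2.0]))  # ~[2.0, 4.0]
```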
Gradient descent
It is a first-order optimization algorithm to find the minimum of a function, generally used in ML when it is not possible to solve the equation ∂J(θ)/∂θ = 0 (J is the cost function) directly, i.e. to find all θ which minimize J(θ).
In this ML example, gradient descent will find a local minimum which depends on the initial random weights allocated at the neural network initialization. The negative gradient tells us in which direction to update the weights.
GD is computed at each iteration using: θ := θ - η·∇J(θ) (where η is the learning rate); see the sketch after the loss definitions below.
Loss function
Error for a single training sample (square loss): J(ŷi, yi) = (ŷi - yi)^2, where ŷi = f(θ, b, xi) is the predicted output for the input xi.
Error for the entire training set (Mean Squared Error): MSE(θ) = J(θ) = 1/N · ∑i=(1..N) (ŷi - yi)^2
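A minimal sketch (not the repo's code) of the update rule θ := θ - η·∇J(θ) applied to a one-parameter model ŷ = θ·x with the MSE cost above:

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x, so the minimum is at θ = 2
theta, lr = 0.0, 0.1   # initial parameter θ and learning rate η

for _ in range(50):
    # For J(θ) = 1/N · Σ (θ·xi - yi)^2, the gradient is ∇J(θ) = 2/N · Σ (θ·xi - yi)·xi.
    grad = 2 / len(xs) * sum((theta * x - y) * x for x, y in zip(xs, ys))
    theta -= lr * grad  # θ := θ - η·∇J(θ)

print(theta)            # ≈ 2.0
```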
Classification vs regression
Classification aims to predict a label. The outputs are class labels.
Regression aims to predict a quantity. The outputs are continuous.
Regression tries to predict the outputs of a function from its inputs (= find the relationship between Y and X).
Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y), e.g. y = b0 + b1·x.
Backpropagation
A supervised method (gradient descent) to train networks; see the above-mentioned tutorial for more details.
It updates the weights in a neural network to improve its predictions on a dataset. Here are the SGD steps (sketched in code below):
- For each epoch:
  - For each training pattern:
    1. Forward propagation (update the outputs: 'output')
    2. Backpropagation (update the error of each neuron: 'delta')
    3. Weight update (update the weights according to the errors: 'weights')
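A sketch of this loop, following the tutorial's train_network() (forward_propagate() is sketched above; backward_propagate_error() and update_weights() are the tutorial's other two helpers):

```python
def train_network(network, train, l_rate, n_epoch, n_outputs):
    for epoch in range(n_epoch):
        sum_error = 0.0
        for row in train:                                # each training pattern
            outputs = forward_propagate(network, row)    # step 1: 'output'
            # One-hot expected output (assumes integer class labels 0..n_outputs-1).
            expected = [0.0] * n_outputs
            expected[int(row[-1])] = 1.0
            sum_error += sum((expected[i] - outputs[i]) ** 2
                             for i in range(n_outputs))  # accumulated sum squared error
            backward_propagate_error(network, expected)  # step 2: 'delta'
            update_weights(network, row, l_rate)         # step 3: 'weights'
        print('>epoch=%d, error=%.3f' % (epoch, sum_error))
```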
Dataset
Data used to train and test the network.
Arg max
The argument of the maxima refers to the input(s) at which a function's output value is the highest
(e.g. for x ∈ [0, π], the arg max of sin(x) is x = π/2 and the max of sin(x) is sin(x) = 1).
k-fold cross-validation
It is a procedure used to estimate the skill of the model on new data.
k refers to the number of groups that a given data sample is split into.
Sequence:
- Shuffle the dataset randomly.
- Split the dataset into k groups.
- For each unique group:
  A. Take the group as a hold-out or test data set
  B. Take the remaining groups as a training data set
  C. Fit a model on the training set and evaluate it on the test set
  D. Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores.
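A sketch of the split step, in the spirit of the tutorial's cross_validation_split(); each fold then serves once as the hold-out set:

```python
from random import randrange

def cross_validation_split(dataset, n_folds):
    # Split the dataset into n_folds groups by sampling rows without replacement.
    dataset_split = []
    dataset_copy = list(dataset)
    fold_size = len(dataset) // n_folds
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

folds = cross_validation_split(list(range(10)), n_folds=5)
print(folds)  # e.g. [[3, 8], [0, 5], [9, 1], [6, 4], [2, 7]]
```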
Epoch
One epoch = one cycle (forward + backward) through the entire training dataset (all the "inputs/outputs" rows seen).
One-hot encoding
Starting from an integer encoding where “red” is 1, “green” is 2, and “blue” is 3, a new unique binary variable is added for each integer value:
red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1
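A minimal sketch (illustrative, not the repo's exact code) of building that encoding from class labels:

```python
labels = ['red', 'green', 'blue', 'green']
classes = sorted(set(labels))                 # ['blue', 'green', 'red']
lookup = {label: i for i, label in enumerate(classes)}

def one_hot(label):
    # One binary variable per class; a single 1 at the class's index.
    vector = [0] * len(classes)
    vector[lookup[label]] = 1
    return vector

print([one_hot(l) for l in labels])
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```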