Neural Network

Jupyter Demos

▶️ Demo | Multilayer Perceptron | MNIST - recognize handwritten digits from 28x28 pixel images.

▶️ Demo | Multilayer Perceptron | Fashion MNIST - recognize the type of clothes (Dress, Coat, Sandal, etc.) from 28x28 pixel images.

Definition

Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains. The neural network itself isn’t an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such systems “learn” to perform tasks by considering examples, generally without being programmed with any task-specific rules.

Figure: a biological neuron.

For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as “cat” or “no cat” and using the results to identify cats in other images. They do this without any prior knowledge about cats, e.g., that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the learning material that they process.

An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.

Figure: an artificial neuron.

In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called edges. Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the inner layers multiple times.

Figure: a neural network.

A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists of, at least, three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.

Neuron Model (Logistic Unit)

Here is a model of one neuron unit.

Figure: a single neuron (logistic unit) with inputs $x_1, x_2, x_3$ and output $h_\Theta(x) = \frac{1}{1 + e^{-\Theta^T x}}$.

Here $x_0 = 1$ is a bias unit.

The inputs form a vector:

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$

Weights:

$\Theta = \begin{bmatrix} \Theta_0 \\ \Theta_1 \\ \Theta_2 \\ \Theta_3 \end{bmatrix}$
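To make this concrete, here is a minimal NumPy sketch of such a logistic unit (the function names and example numbers are our own illustration, not code from this repository):

```python
import numpy as np

def sigmoid(z):
    # Logistic activation g(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, theta):
    # Single logistic unit: h(x) = g(theta^T x), where x already includes the bias x0 = 1.
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 3.0])       # [x0, x1, x2, x3], with bias x0 = 1
theta = np.array([0.1, -0.4, 0.2, 0.05])  # [theta0, theta1, theta2, theta3]
print(neuron_output(x, theta))
```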

Network Model (Set of Neurons)

A neural network consists of the neuron units described in the section above.

Let's take a look at a simple example model with one hidden layer.

Figure: an example network with three input units (plus a bias unit), one hidden layer of three units (plus a bias unit), and a single output unit.

$a_i^{(j)}$ - "activation" of unit $i$ in layer $j$.

$\Theta^{(j)}$ - matrix of weights controlling the function mapping from layer $j$ to layer $j + 1$. If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j + 1$, then $\Theta^{(j)}$ has dimensions $s_{j+1} \times (s_j + 1)$. For example, for the first layer: $\Theta^{(1)} \in \mathbb{R}^{3 \times 4}$.

$L$ - total number of layers in the network (3 in our example).

$s_l$ - number of units (not counting the bias unit) in layer $l$.

$K$ - number of output units (1 in our example, but it could be any number of classes for multi-class classification).
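To make the $\Theta^{(j)}$ dimensions concrete, here is a small NumPy shape check for the example network above (a sketch of our own, not repository code):

```python
import numpy as np

# Example network: s1 = 3 input units, s2 = 3 hidden units, s3 = 1 output unit.
s = [3, 3, 1]

# Theta(j) maps layer j to layer j + 1 and has shape (s[j+1], s[j] + 1)
# because of the extra bias column.
thetas = [np.zeros((s[j + 1], s[j] + 1)) for j in range(len(s) - 1)]
print([theta.shape for theta in thetas])  # [(3, 4), (1, 4)]
```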

Multi-class Classification

In order to make the neural network work with multi-class classification we may use the One-vs-All approach.

Let's say we want our network to distinguish whether a pedestrian, a car, a motorcycle or a truck is in the image.

In this case the output layer of our network will have 4 units (the input layer will be much bigger and will contain all the pixels of the image; if all our images are 20x20 pixels, the input layer will have 400 units, each containing the grayscale value of the corresponding pixel).

Figure: a multi-class network with 4 output units.

$h_\Theta(x) \in \mathbb{R}^{4}$

In this case we would expect our final hypothesis to have the following values:

$h_\Theta(x) \approx \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}$ when the image contains a pedestrian,

$h_\Theta(x) \approx \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$ when the image contains a car,

$h_\Theta(x) \approx \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$ when the image contains a motorcycle.

In this case for the training set:

$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$

We would have:

$y^{(i)}$ is one of $\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$
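A minimal NumPy sketch of this label encoding (the helper name and class ordering are our own illustration):

```python
import numpy as np

def one_hot(label, num_classes=4):
    # Encode a class index (e.g. 0 = pedestrian, 1 = car, 2 = motorcycle, 3 = truck)
    # as a one-vs-all target vector y.
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

print(one_hot(1))  # car -> [0. 1. 0. 0.]
```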

Forward (or Feedforward) Propagation

Forward propagation is an iterative process of calculating activations for each layer, starting from the input layer and going towards the output layer.

For the simple network from the section above we can calculate the activations of the second layer based on the input layer and our network parameters:

$a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3)$

$a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)$

$a_3^{(2)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)$

The output layer activation will be calculated based on the hidden layer activations:

$h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})$

Where the $g()$ function may be a sigmoid:

$g(z) = \frac{1}{1 + e^{-z}}$

Figure: the sigmoid function.
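A small sketch that reproduces the sigmoid plot (matplotlib assumed; not part of the repository code):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
g = 1.0 / (1.0 + np.exp(-z))  # sigmoid g(z)

plt.plot(z, g)
plt.xlabel('z')
plt.ylabel('g(z)')
plt.title('Sigmoid')
plt.show()
```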

Vectorized Implementation of Forward Propagation

Now let's convert the previous calculations into a more concise vectorized form.

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$

To simplify the previous activation equations, let's introduce a $z$ variable:

$z_1^{(2)} = \Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3$

$z_2^{(2)} = \Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3$

$z_3^{(2)} = \Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3$

$z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix} = \Theta^{(1)} a^{(1)}, \qquad a^{(2)} = g(z^{(2)})$

Don't forget to add the bias unit (activation) $a_0^{(2)} = 1$ before propagating to the next layer.

$z^{(3)} = \Theta^{(2)} a^{(2)}$

$h_\Theta(x) = a^{(3)} = g(z^{(3)})$
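Here is a minimal NumPy sketch of these vectorized steps for the one-hidden-layer example ($\Theta^{(1)}$ is $3 \times 4$, $\Theta^{(2)}$ is $1 \times 4$; the variable names and random weights are our own illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.random.rand(3, 4)        # maps layer 1 (3 inputs + bias) to layer 2 (3 units)
Theta2 = np.random.rand(1, 4)        # maps layer 2 (3 units + bias) to the output unit

x = np.array([2.0, -1.0, 0.5])       # raw inputs x1, x2, x3
a1 = np.insert(x, 0, 1.0)            # add bias x0 = 1 -> a(1)

z2 = Theta1 @ a1                     # z(2) = Theta(1) a(1)
a2 = np.insert(sigmoid(z2), 0, 1.0)  # a(2) = g(z(2)), then add bias a0(2) = 1

z3 = Theta2 @ a2                     # z(3) = Theta(2) a(2)
h = sigmoid(z3)                      # h_Theta(x) = a(3) = g(z(3))
print(h)
```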

Forward Propagation Example

Let’s take the following network architecture with 4 layers (input layer, 2 hidden layers and output layer) as an example:

Figure: a 4-layer network (input layer, two hidden layers and output layer).

In this case the forward propagation steps would look like the following:

$a^{(1)} = x$

$z^{(2)} = \Theta^{(1)} a^{(1)}$

$a^{(2)} = g(z^{(2)})$ (and add $a_0^{(2)}$)

$z^{(3)} = \Theta^{(2)} a^{(2)}$

$a^{(3)} = g(z^{(3)})$ (and add $a_0^{(3)}$)

$z^{(4)} = \Theta^{(3)} a^{(3)}$

$h_\Theta(x) = a^{(4)} = g(z^{(4)})$
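The same steps generalize to any number of layers. Here is a sketch of a layer-by-layer loop (function and variable names are our own; the layer sizes are an arbitrary example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, thetas):
    # Propagate one example through the network.
    # thetas is a list [Theta1, Theta2, ...]; Theta(l) has shape
    # (units in layer l + 1, units in layer l plus one bias column).
    # Returns the activations of every layer, bias units included except on the output.
    a = x
    activations = []
    for theta in thetas:
        a = np.insert(a, 0, 1.0)   # add bias unit a0 = 1
        activations.append(a)
        a = sigmoid(theta @ a)     # a(l+1) = g(Theta(l) a(l))
    activations.append(a)          # output layer a(L) = h_Theta(x)
    return activations

# Example: 4-layer network with 3 inputs, two hidden layers of 5 units, 4 outputs.
thetas = [np.random.rand(5, 4), np.random.rand(5, 6), np.random.rand(4, 6)]
print(forward_propagation(np.array([0.2, -0.7, 1.5]), thetas)[-1])
```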

Cost Function

The cost function for the neural network is quite similar to the logistic regression cost function.

$J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \big( h_\Theta(x^{(i)}) \big)_k + \big( 1 - y_k^{(i)} \big) \log \big( 1 - ( h_\Theta(x^{(i)}) )_k \big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big( \Theta_{ji}^{(l)} \big)^2$

$h_\Theta(x) \in \mathbb{R}^{K}$

$(h_\Theta(x))_i$ - the $i^{th}$ output.
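A minimal NumPy sketch of this cost, assuming predictions and labels come as $m \times K$ matrices (the function name and parameters are our own illustration):

```python
import numpy as np

def nn_cost(predictions, labels, thetas, lambda_reg=0.0):
    # Regularized cross-entropy cost J(Theta).
    # predictions: (m, K) matrix of h_Theta(x(i)) values from forward propagation.
    # labels:      (m, K) matrix of one-hot target vectors y(i).
    # thetas:      list of weight matrices; bias columns (j = 0) are not regularized.
    m = labels.shape[0]
    cost = -np.sum(labels * np.log(predictions) +
                   (1 - labels) * np.log(1 - predictions)) / m
    reg = sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)  # skip the bias columns
    return cost + lambda_reg * reg / (2 * m)
```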

Backpropagation

Gradient Computation

The backpropagation algorithm has the same purpose as gradient descent has for linear or logistic regression - it corrects the values of the thetas in order to minimize the cost function.

In other words, we need to be able to calculate the partial derivative of the cost function with respect to each theta.

$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta)$

Figure: the example 4-layer network used for backpropagation.

Let’s assume that:

$\delta_j^{(l)}$ - "error" of node $j$ in layer $l$.

For each output unit (layer L = 4):

$\delta_j^{(4)} = a_j^{(4)} - y_j$

Or in vectorized form:

$\delta^{(4)} = a^{(4)} - y$

For the hidden layers ($\circ$ denotes element-wise multiplication):

$\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} \circ g'(z^{(3)})$

$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \circ g'(z^{(2)})$

$g'(z^{(l)})$ - sigmoid gradient.

$g'(z^{(l)}) = a^{(l)} \circ (1 - a^{(l)})$

Now we may calculate the gradient step:

$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)} \delta_i^{(l+1)}$ (ignoring regularization, i.e. for $\lambda = 0$)
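A minimal NumPy sketch of these error terms for the 4-layer example, ignoring regularization (names are our own; $a^{(2)}$ and $a^{(3)}$ are assumed to include their bias units):

```python
import numpy as np

def sigmoid_gradient(a):
    # g'(z) expressed through the activation a = g(z): g'(z) = a * (1 - a).
    return a * (1 - a)

def backward_deltas(a2, a3, a4, y, Theta2, Theta3):
    # a2, a3 include their bias units; a4 is the output layer activation h_Theta(x).
    delta4 = a4 - y                                              # output layer error
    delta3 = (Theta3.T @ delta4)[1:] * sigmoid_gradient(a3[1:])  # drop the bias row
    delta2 = (Theta2.T @ delta3)[1:] * sigmoid_gradient(a2[1:])
    return delta2, delta3, delta4
```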

Backpropagation Algorithm

For the training set

$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$

We need to set:

$\Delta_{ij}^{(l)} = 0$ (for all $l$, $i$, $j$)

Then for $i = 1$ to $m$:

- Set $a^{(1)} = x^{(i)}$
- Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
- Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
- Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$
- $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$

Finally, compute the gradients:

$D_{ij}^{(l)} := \frac{1}{m} \big( \Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)} \big)$ if $j \neq 0$

$D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}$ if $j = 0$

$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$
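Putting it all together, here is a sketch of the full backpropagation loop in NumPy (our own illustration of the algorithm above, not the repository's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backpropagation(X, Y, thetas, lambda_reg=0.0):
    # X: (m, n) examples, Y: (m, K) one-hot labels, thetas: list of weight matrices.
    # Returns the gradient D(l) for every Theta(l).
    m = X.shape[0]
    deltas_acc = [np.zeros_like(theta) for theta in thetas]  # the Delta accumulators

    for x, y in zip(X, Y):
        # Forward propagation, keeping every layer's activation (with bias units).
        activations = []
        a = x
        for theta in thetas:
            a = np.insert(a, 0, 1.0)
            activations.append(a)
            a = sigmoid(theta @ a)
        activations.append(a)

        # Backward pass: delta(L) = a(L) - y, then propagate the errors backwards.
        delta = a - y
        for l in range(len(thetas) - 1, -1, -1):
            a_l = activations[l]
            deltas_acc[l] += np.outer(delta, a_l)  # Delta(l) += delta(l+1) * a(l)^T
            if l > 0:
                delta = (thetas[l].T @ delta)[1:] * a_l[1:] * (1 - a_l[1:])

    # D(l) = Delta(l) / m, with regularization applied to the non-bias columns only.
    grads = []
    for theta, acc in zip(thetas, deltas_acc):
        grad = acc / m
        grad[:, 1:] += (lambda_reg / m) * theta[:, 1:]
        grads.append(grad)
    return grads
```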

Random Initialization

Before starting forward propagation we need to initialize the Theta parameters. We cannot assign zero to all thetas, since this would make our network useless: every neuron in a layer would learn the same thing as its siblings. In other words, we need to break the symmetry. In order to do so we initialize the thetas with small random values:

$-\epsilon \leq \Theta_{ij}^{(l)} \leq \epsilon$
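A minimal NumPy sketch of such an initialization (the epsilon value and function name are our own example choices):

```python
import numpy as np

def random_init(units_in, units_out, epsilon=0.12):
    # Initialize Theta with uniform random values in [-epsilon, epsilon].
    # units_in is the number of units in layer l (a bias column is added),
    # units_out is the number of units in layer l + 1.
    return np.random.uniform(-epsilon, epsilon, size=(units_out, units_in + 1))

Theta1 = random_init(3, 5)  # maps a 3-unit layer (plus bias) to a 5-unit layer
print(Theta1.shape)         # (5, 4)
```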