This project implements a neural network from scratch using Numpy to classify handwritten digits from the MNIST dataset.
The project includes a custom implementation of a fully connected neural network with configurable layers and activation functions. The network is trained to recognize and classify handwritten digits from the MNIST dataset.
- Custom neural network architecture using only Numpy for matrix operations.
- Activation functions: Sigmoid, ReLU, Leaky ReLU, and Tanh.
- Functions for forward propagation and backpropagation.
- Gradient descent optimization for network training.
- Evaluation of model accuracy on test data.
- Visualization of predictions using Matplotlib.
- **Prerequisites**
  - Ensure Python 3.7.9 or a newer version is installed on your system.
  - A basic understanding of Python and neural networks is recommended.
- **Setup**
  - Clone the repository or download the project files to your local machine.
  - Ensure that `MNIST.py`, `Layer_dense.py`, `Network.py`, and any other required files are located in the same directory.
- **Install Dependencies**
  - In your project directory, run `pip install -r requirements.txt` to install the necessary libraries.
  - If you encounter issues, install the libraries individually as listed in the `requirements.txt` file using `pip install`.
- **Prepare Your Data**
  - The MNIST dataset is automatically loaded and preprocessed by the `MNIST.py` script.
  - Each image, originally a 28x28 pixel 2D array, is flattened into a 1D array of 784 features.
  - The training and testing datasets are therefore transformed into 2D arrays with shapes `(60000, 784)` and `(10000, 784)`, respectively, ready for network input.
- **Run the Neural Network**
  - Open your command line interface and navigate to the project directory.
  - Run the script using the command `python MNIST.py`.
  - The script will manage loading the data, training the neural network, and evaluating its performance.
- **Monitor Training**
  - Keep an eye on the console for training progress, including updates on loss and accuracy metrics.
- **Evaluate Results**
  - Upon completion of training, the script will display the final performance metrics on the MNIST dataset.
  - It will also show a 5x5 grid of sample test images for visual evaluation.
- **Customizing the Neural Network**
  - You can customize the network architecture in the `MNIST.py` file.
  - Import the `NeuralNetwork` and `Layer_dense` classes:
    ```python
    from Network import NeuralNetwork
    from Layer_Dense import Layer_dense
    ```
  - Create a `NeuralNetwork` instance and add layers as needed:
    ```python
    network = NeuralNetwork()
    network.add_layer(Layer_dense(input_size, number_of_neurons))
    # Add more layers if necessary
    ```
  - Load your dataset (use MNIST or a custom dataset) and ensure it is in the correct format.
- **Training and Evaluating Your Network**
  - Train the network using:
    ```python
    network.train(data, labels, epochs, batch_size, learning_rate)
    ```
  - Evaluate the network's performance on test data:
    ```python
    test_accuracy = network.evaluate(test_X[1:5000], test_y[1:5000])
    print(f"Test Accuracy: {test_accuracy}%")
    ```
- Layer Configuration: Adjust the number and size of layers to suit your specific task.
- Data Handling: Ensure your data is formatted correctly for the network. The class handles data normalization, but data should be clean and appropriately shaped.
- Tuning Parameters: Experiment with different epochs, batch sizes, and learning rates to optimize performance for your specific problem; the sketch after this list puts these steps together.
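Putting these pieces together, a minimal end-to-end customization might look like the sketch below. This is illustrative only: it assumes the MNIST arrays `train_X`, `train_y`, `test_X`, and `test_y` are already loaded as flattened arrays (as `MNIST.py` does), and that the module names match your local filenames.

```python
from Network import NeuralNetwork
from Layer_Dense import Layer_dense

# Build a custom architecture: 784 inputs -> 128 hidden neurons (ReLU) -> 10 outputs (Sigmoid)
network = NeuralNetwork()
network.add_layer(Layer_dense(784, 128, 'ReLU'))
network.add_layer(Layer_dense(128, 10, 'Sigmoid'))

# Train, then evaluate on a slice of the test set
network.train(train_X, train_y, epochs=10, batch_size=32, learning_rate=0.001)
test_accuracy = network.evaluate(test_X[1:5000], test_y[1:5000])
print(f"Test Accuracy: {test_accuracy}%")
```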
In the following chapters of these notes, we'll briefly look at all the code to understand its inner workings, starting with `MNIST.py`.
The example code in `MNIST.py` showcases how this project can be applied to the MNIST image-classification task.
- Data Preparation: The MNIST dataset is loaded, with each image reshaped into a 1D array (`28x28` pixels to `784` features) to match the input size of the neural network.
- Hyperparameters:
  - Input Size: `784` (flattened `28x28` images)
  - Hidden Layer Size: `64`
  - Output Size: `10` (digits `0-9`)
  - Epochs: `10`
  - Batch Size: `32`
  - Learning Rate: `0.001`
- Network Architecture: The network comprises a hidden layer with ReLU activation and an output layer with Sigmoid activation.
- Training: The network is trained on the MNIST dataset using forward and backward propagation, with training progress displayed through a tqdm progress bar.
```python
# Neural network creation and training
network = NeuralNetwork()
network.add_layer(Layer_dense(784, 64, 'ReLU'))
network.add_layer(Layer_dense(64, 10, 'Sigmoid'))
network.train(train_X, train_y, 10, 32, 0.001)
```
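After training, the script also visualizes a sample of predictions (the 5x5 grid mentioned earlier). A minimal sketch of such a plot, assuming `network.predict` returns class indices and `test_X`/`test_y` hold the flattened test images and their labels, could look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

# Pick 25 test images and predict their classes
sample_idx = np.random.choice(len(test_X), 25, replace=False)
sample_images = test_X[sample_idx]
predictions = network.predict(sample_images / 255.0)   # normalize as during training

fig, axes = plt.subplots(5, 5, figsize=(8, 8))
for ax, image, pred, true in zip(axes.flat, sample_images, predictions, test_y[sample_idx]):
    ax.imshow(image.reshape(28, 28), cmap='gray')      # restore the 28x28 shape for display
    ax.set_title(f'pred {pred} / true {true}', fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()
```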
The NeuralNetwork class is the central piece of this project: it ties together the Utils and Layer_Dense scripts so the network can be trained automatically with very little code.
The Utils script stores two important helper functions that make encoding labels and selecting batches of data much easier to handle. So far, the functions in this file are:
```python
def batch(data, labels, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size], labels[i:i + batch_size]
```
This function lets us collect batches (or subsets) of our total data.
The `batch` function takes three parameters:
- `data`: The complete set of data points (features).
- `labels`: The corresponding labels for the data points.
- `batch_size`: The size of each batch.
```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
labels = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
```
In this code we have two NumPy arrays called `data` and `labels`. If we wanted to split these arrays into batches of three, we can call:
```python
for b_data, b_labels in batch(data, labels, 3):
    print(f"Batch data: {b_data}, Batch labels: {b_labels}")
```
This `for` loop assigns `b_data` and `b_labels` to the arrays yielded by the `batch` function on each of its iterations. Note that:
- 1st iteration: `i = 0`, the starting index. `b_data` = `data[0:3]` = `[1, 2, 3]`; `b_labels` = `labels[0:3]` = `['a', 'b', 'c']`.
- 2nd iteration: `i = 3`, as the loop increments `i` by the batch size, which is 3. `b_data` = `data[3:6]` = `[4, 5, 6]`; `b_labels` = `labels[3:6]` = `['d', 'e', 'f']`. The second batch consists of the next three elements from both `data` and `labels`.
- 3rd iteration: `i = 6`, following the pattern of incrementing by the batch size. `b_data` = `data[6:9]` = `[7, 8, 9]`; `b_labels` = `labels[6:9]` = `['g', 'h', 'i']`. The third batch includes elements 7, 8, and 9 from `data` and their corresponding labels.
- 4th iteration: `i = 9`, again incremented by 3. `b_data` = `data[9:12]` = `[10]`; `b_labels` = `labels[9:12]` = `['j']`. In this final iteration the batch is smaller, because there are not enough elements left in `data` and `labels` to make a full batch of 3. The function handles this gracefully by just returning the remaining elements, as the output below shows.
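Running the loop above prints one line per batch, along these lines:

```text
Batch data: [1 2 3], Batch labels: ['a' 'b' 'c']
Batch data: [4 5 6], Batch labels: ['d' 'e' 'f']
Batch data: [7 8 9], Batch labels: ['g' 'h' 'i']
Batch data: [10], Batch labels: ['j']
```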
The `one_hot_encode` function converts categorical labels into a one-hot encoded format. One-hot encoding is a technique where each categorical label is converted into a binary vector representing the presence or absence of each class.
- `np.eye(num_classes)`: Creates an identity matrix of size `num_classes x num_classes`. Each row in this matrix represents a one-hot encoded vector for a class.
- `[labels]`: Uses the `labels` array to index into the identity matrix. Each element in the `labels` array is used as an index to select the corresponding row from the identity matrix, resulting in a one-hot encoded vector for that label.
Suppose you have a set of labels for a classification task with 3 classes. Here's how the `one_hot_encode` function operates:
```python
import numpy as np

labels = np.array([0, 2, 1, 0, 2])  # Example labels
num_classes = 3                     # Number of classes

def one_hot_encode(labels, num_classes):
    return np.eye(num_classes)[labels]

encoded_labels = one_hot_encode(labels, num_classes)
print(encoded_labels)
```
Outputs:
```text
[[1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
```
- The label 0 is encoded as `[1. 0. 0.]`.
- The label 1 is encoded as `[0. 1. 0.]`.
- The label 2 is encoded as `[0. 0. 1.]`.
Each row in the output is a one-hot encoded vector corresponding to a label from the original labels array.
The Layer_Dense script includes a class containing all the available activation functions and, most importantly, the `Layer_dense` class itself.
This class contains the following activation functions: `Sigmoid`, `ReLU`, `Leaky ReLU`, and `Tanh`, along with their corresponding derivatives, named with a `d_` prefix like so: `d_relu`, `d_sigmoid`, etc.
To use any function in this class, simply import it and call it on an array:
```python
from Layer_dense import Activation

x = Activation.relu(inputs)
```
`x` will be an array with the same shape as `inputs`, where ReLU has been applied to every single entry.
The `Layer_dense` class implements a fully connected neural network layer with customizable activation functions. It is designed for use in building and training neural networks from scratch.
- Fully Connected Layer: Implements a dense layer where each neuron is connected to all neurons in the previous layer.
- Customizable Activation Functions: Supports Sigmoid, ReLU, Tanh, and Leaky ReLU activations, with extendability for more.
- Forward Propagation: Computes the weighted sum of inputs and applies the activation function.
- Backpropagation Support: Facilitates the calculation of gradients for backpropagation.
- Parameter Update: Allows updating the layer's weights and biases using gradient descent.
The constructor takes the following parameters:
- `number_inputs` (int): Number of input features.
- `number_outputs` (int): Number of neurons in the layer.
- `activation_function` (str): Type of activation function to use.
Weights and biases are initialized with small random values for symmetry breaking.
- Sigmoid: Outputs values between 0 and 1, useful for binary classification.
- ReLU: Passes positive values and blocks negatives, useful for non-linear transformations.
- Tanh: Scales output between -1 and 1.
- LeakyReLU: Variation of ReLU that allows small negative values.
- `.Forward(batch_inputs)`: Computes the output of the layer for a given batch of inputs.
- `.BackProp(dvalues)`: Performs backpropagation for the layer using the prediction error (and subsequently the layer's error), calculating gradients with respect to inputs and weights.
- `.update_parameters(learning_rate)`: Updates the layer's weights and biases based on the calculated gradients and a given learning rate.

A minimal sketch of how these methods fit together is shown below.
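The sketch below is purely illustrative: it shows one plausible way such a layer could be written, using ReLU only and assumed attribute names (`weights`, `biases`, `dweights`, `dbiases`); the repository's actual `Layer_dense` dispatches between several activation functions and may differ in detail.

```python
import numpy as np

class Layer_dense:
    """Illustrative sketch of a dense layer (ReLU only), not the repository's exact code."""

    def __init__(self, number_inputs, number_outputs, activation_function='ReLU'):
        # Small random values break symmetry between neurons
        self.weights = 0.01 * np.random.randn(number_inputs, number_outputs)
        self.biases = 0.01 * np.random.randn(1, number_outputs)
        self.activation_function = activation_function

    def Forward(self, batch_inputs):
        self.inputs = batch_inputs
        self.z = batch_inputs @ self.weights + self.biases   # weighted sum
        self.output = np.maximum(0, self.z)                  # ReLU activation
        return self.output

    def BackProp(self, dvalues):
        dz = dvalues * (self.z > 0)                    # apply the ReLU derivative element-wise
        self.dweights = self.inputs.T @ dz             # gradient w.r.t. weights
        self.dbiases = dz.sum(axis=0, keepdims=True)   # gradient w.r.t. biases
        self.dinputs = dz @ self.weights.T             # gradient passed to the previous layer
        return self.dinputs

    def update_parameters(self, learning_rate):
        self.weights -= learning_rate * self.dweights
        self.biases -= learning_rate * self.dbiases
```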
Example usage:
```python
import numpy as np
from Layer_dense import Layer_dense

# Initialize a dense layer with 5 inputs, 10 outputs, and ReLU activation
layer = Layer_dense(5, 10, 'ReLU')

# Example input: 3 samples, 5 features each
inputs = np.random.randn(3, 5)

# Forward pass
output = layer.Forward(inputs)

# Backpropagation with dummy gradients (assuming 10 outputs)
dvalues = np.random.randn(3, 10)
gradients = layer.BackProp(dvalues)

# Update parameters with a learning rate of 0.01
layer.update_parameters(0.01)
```
Layers like these are then stacked into a network:
```python
network = NeuralNetwork()
network.add_layer(Layer_dense(784, 128, 'ReLU'))
network.add_layer(Layer_dense(128, 10, 'Sigmoid'))
```
The `NeuralNetwork` class is designed to represent a multi-layer neural network. It provides a flexible structure for building and training neural networks with varying architectures.
- Layer Management: Facilitates adding multiple layers to the network.
- Forward Propagation: Sequentially processes inputs through all layers.
- Backpropagation: Supports backward propagation of errors for training.
- Parameter Update: Implements parameter updates for all layers using gradient descent.
- Prediction: Ability to predict outputs for given inputs.
- Accuracy Calculation: Computes the accuracy of predictions against true labels.
The neural network is initialized with an empty list of layers: `network = NeuralNetwork()`.
- `.add_layer(layer)`: Adds a layer to the neural network.
- `.forward(inputs)`: Performs a forward pass through the network by sequentially passing the input through each layer.
- `.backward(loss_gradient)`: Executes the backpropagation algorithm through the network in reverse order, starting from the output layer.
- `.update_parameters(learning_rate)`: Updates the parameters (weights and biases) of each layer in the network based on the gradients computed during backpropagation.
- `.predict(inputs)`: Predicts the output for given inputs by performing a forward pass and then applying argmax to the final layer's outputs to obtain the predicted class labels.
- `.calculate_accuracy(predictions, labels)`: Calculates the accuracy of the predictions by comparing them to the true labels. Both inputs are expected to be one-hot encoded.
- `.evaluate`: Evaluates the accuracy of the network by passing a single batch of test data and returning an accuracy value.

A compact sketch of how these methods might fit together follows.
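The sketch below is a simplified approximation built on the `Layer_dense` layer described earlier, not the repository's exact code; the normalization inside `evaluate` and the percentage scaling are assumptions made for illustration.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self):
        self.layers = []                      # the network starts with no layers

    def add_layer(self, layer):
        self.layers.append(layer)

    def forward(self, inputs):
        for layer in self.layers:             # pass the data through each layer in order
            inputs = layer.Forward(inputs)
        return inputs

    def backward(self, loss_gradient):
        for layer in reversed(self.layers):   # propagate the error backwards
            loss_gradient = layer.BackProp(loss_gradient)

    def update_parameters(self, learning_rate):
        for layer in self.layers:
            layer.update_parameters(learning_rate)

    def predict(self, inputs):
        return np.argmax(self.forward(inputs), axis=1)

    def calculate_accuracy(self, predictions, labels):
        # Both arguments one-hot encoded (or probabilities): compare the argmax of each row
        return np.mean(np.argmax(predictions, axis=1) == np.argmax(labels, axis=1))

    def evaluate(self, test_data, test_labels):
        predictions = self.predict(test_data / 255.0)        # normalize raw pixel values
        return np.mean(predictions == test_labels) * 100     # accuracy in percent
```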
The `train` method of the `NeuralNetwork` class is designed to train the neural network using a given dataset. It incorporates forward propagation, loss calculation, backpropagation, and parameter updates in a loop to optimize the network's weights and biases. Below is an explanation of how this method works:
```python
def train(self, train_X, train_Y, epochs, batch_size, learning_rate):
```
- `train_X`: The training data (features).
- `train_Y`: The corresponding labels for the training data.
- `epochs`: The number of times the entire dataset is passed through the network.
- `batch_size`: The number of samples in each batch for training.
- `learning_rate`: The step size used for updating the weights during training.
- **Initialization**: At the start of each training epoch, we initialize `epoch_loss` and `epoch_accuracy` to zero. These variables are crucial for tracking the model's performance throughout the epoch.
  ```python
  for epoch in range(number_of_epochs):
      epoch_loss = 0
      epoch_accuracy = 0
      # Further processing in the loop...
  ```
- **Batch Processing**: The dataset is divided into smaller batches using the `batch` function. For a better user experience, we use `tqdm` to create a dynamic progress bar in the console.
  ```python
  from Utils import batch
  from tqdm import tqdm

  for batch_x, batch_y in tqdm(batch(train_data, train_labels, batch_size)):
      # Batch training steps follow...
  ```
- **Batch Training**:
  - Normalization: Each batch of inputs (`batch_x`) is normalized by scaling pixel values to the range [0, 1].
    ```python
    batch_x /= 255.0
    ```
  - Forward Pass: The `forward` method of the network processes each batch to produce output predictions.
    ```python
    output = network.forward(batch_x)
    ```
  - One-Hot Encoding: Target labels (`batch_y`) are converted into a one-hot encoded format.
    ```python
    batch_y_encoded = one_hot_encode(batch_y, num_classes)
    ```
  - Loss Calculation: Cross-entropy loss is computed as a measure of the network's prediction accuracy.
    ```python
    loss = -np.sum(batch_y_encoded * np.log(output + 1e-7)) / batch_size
    epoch_loss += loss
    ```
  - Accuracy Calculation: The `calculate_accuracy` function assesses how well the network's predictions match the true labels.
    ```python
    accuracy = calculate_accuracy(output, batch_y)
    epoch_accuracy += accuracy
    ```
- **Error Gradient Simplification**: Simplifying the gradient for the categorical cross-entropy loss function, we use the difference between outputs and encoded labels.
  ```python
  dvalues = output - batch_y_encoded
  ```
- **Backward Pass and Parameter Update**:
  - Backward Pass: Backpropagation is performed using the simplified error gradient, calculating gradients for each network layer.
    ```python
    network.backward(dvalues)
    ```
  - Parameter Update: Network parameters (weights and biases) are updated based on the gradients and learning rate.
    ```python
    network.update_parameters(learning_rate)
    ```
- **Progress Update**: After each batch, the `tqdm` progress bar is updated to reflect the current loss and accuracy metrics.
  ```python
  tqdm.set_description(f'Epoch {epoch + 1}/{number_of_epochs} | Loss: {epoch_loss / number_of_batches} | Accuracy: {epoch_accuracy / number_of_batches}')
  ```
- **Epoch Summary**: After training on all batches, the cumulative epoch loss and accuracy are averaged and reported, providing insight into the network's learning progress. A condensed sketch of the full loop follows this list.
  ```python
  avg_epoch_loss = epoch_loss / number_of_batches
  avg_epoch_accuracy = epoch_accuracy / number_of_batches
  print(f'Epoch {epoch + 1} completed: Avg. Loss: {avg_epoch_loss}, Avg. Accuracy: {avg_epoch_accuracy}')
  ```
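Pulling the steps above together, a condensed, illustrative version of the `train` method might look like this. Names such as `number_of_batches` and the `num_classes` argument are assumptions that follow the snippets above, and the one-hot encoded labels are passed to `calculate_accuracy` to match its description; the repository's actual method may differ in detail.

```python
import numpy as np
from tqdm import tqdm
from Utils import batch, one_hot_encode

# Illustrative sketch of NeuralNetwork.train
def train(self, train_X, train_Y, epochs, batch_size, learning_rate, num_classes=10):
    number_of_batches = int(np.ceil(len(train_X) / batch_size))
    for epoch in range(epochs):
        epoch_loss = 0
        epoch_accuracy = 0
        for batch_x, batch_y in tqdm(batch(train_X, train_Y, batch_size)):
            batch_x = batch_x / 255.0                       # normalize pixels to [0, 1]
            output = self.forward(batch_x)                  # forward pass
            batch_y_encoded = one_hot_encode(batch_y, num_classes)

            # cross-entropy loss and running accuracy for this batch
            loss = -np.sum(batch_y_encoded * np.log(output + 1e-7)) / batch_size
            epoch_loss += loss
            epoch_accuracy += self.calculate_accuracy(output, batch_y_encoded)

            # simplified error gradient, backward pass, and parameter update
            dvalues = output - batch_y_encoded
            self.backward(dvalues)
            self.update_parameters(learning_rate)

        print(f'Epoch {epoch + 1} completed: '
              f'Avg. Loss: {epoch_loss / number_of_batches}, '
              f'Avg. Accuracy: {epoch_accuracy / number_of_batches}')
```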
- Sigmoid Function and Derivative:
  $$\begin{gather} \sigma(x) = \frac{1}{1 + e^{-x}} \\ \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x)) \end{gather}$$
- ReLU (Rectified Linear Unit) and Derivative:
  $$\begin{gather} \text{ReLU}(x) = \max(0, x) \\ \frac{d(\text{ReLU}(x))}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \end{gather}$$
- Leaky ReLU and Derivative:
  $$\begin{gather} \text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases} \\ \frac{d(\text{LeakyReLU}(x))}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{otherwise} \end{cases} \end{gather}$$
- Hyperbolic Tangent (Tanh) and Derivative:
  $$\begin{gather} \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \\ \frac{d(\tanh(x))}{dx} = 1 - \tanh^2(x) \end{gather}$$
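These formulas translate directly into NumPy. The sketch below uses the `d_` naming convention described earlier; the exact signatures inside `Layer_dense` may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0, x)

def d_relu(x):
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2
```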
Forward propagation is the process by which a neural network goes from an input to a prediction. Given a neural network with $L$ layers:
- For the input layer ($l=0$), the input $\mathbf{x}$ is provided directly to the network:
  $$\mathbf{a}^{(0)} = \mathbf{x}$$
- For each subsequent layer from $l=1$ to $L$, the forward pass is computed using the following equations:
  $$\begin{gather} \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \\ \mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)}) \end{gather}$$

Where:

- $\mathbf{W}^{(l)}$ is the weight matrix for layer $l$.
- $\mathbf{b}^{(l)}$ is the bias vector for layer $l$.
- $\mathbf{a}^{(l-1)}$ is the activation from the previous layer $l-1$, with $\mathbf{a}^{(0)} = \mathbf{x}$ being the input to the network.
- $\mathbf{z}^{(l)}$ is the weighted input to the activations of layer $l$.
- $\sigma^{(l)}(\cdot)$ is the activation function for layer $l$.
- $\mathbf{a}^{(l)}$ is the output of layer $l$, which will be the input to the next layer, or the final output of the network if $l = L$.
The output after the last layer, $\mathbf{a}^{(L)}$, is the network's prediction.
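In code, this recursion is just a loop over the layers. A minimal sketch, assuming each layer is represented by a weight matrix, a bias vector, and an activation function, looks like this:

```python
import numpy as np

def forward_pass(x, layers):
    """layers: list of (W, b, activation) tuples, one per layer."""
    a = x                                  # a^(0) = x
    for W, b, activation in layers:
        z = W @ a + b                      # z^(l) = W^(l) a^(l-1) + b^(l)
        a = activation(z)                  # a^(l) = sigma^(l)(z^(l))
    return a                               # a^(L): the network's prediction

# Example: a 784 -> 64 -> 10 network with ReLU and Sigmoid activations
relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
layers = [
    (0.01 * np.random.randn(64, 784), np.zeros(64), relu),
    (0.01 * np.random.randn(10, 64), np.zeros(10), sigmoid),
]
prediction = forward_pass(np.random.rand(784), layers)
```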
The calculation of the error term for the final layer, denoted as $\mathcal{E}^{(L)}$, combines the derivative of the cost with respect to the network's output and the derivative of the final activation:
$$\mathcal{E}^{(L)} = \nabla_{\mathbf{a}} \mathcal{J} \odot \sigma'^{(L)}(\mathbf{z}^{(L)})$$
Where:
- $\mathbf{z}^{(L)}$ is the weighted input to the final layer's activation function.
- $\sigma'^{(L)}(\mathbf{z}^{(L)})$ is the derivative of the activation function at the final layer.
- $\mathcal{E}^{(L)}$ is derived by differentiating the cost function (e.g., MSE or Cross-Entropy) with respect to the output neurons and represents the initial error signal that will propagate back through the network.
Once the error at the output layer $\mathcal{E}^{(L)}$ is known, it is propagated backwards through the network. For a hidden layer $l$, the error arriving from the layer above is
$$\mathbf{W}^{(l+1)T} \, \Omega^{(l+1)}$$
Where:
- $\mathbf{W}^{(l+1)}$ are the weights connecting layer $l$ to layer $l+1$.
- $\Omega^{(l+1)}$ is the error term from the subsequent layer (with $\Omega^{(L)} = \mathcal{E}^{(L)}$ at the output).
The sensitivity of the activation function at layer $l$ is then applied element-wise to obtain the layer's error term:
$$\Omega^{(l)} = \bigl(\mathbf{W}^{(l+1)T} \, \Omega^{(l+1)}\bigr) \odot \sigma'^{(l)}(\mathbf{z}^{(l)})$$
Where:
- $\sigma'^{(l)}(\mathbf{z}^{(l)})$ is the derivative of the activation function at layer $l$.
- $\odot$ denotes element-wise multiplication.
This term, $\Omega^{(l)}$, is the error signal for layer $l$ and is what gets used to compute the gradients of that layer's parameters.
With $\Omega^{(l)}$ computed for each layer, the gradients of the cost with respect to that layer's parameters follow directly:
- Weights: The weight gradient, $\Delta \mathbf{W}^{(l)}$, involves the activations from the preceding layer and the sensitivity of the current layer:
  $$\nabla_{\mathbf{W}}\mathcal{J}^{(l)} = \mathbf{a}^{(l-1)T} \cdot \Omega^{(l)}$$
- Biases: The bias gradient, $\Delta \mathbf{b}^{(l)}$, sums the sensitivities across the batch:
  $$\nabla_{\mathbf{b}}\mathcal{J}^{(l)} = \sum \Omega^{(l)}$$

Parameters are updated in the direction that reduces the loss, modulated by the learning rate.
The parameters of the network, weights $\mathbf{W}$ and biases $\mathbf{b}$, are updated using gradient descent (a one-line code sketch of this update follows the definitions below):
$$\begin{gather} \mathbf{W} \leftarrow \mathbf{W} - \alpha \nabla_{\mathbf{W}} \mathcal{J} \\ \mathbf{b} \leftarrow \mathbf{b} - \alpha \nabla_{\mathbf{b}} \mathcal{J} \end{gather}$$
Where:
- $\alpha$ is the learning rate.
- $\nabla_{\mathbf{W}} \mathcal{J}$ is the gradient of the cost function $\mathcal{J}$ with respect to the weights.
- $\nabla_{\mathbf{b}} \mathcal{J}$ is the gradient of the cost function with respect to the biases.
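In NumPy this update is a single line per parameter. The sketch below assumes the layer stores its gradients in `dweights` and `dbiases` (hypothetical attribute names used only for illustration):

```python
# Gradient-descent step for one layer; alpha is the learning rate
layer.weights -= alpha * layer.dweights
layer.biases -= alpha * layer.dbiases
```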
The accuracy of the network's predictions is calculated as follows:
$$\text{Accuracy (acc)} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$$
Where:
- $N$ is the total number of samples.
- $\hat{y}_i$ is the predicted label for the $i$-th sample.
- $y_i$ is the true label for the $i$-th sample.
- $\mathbb{1}(\cdot)$ is the indicator function, which is 1 when the predicted label matches the true label and 0 otherwise.
Consider a scenario where we have a batch of 5 samples with the following predicted and true labels:
- Predicted labels: $\hat{y} = [1, 2, 1, 0, 2]$
- True labels: $y = [1, 1, 1, 0, 2]$
Using the accuracy formula:
$$\text{Accuracy (acc)} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$$
we calculate the accuracy as:
$$\begin{gather} \text{acc} = \frac{1}{5} \left(\mathbb{1}(1 = 1) + \mathbb{1}(2 = 1) + \mathbb{1}(1 = 1) + \mathbb{1}(0 = 0) + \mathbb{1}(2 = 2)\right) \\ \text{acc} = \frac{1}{5} (1 + 0 + 1 + 1 + 1) = \frac{4}{5} = 0.8 \end{gather}$$
The accuracy for these predictions is 0.8 or 80%, indicating that 4 out of 5 predictions match the true labels.
The cross-entropy loss function is used for classification tasks and is defined as:
$$\mathcal{J} = -\sum_{i=1}^{C} y_i \cdot \log(\hat{y}_i)$$
Where:
- $C$ is the number of classes.
- $y_i$ is the true label, which is 1 for the correct class and 0 otherwise (one-hot encoded).
- $\hat{y}_i$ is the predicted probability for class $i$.
This loss function penalizes incorrect predictions with a higher cost, increasing as the predicted probability diverges from the actual label.
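For instance, with three classes and a one-hot label for class 1, the loss reduces to the negative log of the probability assigned to the correct class. The small epsilon below mirrors the `1e-7` used in the training loop above to avoid `log(0)`:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])    # one-hot label: the correct class is 1
y_pred = np.array([0.1, 0.7, 0.2])    # predicted probabilities
loss = -np.sum(y_true * np.log(y_pred + 1e-7))
print(loss)                           # ~0.357, i.e. -log(0.7)
```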
To assess the model's performance during training, we compute the average loss and accuracy per epoch:
The average loss over an epoch is calculated by aggregating the individual losses and dividing by the total number of batches:
$$\text{epoch\_loss} = \frac{1}{N} \sum_{j=1}^{N} \mathcal{J}_j$$
Where:
- $N$ is the total number of batches in the epoch.
- $\mathcal{J}_j$ is the loss for batch $j$.
The average accuracy over an epoch is the proportion of correct predictions:
$$\text{epoch\_accuracy} = \frac{1}{N} \sum_{k=1}^{N} \mathbb{1}(\hat{y}_k = y_k)$$
Where:
- $N$ is the total number of samples in the epoch.
- $\hat{y}_k$ is the predicted label for the $k$-th sample.
- $y_k$ is the true label for the $k$-th sample.
- $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 when the prediction is correct and 0 otherwise.