Building a Neural Network from Scratch: using ONLY Numpy

Akash Nath


Imagine teaching a machine to recognise handwritten digits without any sophisticated libraries or frameworks. Welcome to the world of pure NumPy, where we will dig into the core of neural networks and build a solid digit-recognition model from scratch. In this blog you'll see what it takes to create a neural network without TensorFlow, PyTorch, or Keras. We'll walk through every step, from setting up the network to optimising its performance, uncovering the inner workings along the way. Ready to see how a few lines of code can turn raw pixel data into meaningful predictions? Buckle up, and let's get straight into building and training the network!

Neural Networks

Neural networks are a class of machine learning algorithms inspired by the structure and operation of the human brain. They are made up of layers of interconnected nodes, or neurones, each of which processes incoming data and passes its output on to the next layer. These layers apply a series of mathematical operations, such as weighted sums and activation functions, to transform the input data into meaningful output.

Figure 1: General Structure of Neural Networks

Today we will be building a neural network from scratch without TensorFlow or PyTorch, using ONLY NumPy. I will be creating a handwritten digit recognition network using the MNIST CSV dataset by Dariel Dato-on on Kaggle; check out the dataset here:

https://www.kaggle.com/datasets/oddrationale/mnist-in-csv

Requirements:

  1. Basic Python
  2. Basic Numpy

And you’re good to go!!!

Building a Neural Network from scratch requires a lot of calculations. But I’ll try to keep the blog as simple as possible.

Step 1. Loading Dataset

The first step towards building a neural network is loading the dataset. In this block we import the MNIST dataset, which contains images of handwritten digits. Every image is a 28x28 pixel grid represented as a flattened 784-element vector. The first column of the CSV file contains the labels (digits 0–9), and the remaining columns contain the pixel values.

import numpy as np
import pandas as pd

def load_data(train_csv, test_csv):
    # Read the training and test CSVs into DataFrames.
    train_data = pd.read_csv(train_csv)
    test_data = pd.read_csv(test_csv)

    # Separate the label column and scale pixel values to the range [0, 1].
    X_train = train_data.drop(columns=['label']).values / 255.0
    y_train = train_data['label'].values

    X_test = test_data.drop(columns=['label']).values / 255.0
    y_test = test_data['label'].values

    return X_train, y_train, X_test, y_test

Step 2. One Hot Encoding Labels

We must convert the labels into a one-hot encoded format because we are solving a classification problem with ten possible output classes (digits 0–9). Each label becomes a 10-element vector with a 1 in the position of the correct class and 0s everywhere else.

def one_hot_encode(y, num_classes):
    # Index into the identity matrix: row y[i] is the one-hot vector for label y[i].
    return np.eye(num_classes)[y]
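For example, with just four classes to keep things short:

print(one_hot_encode(np.array([2, 0]), 4))
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]]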

Step 3. Defining Activation Functions

Activation functions allow a neural network to learn and represent complicated patterns in data by introducing non-linearity. Without them, the network would behave like a linear regression model regardless of the number of layers, which would limit its capacity to handle more difficult tasks.

Let's now take a closer look at the two activation functions we use here: sigmoid and softmax.

Sigmoid

The sigmoid function introduces non-linearity in the hidden layer by squashing each input to a value between 0 and 1. This non-linearity is essential: without it, the network would act like a linear model regardless of the number of layers.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
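The backward pass in Step 4 also relies on a sigmoid_derivative helper that isn't shown anywhere else in this post. Since backward() passes in the sigmoid output a1 rather than the pre-activation z1, a minimal sketch looks like this:

def sigmoid_derivative(a):
    # 'a' is assumed to already be sigmoid(z), so the derivative is a * (1 - a).
    return a * (1 - a)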

Softmax

In the output layer, the softmax function is used to transform raw scores (logits) into probabilities, which is exactly what we need for multi-class classification.

def softmax(z):
    # Subtract the row-wise max before exponentiating for numerical stability.
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)
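As a quick sanity check with some made-up logits, each row of the output is a probability distribution that sums to 1:

logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)
print(probs)        # roughly [[0.659 0.242 0.099]]
print(probs.sum())  # 1.0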

Step 4. Building the Network

At the initialisation stage we set up the network's weights and biases, which is crucial for learning. The weights (W1 and W2) are initialised with small random values to break symmetry, so that different neurones learn different features. The biases (b1 and b2) start at zero and let the model shift each activation as needed during training. The learning rate determines the magnitude of the weight updates; a value of 0.1 makes reasonably large adjustments while limiting the risk of overshooting. Proper initialisation is essential for training to converge.

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.1):
        # Small random weights break symmetry; biases start at zero.
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
        self.learning_rate = learning_rate

Forward Propagation

During the forward propagation phase the neural network processes data to produce predictions. This involves three essential steps: computing linear combinations of the inputs, applying activation functions, and reading off the result.

    # ...continuing the NeuralNetwork class

    def forward(self, X):
        # Hidden layer: linear combination followed by sigmoid.
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = sigmoid(self.z1)
        # Output layer: linear combination followed by softmax.
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = softmax(self.z2)
        return self.a2
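As a quick illustration of the shapes involved (using the sizes we will pick in Step 5: 784 inputs, 64 hidden units, 10 outputs), a hypothetical batch of 32 images flows through like this:

demo_net = NeuralNetwork(input_size=784, hidden_size=64, output_size=10)
X_batch = np.random.rand(32, 784)   # 32 fake "images"
out = demo_net.forward(X_batch)
print(out.shape)        # (32, 10): one probability per class per image
print(out.sum(axis=1))  # each row sums to 1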

Backward Propagation

During the backward propagation phase we calculate the gradients of the loss function with respect to the weights and biases. These gradients are then used to update the parameters, reducing the loss and improving the model's accuracy.

    # ...continuing the NeuralNetwork class

    def backward(self, X, y_true, y_pred):
        m = X.shape[0]

        # Output-layer gradient (softmax + cross-entropy simplifies to y_pred - y_true).
        dz2 = y_pred - y_true
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m

        # Backpropagate through the hidden layer.
        dz1 = np.dot(dz2, self.W2.T) * sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m

        # Gradient descent update.
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2

Training Process

During the training phase the network's weights and biases are adjusted iteratively to reduce the loss function and raise the accuracy of the model. The training function is explained in detail below:

    # ...continuing the NeuralNetwork class

    def train(self, X, y, epochs=1000):
        for epoch in range(epochs):
            # Forward pass, loss computation, then backward pass.
            y_pred = self.forward(X)
            loss = cross_entropy_loss(y, y_pred)
            self.backward(X, y, y_pred)

            # Log the loss every 100 epochs.
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.4f}")

Prediction Method

The prediction method runs new input data through the trained network and returns the class with the highest predicted probability for each sample.

    # ...continuing the NeuralNetwork class

    def predict(self, X):
        # Pick the index of the highest-probability class for each sample.
        y_pred = self.forward(X)
        return np.argmax(y_pred, axis=1)

Step 5. Implementing the Network and Training Process

This section covers the neural network's training procedure and its practical use. Using NumPy, we now put together everything built above, including forward propagation, backpropagation, and the training loop. These steps let the network learn the patterns in the data and improve its performance through iterative updates, resulting in a fully functional model that can classify handwritten digits with good accuracy.

Loading Data

First, we need to load the training and testing data:

X_train, y_train, X_test, y_test = load_data('/kaggle/input/mnist-in-csv/mnist_train.csv', '/kaggle/input/mnist-in-csv/mnist_test.csv')

One Hot Encoding of Labels

The class labels y_train and y_test are transformed into one-hot encoded vectors by the one_hot_encode function.

y_train_encoded = one_hot_encode(y_train, 10)
y_test_encoded = one_hot_encode(y_test, 10)

Initialization of Parameters

input_size = 784    # 28x28 pixels, flattened
hidden_size = 64    # neurones in the hidden layer
output_size = 10    # one class per digit (0-9)
learning_rate = 0.1

Network Initialization and Training

We create an instance of the NeuralNetwork class, train it on the training data, and then evaluate it on the test set:

nn = NeuralNetwork(input_size, hidden_size, output_size, learning_rate)
nn.train(X_train, y_train_encoded, epochs=1000)

y_test_pred = nn.predict(X_test)
test_accuracy = accuracy(y_test_encoded, one_hot_encode(y_test_pred, 10))
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Step 6. Output

After the training process, the loss logged every 100 epochs looked something like this:

Figure 2: Training Output

The model's loss clearly decreases from 2.3044 at epoch 0 to 0.5552 at epoch 900, suggesting that learning and convergence progressed well over time. The steady decline without significant swings points to an efficient and reliable training procedure. The final test accuracy of 87.49% shows that the model generalises to previously unseen data reasonably well, although there is still room for improvement. Since the loss keeps going down slightly towards the end, the model may benefit from additional training or optimisation strategies such as early stopping, learning rate tuning, or regularisation, which could improve both convergence speed and accuracy.

Figure 3: Predicted Outputs

The picture displays the predicted outputs of the handwritten digit recognition model on MNIST test images. For every image, the true label (True) and the predicted label (Pred) are shown. Many predictions (such as True: 1, Pred: 1 and True: 2, Pred: 2) are correct, but there are clear misclassifications: for example, a “9” is mispredicted as a “4”, and a “1” is misread as an “8”. These mistakes suggest that while the model works well overall, it can struggle with digits that share similar features or with noisy inputs, so there is still room to improve its accuracy.

Loss Graph

The training loss graph shows a consistent decline, which indicates that the network is learning and improving over time. There is no obvious sign of overfitting, since the loss keeps going down without levelling off, but we cannot be sure of this without a validation loss curve. At the same time, the fact that the loss has not flattened could be a sign of underfitting, meaning the network has not yet fully captured the patterns in the data.

Figure 4: Loss Graph of the Network
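The post doesn't include the code that produced this graph; a minimal sketch, assuming the per-epoch loss values were collected into a Python list called losses inside train(), could be:

import matplotlib.pyplot as plt

plt.plot(losses)  # 'losses' is a hypothetical list of per-epoch loss values
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.title("Training loss")
plt.show()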

The whole code is available here:

https://www.kaggle.com/code/akashnath29/handwritten-digit-recognition-using-numpy

This blog described how we used NumPy to create and train a neural network from scratch for digit recognition on the MNIST dataset. With a steadily declining loss throughout training and a final test accuracy of 87.49%, the model showed good learning. The loss graph indicated consistent progress without large fluctuations, pointing to a reliable training procedure. The occasional misclassifications, however, highlight areas that could be optimised, such as early stopping or learning rate adjustments. All things considered, this exercise sheds light on the possibilities and difficulties of building neural networks from the ground up and offers guidance on performance tuning and model improvement.
