Building a Neural Network from Scratch: using ONLY Numpy
Imagine teaching a machine to recognise handwritten digits without relying on sophisticated libraries or frameworks. Welcome to the world of pure NumPy, where we will dig into the core of neural networks and build a digit-recognition model from scratch. In this blog we create a neural network without TensorFlow, PyTorch, or Keras. We’ll walk through each step, from setting up the network to training and evaluating it, revealing the inner workings along the way. Ready to see how a few lines of code can turn raw pixel data into useful predictions? Fasten your seatbelt and let’s start building and training a neural network!
Neural Networks
Neural networks are a class of machine learning algorithms inspired by the structure and operation of the human brain. They are made up of layers of interconnected nodes, or neurones, each of which processes incoming data and passes its output to the next layer. These layers apply mathematical operations, such as weighted sums and activation functions, to convert the input data into meaningful output.
Today we will build a neural network from scratch without using the TensorFlow or PyTorch libraries; the network itself uses ONLY NumPy (pandas appears once, purely to read the CSV files). I will be creating a handwritten digit recognition network using the MNIST in CSV dataset by Dariel Dato-on on Kaggle. Check the dataset out:
https://www.kaggle.com/datasets/oddrationale/mnist-in-csv
Requirements:
- Basic Python
- Basic Numpy
And you’re good to go!!!
Building a Neural Network from scratch requires a lot of calculations. But I’ll try to keep the blog as simple as possible.
Step 1. Loading Dataset
The first step towards building a neural network is loading the dataset. In this block we load the MNIST dataset, which contains images of handwritten digits. Every image is a 28x28 pixel grid flattened into a 784-element vector. The first column of the CSV file contains the labels (digits 0–9), and the remaining columns contain the pixel values, which we divide by 255 to scale them into the range 0 to 1.
import numpy as np
import pandas as pd

def load_data(train_csv, test_csv):
    # Read the CSV files; the 'label' column holds the digit (0-9).
    train_data = pd.read_csv(train_csv)
    test_data = pd.read_csv(test_csv)
    # Scale pixel values from 0-255 down to 0-1.
    X_train = train_data.drop(columns=['label']).values / 255.0
    y_train = train_data['label'].values
    X_test = test_data.drop(columns=['label']).values / 255.0
    y_test = test_data['label'].values
    return X_train, y_train, X_test, y_test
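Since the goal of this blog is to stay close to pure NumPy, the loading step can also be done without pandas. The following is a minimal sketch, not the code used in this post: `load_csv_numpy` is a hypothetical helper, and it assumes the file has a single header row with the label in the first column, as described above. `np.loadtxt` may be noticeably slower than pandas on the full MNIST files.

```python
import io
import numpy as np

def load_csv_numpy(source):
    """Load an MNIST-style CSV using only NumPy (hypothetical helper)."""
    # skiprows=1 skips the header row; every remaining value is numeric.
    data = np.loadtxt(source, delimiter=",", skiprows=1)
    y = data[:, 0].astype(int)   # first column: digit labels
    X = data[:, 1:] / 255.0      # remaining columns: pixels scaled to 0-1
    return X, y

# Tiny in-memory stand-in for a real CSV file (3 "pixels" per image).
demo = io.StringIO("label,p1,p2,p3\n7,0,128,255\n2,255,0,0\n")
X_demo, y_demo = load_csv_numpy(demo)
print(X_demo.shape, y_demo)
```

With the real dataset you would pass the path to the CSV file instead of the in-memory demo text.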
Step 2. One Hot Encoding Labels
We must convert the labels into a one-hot encoded format because we are solving a classification problem with ten possible output classes (digits 0–9). Each label is therefore represented as a vector with 10 elements, with the correct class marked as 1 and the remaining classes as 0.
def one_hot_encode(y, num_classes):
    return np.eye(num_classes)[y]
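To see what this one-liner does, here is a small illustrative example (the labels are made up). `np.eye(num_classes)` builds an identity matrix, and indexing it with the label array picks out one row per label, so row `i` of the result has a 1 in the column given by `y[i]`.

```python
import numpy as np

def one_hot_encode(y, num_classes):
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes)[y]

labels = np.array([3, 0, 9])           # three example digit labels
encoded = one_hot_encode(labels, 10)

print(encoded.shape)   # one row per label, one column per class
print(encoded[0])      # one-hot vector for the digit 3
```

Note that the labels must be integers for the indexing to work.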
Step 3. Defining Activation Functions
Activation functions introduce non-linearity, which lets a neural network learn and represent complicated patterns in data. Without them, the network would behave like a single linear model regardless of the number of layers, which would restrict its capacity to handle more difficult tasks.
Let’s now look more closely at the two activation functions we use, Sigmoid and Softmax.
Sigmoid
The sigmoid function is applied in the hidden layer. It squashes each input into the range (0, 1) and supplies the non-linearity described above.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
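The backward pass later in this post calls a `sigmoid_derivative` helper that is not shown here. A plausible sketch is given below. It assumes the helper receives the sigmoid output (the activation), not the raw input `z`, because that is what the backward pass passes in (`self.a1`). The version in the full notebook may differ.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # Assumption: `a` is already sigmoid(z), so d(sigmoid)/dz = a * (1 - a).
    return a * (1 - a)

z = np.array([-2.0, 0.0, 3.0])
a = sigmoid(z)
print(sigmoid_derivative(a))  # largest at z = 0, where the slope is 0.25
```

Passing the activation avoids recomputing the exponential during backpropagation.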
Softmax
In the output layer, the softmax function is employed to transform raw scores (logits) into probabilities. This is especially helpful when classifying multiple classes.
def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)
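The subtraction of the row maximum inside `softmax` is there for numerical stability: it does not change the result, but it keeps `np.exp` from overflowing on large logits. The short check below illustrates this with made-up logits; the two rows differ only by a constant shift, so they should produce the same probabilities.

```python
import numpy as np

def softmax(z):
    # Shifting each row by its maximum leaves the output unchanged
    # but keeps the exponentials in a safe numeric range.
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0],   # would overflow without the shift
                   [1.0, 2.0, 3.0]])           # same logits, shifted by 999
probs = softmax(logits)

print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

Without the shift, `np.exp(1000.0)` overflows to infinity and the first row would come out as NaN.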
Step 4. Building the Network
We set up the network’s weights and biases at the initialisation stage, which matters for learning. The weights (W1 and W2) are initialised with small random values to break symmetry, so that different hidden units learn different features. The biases (b1 and b2) start at zero; they need no randomisation because the random weights already break the symmetry. The learning rate sets the size of each weight update; a value of 0.1 gives fairly large steps, which speeds up learning at the risk of overshooting. Sensible initialisation helps training converge.
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.1):
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
        self.learning_rate = learning_rate
Forward Propagation
During forward propagation the network processes the input data to produce predictions. The process has three stages: computing linear combinations, applying activation functions, and returning the output probabilities.
# Previous code
    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = softmax(self.z2)
        return self.a2
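Written in matrix notation, the forward pass above corresponds to the equations below. This is only a restatement of the code: $X$ is the input batch with one flattened image per row (shape $m \times 784$ for $m$ images), $\sigma$ is the sigmoid, and $Z_1, A_1, Z_2, A_2$ mirror `self.z1`, `self.a1`, `self.z2`, and `self.a2`.

```latex
\begin{aligned}
Z_1 &= X W_1 + b_1, & A_1 &= \sigma(Z_1), \\
Z_2 &= A_1 W_2 + b_2, & A_2 &= \operatorname{softmax}(Z_2).
\end{aligned}
```

With a hidden size of $h$, the shapes are $W_1: 784 \times h$, $b_1: 1 \times h$, $W_2: h \times 10$, and $b_2: 1 \times 10$; the biases are broadcast across the $m$ rows, and each row of $A_2$ holds the ten class probabilities for one image.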
Backward Propagation
During backward propagation we calculate the gradients of the loss function with respect to the weights and biases. Using these gradients, the parameters are updated to reduce the loss and improve the accuracy of the model.
# Previous Code
    def backward(self, X, y_true, y_pred):
        m = X.shape[0]
        dz2 = y_pred - y_true
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        dz1 = np.dot(dz2, self.W2.T) * sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2
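The gradient expressions in `backward` can be written out as follows. This sketch assumes the loss is the mean cross-entropy over the $m$ samples in the batch and that `sigmoid_derivative` receives the activation $A_1$; neither helper is shown in this post, so treat both as assumptions. $Y$ is the one-hot label matrix, $A_2$ the softmax output, $\odot$ element-wise multiplication, and $\eta$ the learning rate.

```latex
\begin{aligned}
L &= -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{10} Y_{ik} \log (A_2)_{ik} \\
\delta_2 &= A_2 - Y, &
\frac{\partial L}{\partial W_2} &= \frac{1}{m} A_1^{\top} \delta_2, &
\frac{\partial L}{\partial b_2} &= \frac{1}{m} \sum_{i=1}^{m} (\delta_2)_{i} \\
\delta_1 &= \left( \delta_2 W_2^{\top} \right) \odot A_1 \odot (1 - A_1), &
\frac{\partial L}{\partial W_1} &= \frac{1}{m} X^{\top} \delta_1, &
\frac{\partial L}{\partial b_1} &= \frac{1}{m} \sum_{i=1}^{m} (\delta_1)_{i} \\
W &\leftarrow W - \eta \, \frac{\partial L}{\partial W}, &
b &\leftarrow b - \eta \, \frac{\partial L}{\partial b}
\end{aligned}
```

Here $\delta_2$ and $\delta_1$ correspond to `dz2` and `dz1`, and $(\delta)_i$ denotes row $i$. The simple form $\delta_2 = A_2 - Y$ is what you get when softmax is combined with cross-entropy, which is why the code never needs the softmax derivative on its own.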
Training Process
During training, the network’s weights and biases are adjusted iteratively to reduce the loss and improve the accuracy of the model. Each epoch runs a forward pass over the whole training set, computes the cross-entropy loss, and performs one backward pass; the loss is printed every 100 epochs.
# Previous Code
    def train(self, X, y, epochs=1000):
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = cross_entropy_loss(y, y_pred)
            self.backward(X, y, y_pred)
            if epoch % 100 == 0:
                print(f"Epoch {epoch}, Loss: {loss:.4f}")
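The training loop calls `cross_entropy_loss`, which is not shown in this post. A minimal sketch is given below. It assumes one-hot labels, predicted probabilities with one row per sample, and averaging over the batch; the `eps` clipping value is my own addition to avoid `log(0)`, and the version in the full notebook may differ.

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy (sketch; eps is an assumed safeguard)."""
    # Clip so that log never sees an exact zero probability.
    y_pred = np.clip(y_pred, eps, 1.0)
    # Only the probability assigned to the true class contributes per sample.
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.eye(3)[[0, 2]]                      # two samples, classes 0 and 2
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
print(cross_entropy_loss(y_true, y_pred))       # average of -log(0.7) and -log(0.6)
```

A confident correct prediction drives the loss towards 0, while a confident wrong one makes it large.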
Prediction Method
The prediction method runs a forward pass on new input data and returns, for each sample, the class with the highest predicted probability.
    def predict(self, X):
        y_pred = self.forward(X)
        return np.argmax(y_pred, axis=1)
Step 5. Implementing the Network and Training Process
This section puts the pieces together and trains the network on real data. We load the data, encode the labels, set the hyperparameters, train the network with the forward propagation, backpropagation, and training loop defined above, and then measure its accuracy on the test set.
Loading Data
First, we need to load the training and testing data.
X_train, y_train, X_test, y_test = load_data('/kaggle/input/mnist-in-csv/mnist_train.csv', '/kaggle/input/mnist-in-csv/mnist_test.csv')
One Hot Encoding of Labels
The class labels y_train and y_test are transformed into one-hot encoded vectors by the one_hot_encode function.
y_train_encoded = one_hot_encode(y_train, 10)
y_test_encoded = one_hot_encode(y_test, 10)
Initialization of Parameters
input_size = 784
hidden_size = 64
output_size = 10
learning_rate = 0.1
Network Initialization and Training
We create an instance of the ‘NeuralNetwork’ class, train it on the training data, and then evaluate its accuracy on the test set.
nn = NeuralNetwork(input_size, hidden_size, output_size, learning_rate)
nn.train(X_train, y_train_encoded, epochs=1000)
y_test_pred = nn.predict(X_test)
test_accuracy = accuracy(y_test_encoded, one_hot_encode(y_test_pred, 10))
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
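The evaluation code calls an `accuracy` helper that is not shown in this post. Since it is called with two one-hot encoded arrays, a plausible sketch looks like this. The function body is an assumption on my part; the version in the full notebook may differ.

```python
import numpy as np

def accuracy(y_true_encoded, y_pred_encoded):
    """Fraction of samples whose predicted class matches the true class."""
    # Convert one-hot rows back to class indices before comparing.
    true_classes = np.argmax(y_true_encoded, axis=1)
    pred_classes = np.argmax(y_pred_encoded, axis=1)
    return np.mean(true_classes == pred_classes)

y_true_demo = np.eye(3)[[0, 1, 2, 1]]   # true classes
y_pred_demo = np.eye(3)[[0, 2, 2, 1]]   # one mistake in the second sample
print(accuracy(y_true_demo, y_pred_demo))  # 3 of 4 correct
```

Comparing `y_test` with `nn.predict(X_test)` directly would give the same number without the extra encoding step.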
Step 6. Output
After training, the loss printed every 100 epochs looked something like this:
The model’s loss decreases from 2.3044 at epoch 0 to 0.5552 at epoch 900, indicating that the network is learning. The loss falls steadily without significant swings, which suggests a stable training procedure. The final test accuracy of 87.49% shows that the model generalises reasonably well to previously unseen data, although there is still room for improvement. Because the loss is still going down towards the end, the model may benefit from additional training or from optimisation strategies such as learning rate tuning, regularisation, or early stopping. These could improve the model’s convergence speed and accuracy.
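The starting loss of about 2.30 is not a coincidence. With weights this close to zero, the softmax output is almost uniform, so every class gets a probability of roughly 1/10, and the cross-entropy of a uniform guess over ten classes is ln(10). The quick check below shows the arithmetic; it is a sanity check, not part of the training code.

```python
import numpy as np

num_classes = 10
# Cross-entropy when every class is assigned probability 1/num_classes.
uniform_loss = -np.log(1.0 / num_classes)
print(f"{uniform_loss:.4f}")  # 2.3026
```

A first-epoch loss far from this value would usually point to a problem in the initialisation or in the loss function.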
The picture shows the model’s predictions for a sample of handwritten digits. The true label (True) and the predicted label (Pred) are shown for every image. Many predictions (such as True: 1, Pred: 1 and True: 2, Pred: 2) are correct, but there are clear misclassifications. For example, a “9” is mispredicted as a “4”, and a “1” is mistaken for an “8”. These mistakes suggest that while the model works well overall, it may have trouble with some digits because of similar shapes or noisy inputs, so there is still room to improve its accuracy.
Loss Graph
The training loss graph shows a consistent decline, so the network is learning and improving over time. Because the loss keeps going down without levelling off, there is no obvious evidence of overfitting, but we can’t be sure of this without a validation loss curve. The fact that the loss has not yet flattened suggests that the network is still underfitting, that is, it has not yet fully captured the patterns in the data and would likely benefit from more training.
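To get the validation loss curve mentioned above, part of the training data has to be held out. The sketch below shows one way to do that in NumPy. `train_val_split`, the 10% fraction, and the seed are all illustrative choices of mine and are not part of the original code.

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction for validation (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])          # shuffled row indices
    n_val = int(X.shape[0] * val_fraction)     # size of the held-out set
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Small synthetic demo: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, y_tr, X_val, y_val = train_val_split(X_demo, y_demo, val_fraction=0.2)
print(X_tr.shape, X_val.shape)
```

One could then train on the larger part only and, every 100 epochs, compute the loss on the held-out part as well. A validation loss that starts rising while the training loss keeps falling would be the sign of overfitting to look for.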
The whole code is available here:
https://www.kaggle.com/code/akashnath29/handwritten-digit-recognition-using-numpy
This blog described how we used NumPy to create and train a neural network from scratch for digit recognition on the MNIST dataset. With 87.49% test accuracy at the end and a steadily declining loss across training, the model learned well. The loss graph showed consistent progress without large fluctuations, pointing to a stable training procedure. The occasional misclassifications, however, point to areas that could be improved, for example by tuning the learning rate or exploring early stopping. All things considered, this exercise sheds light on the possibilities and difficulties of building neural networks from the ground up and offers some guidance on performance tuning and model improvement.