So I was playing with PyTorch over the weekend, trying to get my fundamentals right by following along with the tutorials in the documentation. For my first exercise, I wanted to implement a shallow neural network with one hidden layer (784-4-10) to classify images from the FashionMNIST dataset. I learned quite a bit about what it takes to get it right.

Model illustration generated by ChatGPT:

Input Layer (784)                Hidden Layer                Output Layer (10)
 ------------------             -----------------            ------------------
| x1  x2  x3  ... x784 |   --> |  h1   h2   h3  ... hn | --> |  y1 y2 y3 ... y10 |
 ------------------             -----------------            ------------------

Abstract Visualization:

   [x1]   [x2]   [x3]    ...   [x784]
      \     |     /    ...       /
       \    |    /    ...      /
        \   |   /    ...     /
         \  |  /    ...    /
        ---------------------
        |   Hidden Layer    |
        |   h1 h2 ... hn    |
        ---------------------
              /  |  \
             /   |   \
            /    |    \
           /     |     \
        ---------------------
        |   Output Layer    |
        |   y1 ... y10      |
        ---------------------

In this post, I’ll be documenting my learnings.

1. You need to be careful about what you’re passing into the nn.CrossEntropyLoss() function.

According to the docs, this function expects logits, because it applies a log-softmax internally. What I passed in, however, was already the softmaxed probability distribution across the 10 classes, so softmax was effectively applied twice. As a result, the model was simply not training (i.e. the loss was not decreasing).

The fix was simple: in the forward method of your network, just return the logits instead of the probability distribution.

Once that was tweaked, training actually had an effect, and I saw a huge bump in accuracy from 19% to 39% by the end of the 5th epoch.
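
To see why the double softmax hurts, here is a minimal pure-Python sketch for a single sample. The `softmax` and `cross_entropy` helpers are hand-rolled for illustration (they are not PyTorch's implementation) and assume the usual definition of the loss: log-softmax followed by the negative log-likelihood of the target class. The example scores are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Softmax followed by negative log-likelihood, for one sample."""
    return -math.log(softmax(scores)[target])

# Made-up example: 10 classes, the network is confident and correct about class 0.
logits = [5.0] + [0.0] * 9
probs = softmax(logits)

loss_on_logits = cross_entropy(logits, target=0)  # what the loss expects
loss_on_probs = cross_entropy(probs, target=0)    # softmax applied twice

# Best case when feeding probabilities: a perfect one-hot prediction.
floor = cross_entropy([1.0] + [0.0] * 9, target=0)

print(f"loss on logits: {loss_on_logits:.3f}")
print(f"loss on probs:  {loss_on_probs:.3f}")
print(f"floor on probs: {floor:.3f}")
```

Under these assumptions, feeding probabilities into the loss means it can never drop below roughly 1.46 with 10 classes, even for a perfect prediction, while a uniform guess sits at ln(10), about 2.30. The loss is squeezed into a narrow band, which would fit with the weak training signal I saw.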

2. 4 nodes in the hidden layer might not be enough to capture important information, but 32 nodes barely improves on 16.

I was wondering if 39% accuracy was the best I could do. I decided to change the number of nodes in the hidden layer from 4 to 16, and the accuracy jumped from 39% to 65%! It’s amazing how just adding more nodes to the hidden layer gave better accuracy.

I then doubled it to 32, but that only gave a marginal (less than 1%) increase in accuracy. So there are diminishing returns on the number of nodes in the hidden layer.
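
For a sense of scale, this small sketch counts the trainable parameters (weights plus biases) for each hidden size I tried. The `count_parameters` helper is a name I made up for illustration; it assumes one fully connected hidden layer with biases, as in the 784-h-10 network above.

```python
def count_parameters(n_inputs, n_hidden, n_outputs):
    """Weights plus biases for a net with one fully connected hidden layer."""
    hidden_layer = n_inputs * n_hidden + n_hidden
    output_layer = n_hidden * n_outputs + n_outputs
    return hidden_layer + output_layer

for n_hidden in (4, 16, 32):
    total = count_parameters(784, n_hidden, 10)
    print(f"784-{n_hidden}-10: {total:,} parameters")
```

Going from 16 to 32 nodes roughly doubles the parameter count (12,730 to 25,450) for less than 1% extra accuracy.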

3. Training for more epochs helps if the model has not converged yet.

I then doubled the number of training epochs from 5 to 10, and the accuracy increased from 66% to 71%! A 5-point jump in accuracy. But the gains are definitely slowing down.

And after 20 epochs, I see an accuracy of 79.6%. So it’s not too bad, the model is still slowly converging.
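
A quick way to put a number on "slowing down" is the average accuracy gain per epoch between the checkpoints reported here. This is only a rough sketch: `gain_per_epoch` is a hypothetical helper, and it assumes the three accuracy figures are comparable with each other.

```python
# (epochs trained, test accuracy in %) as reported in this post
checkpoints = [(5, 66.0), (10, 71.0), (20, 79.6)]

def gain_per_epoch(checkpoints):
    """Average accuracy gain per epoch between consecutive checkpoints."""
    gains = []
    for (e0, a0), (e1, a1) in zip(checkpoints, checkpoints[1:]):
        gains.append((a1 - a0) / (e1 - e0))
    return gains

gains = gain_per_epoch(checkpoints)
for (e0, _), (e1, _), g in zip(checkpoints, checkpoints[1:], gains):
    print(f"epochs {e0}-{e1}: {g:.2f} points per epoch")
```

That works out to about 1.0 point per epoch between epochs 5 and 10, and about 0.86 between epochs 10 and 20: slower, but not yet flat.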

4. Core learning: around 80% accuracy seems achievable for me with a shallow neural network with 16 hidden nodes on this image classification task.

I’ll be using this as the benchmark, then.

5. The training process on my Mac’s MPS backend is surprisingly fast compared to Google Colab

Apparently, this is because the T4 GPU is designed for large-scale workloads (lots of data, big models, large batch sizes), and a shallow neural network like mine doesn’t fully utilize its thousands of CUDA cores. The local MPS backend could therefore avoid a lot of overhead and finish faster.

There is also a difference between latency and throughput. The T4 excels in throughput (handling huge parallel computations), whereas MPS on Apple Silicon excels in latency for small jobs, as it’s tightly integrated with the CPU and unified memory. Therefore, for small networks, the overhead of sending data to the T4 is bigger than the benefit of GPU parallelism.
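
The latency versus throughput argument can be sketched with a toy cost model: time per step is a fixed overhead plus the work divided by throughput. Everything here is an assumption for illustration. `step_time` is a made-up helper, and the numbers are placeholders chosen to show the shape of the trade-off, not measurements of a T4 or of Apple Silicon.

```python
def step_time(fixed_overhead_s, work_units, units_per_s):
    """Toy model: time = fixed per-step overhead + work / throughput."""
    return fixed_overhead_s + work_units / units_per_s

# Placeholder numbers, NOT benchmarks of any real device.
high_throughput = {"fixed_overhead_s": 1e-3, "units_per_s": 1e12}
low_latency = {"fixed_overhead_s": 1e-4, "units_per_s": 1e11}

for label, work in [("small job", 1e6), ("large job", 1e10)]:
    t_ht = step_time(work_units=work, **high_throughput)
    t_ll = step_time(work_units=work, **low_latency)
    winner = "low-latency device" if t_ll < t_ht else "high-throughput device"
    print(f"{label}: {winner} finishes first")
```

With a small job, the fixed overhead dominates and the low-latency device wins; with a large job, raw throughput takes over.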

6. Adding one more hidden layer did not help improve the accuracy

I inserted another hidden layer (16 inputs, 16 outputs), but the accuracy was still at 78% after 20 epochs of training.
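
One hedged observation, as a few lines of arithmetic: the extra layer adds very little capacity relative to the rest of the network. This assumes fully connected layers with biases, and it is only a possible contributing factor, not a full explanation of why the accuracy did not move.

```python
# Parameters (weights + biases) of the 784-16-10 network
base = (784 * 16 + 16) + (16 * 10 + 10)

# Parameters added by one extra 16 -> 16 hidden layer
extra = 16 * 16 + 16

print(f"base network: {base:,} parameters")
print(f"extra layer:  {extra:,} parameters ({100 * extra / base:.1f}% more)")
```

The extra layer adds 272 parameters on top of 12,730, which is about 2% more.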

Quite an interesting session.

Next up, I’ll want to see how this compares to CNNs.

Full Code:

#%%
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor


training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

training_images = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
)
classes = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

#%%
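batch_size = 64
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

# Use an accelerator (e.g. CUDA or MPS) when one is available, else fall back to the CPU.
# Note: the torch.accelerator API only exists in recent PyTorch releases.
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"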

class ShallowNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()    # (N, 1, 28, 28) -> (N, 784)
        self.fc1 = nn.Linear(784, 16)  # input -> hidden layer
        self.fc2 = nn.Linear(16, 10)   # hidden layer -> one logit per class

    def forward(self, x):
        x = self.flatten(x)
        Z1 = self.fc1(x)
        A1 = torch.relu(Z1)
        logits = self.fc2(A1)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so applying softmax here as well would stall training.
        return logits

predictor = ShallowNetwork().to(device)
#%%
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(predictor.parameters(), lr=1e-3)
#%%
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

def test(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

epochs = 20
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, predictor, loss_fn, optimizer)
    test(test_dataloader, predictor, loss_fn)
print("Done!")