I wanted to see the effects of training some simple models on this dataset: a shallow 1-layer neural network, a deep 3-layer neural network, a variant of the deep network with wider layers, and a simple 1-layer CNN.
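
The exact architectures aren’t reproduced here, but as a rough sketch (assuming a PyTorch setup with 28×28 grayscale inputs and 10 classes — both assumptions on my part, not confirmed by the results below), the four models might look like:

```python
import torch
import torch.nn as nn

IN_FEATURES = 784   # assumed 28x28 grayscale input; adjust for the actual dataset
NUM_CLASSES = 10    # assumed number of classes

# Model 1: shallow NN, one hidden layer of 16 units
shallow = nn.Sequential(
    nn.Flatten(),
    nn.Linear(IN_FEATURES, 16), nn.ReLU(),
    nn.Linear(16, NUM_CLASSES),
)

# Model 2: deep 3-layer NN with 8-8-8 hidden dimensions
deep = nn.Sequential(
    nn.Flatten(),
    nn.Linear(IN_FEATURES, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, NUM_CLASSES),
)

# Model 3: wider variant with 16-12-8 hidden dimensions
deep_wide = nn.Sequential(
    nn.Flatten(),
    nn.Linear(IN_FEATURES, 16), nn.ReLU(),
    nn.Linear(16, 12), nn.ReLU(),
    nn.Linear(12, 8), nn.ReLU(),
    nn.Linear(8, NUM_CLASSES),
)

# Model 4: simple CNN with a single conv layer of 8 channels
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, NUM_CLASSES),
)

x = torch.randn(32, 1, 28, 28)  # dummy batch to sanity-check shapes
print(shallow(x).shape, cnn(x).shape)  # both: torch.Size([32, 10])
```

The hidden sizes follow the “Dimensions” column of the results table; everything else (activations, kernel size, pooling) is a plausible guess.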

The results of the models are as follows:

| Model | Dimensions | Accuracy |
| --- | --- | --- |
| Shallow NN | 16 | 78.4% |
| Deep 3-layer NN | 8-8-8 | 79.2% |
| Deep 3-layer NN | 16-12-8 | 80.6% |
| CNN | 8 | 82.6% |

These correspond to models #1-4 on this page: https://models.minimumloss.xyz/

I’ll note my main observations below:

1. There’s no significant difference in accuracy between the four models

All of them reached around 80% accuracy, although the CNN performs somewhat better.

2. The shallow neural network converged to a high level of accuracy faster than the deeper ones:

This is the loss chart for the shallow NN:

Fig 1: Shallow NN

After 20 epochs, it managed to converge to 79% accuracy.

Conversely, if we take a look at the loss chart for the deep NN (model 2):

Fig 2: DeepNN (model 2)

We see that after 20 epochs, we’re still at around 60% accuracy; it would take 80 epochs (4 times the training resources of the shallow network) to reach the same level of performance of around 80%.

3. The deeper networks tend to be a bit more unstable at the start:

Observe fig 2 above: at the start, the accuracy tends to oscillate. Deeper networks tend to suffer from a cold-start problem, where the loss goes up and down before finally converging.

4. We don’t see signs of overfitting in the three fully-connected models

In all three, test loss and train loss don’t diverge, which means these models are generalizing well to unseen data.
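
One simple way to make this check concrete (a sketch of my own, not the code behind the charts) is to flag a sustained divergence: train loss still falling while test loss has stopped improving over a recent window.

```python
def diverges(train_loss, test_loss, window=5, tol=0.05):
    """Flag overfitting: over the last `window` epochs, train loss keeps
    falling while test loss fails to improve by more than `tol`."""
    if len(train_loss) < 2 * window:
        return False  # not enough history to judge
    train_falling = train_loss[-1] < train_loss[-2 * window]
    # test is "stalled" if its recent best never beat the earlier best by tol
    test_stalled = min(test_loss[-window:]) > min(test_loss[:-window]) - tol
    return train_falling and test_stalled

# Healthy run: both curves fall together -> no divergence
train = [1.0, 0.8, 0.6, 0.5, 0.45, 0.4, 0.38, 0.36, 0.35, 0.34]
test  = [1.1, 0.9, 0.7, 0.6, 0.55, 0.5, 0.48, 0.46, 0.45, 0.44]
print(diverges(train, test))  # False
```

The window and tolerance are arbitrary knobs; the point is just that “no overfitting” can be read directly off the two loss curves.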

5. The CNN converges really fast

After the first epoch, the CNN model already reaches 62% accuracy, which means it’s converging and learning very fast. In fact, it reaches 80% accuracy in just 13 epochs. For the remaining epochs, it’s making small adjustments and learning more subtle features to increase accuracy.

Fig 3: SimpleCNN

Compared to the deep NN, the CNN takes fewer than 5 epochs to reach >70% accuracy, whereas the deep NN needs 40 epochs!

6. For the CNN, we see signs of overfitting the longer we train it

Referring to fig 3, towards epochs 35-40 we see a divergence between train and test results. Specifically, train loss keeps decreasing (we’re modelling the training data better and better) with no accompanying improvement on the test data.
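
The usual remedy is early stopping: keep the checkpoint from the epoch with the best test (validation) loss and stop once it hasn’t improved for a patience window. A minimal sketch (my own, not tied to the actual training code):

```python
class EarlyStopper:
    """Stop training once validation loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, val_loss, epoch):
        if val_loss < self.best:
            # new best: remember it and reset the counter
            self.best, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True -> stop training

stopper = EarlyStopper(patience=3)
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63]  # stalls after epoch 2
for epoch, v in enumerate(val_losses):
    if stopper.step(v, epoch):
        print(f"stop at epoch {epoch}, best was epoch {stopper.best_epoch}")
        break
```

For the CNN here, a patience-based stop around epoch 35 would keep the ~82% checkpoint while skipping the wasted (and harmful) later epochs.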