Everything should be made as simple as possible, but not simpler. (Albert Einstein)

Thursday, February 26, 2015

Handwritten Digits Recognition, Experiment with Octave's Neural Network Package "nnet", and RSNNS

This is a note on implementing handwritten digit recognition, with the neural network trained using the Octave nnet package (or the MATLAB Neural Network Toolbox).

At the end, I play around with R code and the RSNNS library (Stuttgart Neural Network Simulator for R).

GitHub, Octave/MATLAB:
    https://github.com/flyingdisc/handwritten-digits-recognition-octave-nnet
Github, R - RSNNS:
    https://github.com/flyingdisc/handwritten-digits-recognition-RSNNS  
----
No doubt that Coursera's "Machine Learning", taught by Stanford's legendary Prof. Andrew Ng, has the most chic, elegant, and ingenious (video) lectures on the topic. He definitely knows how to teach.

Rumour says the 8th session of the course (2015) will be the last session to offer an SoA/certificate; it will soon be converted to the "On Demand" format without an SoA.

----
One of the interesting topics is the ANN (artificial neural network). This statistical learning algorithm is inspired by biological neural networks (i.e. the brain). There are tons of online articles/tutorials on this topic, as well as various books... =)

Applications of ANNs include both regression and classification (single/multi-class): pattern/image/speech recognition, self-driving cars, and other computer vision applications; trading applications such as gold price, forex, and stock prediction; and even simpler rule-based tasks (e.g. digital systems).

Prof. Ng uses the MNIST handwritten digits database, already packed in a ".mat" file, so we simply need to load it into the Octave/MATLAB workspace. The database file is available in the course, and there are also many copies shared on GitHub by students (which actually breaks the Honour Code =D~).

In this case, the aim of the neural network learning code can be outlined as follows:
1. To build ANN model for handwritten digits recognition,
2. To train the ANN model with the sample data,
3. To validate the model with validation data,
4. To test the trained model with testing handwritten digits (prediction).

The well-trained ANN model should then correctly recognize an arbitrary digit. By comparing the known labels with the model's predictions, we get the training set accuracy, which ranges from around 95% (50 epochs) to 99% (500 epochs) with 5000 sample data points.

The chosen model has 400 neurons (nodes) in the input layer, 25 neurons in the hidden layer, and 10 neurons in the output/target layer.
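
Just for intuition, below is a minimal forward-propagation sketch for that 400-25-10 architecture in Octave/MATLAB. It assumes X is the m x 400 matrix loaded from ex4data1.mat and that Theta1 (25 x 401) and Theta2 (10 x 26) are trained weight matrices that include the bias column, following the naming used in the course exercises; the predicted digit is simply the index of the largest output.

% Minimal forward-propagation sketch for the 400-25-10 model (illustrative only).
% Assumes Theta1 (25x401) and Theta2 (10x26) include the bias column.
sigmoid = @(z) 1 ./ (1 + exp(-z));

m  = size(X, 1);                 % number of examples, X is m x 400
a1 = [ones(m, 1) X];             % add bias unit -> m x 401
a2 = sigmoid(a1 * Theta1');      % hidden layer activations -> m x 25
a2 = [ones(m, 1) a2];            % add bias unit -> m x 26
a3 = sigmoid(a2 * Theta2');      % output layer -> m x 10

[val, pred] = max(a3, [], 2);    % predicted class per row (label 10 stands for digit '0')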

In Prof. Ng's course, we build our own ANN model along with all the code in Octave/MATLAB.
Exercise 3 implements one-vs-all logistic (sigmoid) regression, and Exercise 4 implements the backpropagation algorithm for the ANN.
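
As a rough reminder of what the backpropagation exercise computes, here is a vectorized sketch of the gradients (without regularization), continuing from the forward pass above; Y denotes the m x 10 one-hot target matrix (like y10 in the scripts below). This is only an outline, not the course's exact code.

% Backpropagation sketch (no regularization), continuing from the forward pass:
% a1 is m x 401, a2 is m x 26 (with bias), a3 is m x 10, Y is the m x 10 one-hot target.
delta3 = a3 - Y;                                % output-layer error, m x 10
d2     = delta3 * Theta2;                       % m x 26 (still includes the bias column)
delta2 = d2(:, 2:end) .* a2(:, 2:end) .* (1 - a2(:, 2:end));   % m x 25, sigmoid gradient

Theta1_grad = (delta2' * a1) / m;               % same size as Theta1, 25 x 401
Theta2_grad = (delta3' * a2) / m;               % same size as Theta2, 10 x 26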


------
In this note, I want to write down my experiment with Octave's nnet package (or similarly MATLAB's Neural Network Toolbox).

Octave's nnet has limitations compared to MATLAB's toolbox. We of course understand this, as Octave is open source and supported by the community. As of this writing, Octave nnet has only trainlm as a training function, while MATLAB has rich options including trainlm, traingd, traingdm, traingdx, trainrp, traincgf, and many others.

Also, as of this writing, Octave nnet has only learngdm as the backprop weight/bias learning function, while MATLAB has learncon, learngd, learngdm, learnp, learnpn, and learnwh.
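
For example, in MATLAB (not in Octave's nnet) the training function can simply be swapped in the same newff call used in the script below; this is only an illustrative sketch, with the learning-rate and momentum values chosen arbitrarily.

% MATLAB only (sketch): same network as below, but trained with gradient descent
% with adaptive learning rate and momentum (traingdx) instead of trainlm.
net = newff(R, S, {'logsig', 'purelin'}, 'traingdx', 'learngdm', 'mse');
net.trainParam.lr     = 0.01;   % initial learning rate (arbitrary choice)
net.trainParam.mc     = 0.9;    % momentum constant (arbitrary choice)
net.trainParam.epochs = 500;    % gradient descent usually needs more epochs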

More detail is, of course, in their respective documentation =)

One big problem when training models with big matrices/vectors in Octave/MATLAB is that they are very memory-hungry, and they are much slower when using the package/toolbox compared to our own hand-written code.

For memory and time reasons, in this case I limit the hidden layer to 16 nodes and the training data to only 120 samples.
(Compare this to our course's code, which can handle 25 nodes in the hidden layer and 5000 sample data points, with much faster execution time.)
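
To quantify the cost on your own machine, a quick check with tic/toc and whos (reusing the variable names from the script below) could look like this; it is just a measurement wrapper, not part of the original code.

% Measure training time and matrix memory use (variable names as in the script below).
tic;
net_train = train(net, X_train(1:idx_in,:), ...
    X_train(idx_out:end,:), [], [], VV);
toc;                 % prints the elapsed training time
whos X X_train       % shows how much memory the big matrices occupy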

This is the model illustration (captured from MATLAB): 400 nodes in the input layer, 16 nodes in the hidden layer, 10 nodes in the target layer. The number of nodes can easily be modified to adopt a bigger model by just replacing some constant values in the code.


Code for Octave:

pkg load nnet;

load('ex4data1.mat'); % training data stored in arrays X, y

% hidden neurons, output neurons
hidden_neurons = 16; 
output_neurons = numel(unique(y));   % 10 

epochs = 200;
train_size = 120;

% train, test, validation indices
ntrain = 0;
idx_train = ntrain+1:ntrain+train_size;
ntest = 4000;
idx_test = ntest+1:ntest+train_size;
nvali = 4800;
idx_vali = nvali+1:nvali+train_size;

% classification (instead of regression), so.. 
% convert 1 column to 10 columns target with binary 0/1 values.
y10 = zeros(size(y,1), output_neurons);
y10( sub2ind(size(y10), [1:numel(y)]', y) ) = 1;

% attach target columns, and temporary y
X = [X y10 y];

% randomize order of rows, to blend the training data
X = X( randperm(size(X,1)),: );
% get back re-ordered y
y = X(:,end);
% remove y
X(:,end) = [];
% row based
X = X';

% create train, test, validate data, randomized order
fprintf('\nSUBSETTING train, test, validate data.\n');
fflush(stdout);
%[X_train, X_test, X_vali] = subset(X', output_neurons, 1);
X_train = X(:, idx_train);
X_test = X(:, idx_test);
X_vali = X(:, idx_vali);
y_test = y(idx_test);   % same indices as X_test

% indices of input & output data
idx_in = size(X_train,1) - output_neurons;
idx_out = idx_in + 1;

% feed forward network, 
fprintf('\nCREATE feed forward network...\n');
fflush(stdout);
R = min_max(X_train(1:idx_in,:));    % Rx2
S = [hidden_neurons output_neurons]; 
net = newff(R, S, {'logsig', 'purelin'}, 'trainlm', 'learngdm', 'mse'); 

% create randomized weights for symmetry breaking.
epsilon_init = 0.12;
InW = rand(hidden_neurons, size(R, 1)) * 2 * epsilon_init - epsilon_init;
LaW = rand(output_neurons, hidden_neurons) * 2 * epsilon_init - epsilon_init;

net.IW{1, 1} = InW;
net.LW{2, 1} = LaW;
net.b{1, 1}(:) = 1;
net.b{2, 1}(:) = 1;

net.trainParam.epochs = epochs;

saveMLPStruct(net, "MLPstruct_digit.txt");

## define validation data new, for matlab compatibility
VV.P = X_vali(1:idx_in,:);
VV.T = X_vali(idx_out:end,:);

% train
fprintf('\nTRAIN the network...\n');
fflush(stdout);
net_train = train(net, X_train(1:idx_in,:),...
    X_train(idx_out:end,:), [], [], VV);

% simulation
fprintf('\nSIMULATION...\n');
fflush(stdout);
sim_out = sim(net_train, X_test(1:idx_in,:));

sim_out1 = round(sim_out);

% convert back 10 outputs to 1
[val,idx] = max(sim_out);

fprintf('\nTraining Set Accuracy: %f\n', mean(double(idx' == y_test)) * 100);
fflush(stdout);

Code for MATLAB:

load('ex4data1.mat'); % training data stored in arrays X, y

% hidden neurons, output neurons
hidden_neurons = 16; 
output_neurons = numel(unique(y));   % 10

epochs = 200;
train_size = 80;

% train, test, validation indices
ntrain = 0;
idx_train = ntrain+1:ntrain+train_size;
ntest = 4000;
idx_test = ntest+1:ntest+train_size;
nvali = 4800;
idx_vali = nvali+1:nvali+train_size;

% classification (instead of regression), so.. 
% convert 1 column to 10 columns target with binary 0/1 values.
y10 = zeros(size(y,1), output_neurons);
y10( sub2ind(size(y10), [1:numel(y)]', y) ) = 1;

% attach target columns, and temporary y
X = [X y10 y];

% randomize order of rows, to blend the training data
X = X( randperm(size(X,1)),: );
% get back re-ordered y
y = X(:,end);
% remove y
X(:,end) = [];
% row based
X = X';

% create train, test, validate data, randomized order
fprintf('\nSUBSETTING train, test, validate data.\n');
drawnow('update');
%[X_train, X_test, X_vali] = subset(X', output_neurons, 1);
X_train = X(:, idx_train);
X_test = X(:, idx_test);
X_vali = X(:, idx_vali);
y_test = y(idx_test);   % same indices as X_test

% indices of input & output data
idx_in = size(X_train,1) - output_neurons;
idx_out = idx_in + 1;

% feed forward network, 
fprintf('\nCREATE feed forward network...\n');
drawnow('update');
R = minmax(X_train(1:idx_in,:));    % Rx2
S = [hidden_neurons output_neurons]; 
net = newff(R, S, {'logsig', 'purelin'}, 'trainlm', 'learngdm', 'mse'); 

% create randomized weights for symmetry breaking.
epsilon_init = 0.12;
InW = rand(hidden_neurons, size(R, 1)) * 2 * epsilon_init - epsilon_init;
LaW = rand(output_neurons, hidden_neurons) * 2 * epsilon_init - epsilon_init;

net.IW{1, 1} = InW;
net.LW{2, 1} = LaW;
net.b{1, 1}(:) = 1;
net.b{2, 1}(:) = 1;

net.trainParam.epochs = epochs;

% define validation data new, for matlab compatibility
VV.P = X_vali(1:idx_in,:);
VV.T = X_vali(idx_out:end,:);

% train
fprintf('\nTRAIN the network...\n');
drawnow('update');
[net_train, tr] = train(net, X_train(1:idx_in,:),...
    X_train(idx_out:end,:), [], [], VV);

% simulation
fprintf('\nSIMULATION...\n');
drawnow('update');
sim_out = sim(net_train, X_test(1:idx_in,:));

sim_out1 = round(sim_out);

% convert back 10 outputs to 1
[val,idx] = max(sim_out);

fprintf('\nTraining Set Accuracy: %f\n', mean(double(idx' == y_test)) * 100);
drawnow('update');

  • The data "ex4data1.mat" is available in the course class, and can also be found on GitHub. 
  • To change the number of hidden-layer neurons/nodes, change "hidden_neurons = 16" to a higher or lower value. Just remember that too many nodes may result in overfitting. Prof. Ng chose 25.
  • We should not change "output_neurons"; it is fixed at 10. 
  • The number of epochs can be changed via the "epochs" variable.
  • The training sample size can be changed via the "train_size" variable. Please remember that the total amount of sample data is 5000, while the test and validation subsets start at indices 4000 and 4800 in the code above. So if we want train_size to be larger than 200, we need to change "ntest" and "nvali" accordingly, so that the index ranges don't exceed the data size (see the sketch after this list).
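
As a small safeguard, the three index ranges can be derived from train_size and checked against the total number of samples. This is an illustrative sketch only; it packs the test and validation blocks at the end of the data instead of hard-coding 4000 and 4800.

% Sketch: derive the three index ranges from train_size and make sure
% they stay within the available data (5000 samples in ex4data1.mat).
total_samples = 5000;                      % = size(X, 1) right after loading
train_size    = 400;                       % example of a larger training block

ntrain = 0;                                % training block at the start
nvali  = total_samples - train_size;       % validation block at the very end
ntest  = nvali - train_size;               % test block just before it

idx_train = ntrain + 1 : ntrain + train_size;
idx_test  = ntest  + 1 : ntest  + train_size;
idx_vali  = nvali  + 1 : nvali  + train_size;

% the blocks must not overlap and must not run past the data
assert(3 * train_size <= total_samples, 'train_size is too large');
assert(max(idx_vali) <= total_samples, 'index range exceeds the data');
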
In the code we use randomized initial weights for symmetry breaking. Please note that our course's code uses trained weights.
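
The fixed epsilon_init = 0.12 used in the scripts is the value suggested in the course exercise notes; a common heuristic is to scale the range with the layer sizes, roughly sqrt(6)/sqrt(L_in + L_out). A small sketch of that alternative:

% Layer-size-dependent initialization range, as suggested in the course
% exercise notes (gives roughly 0.12 for a 400-unit input layer).
L_in  = 400;                                  % units feeding into the layer
L_out = hidden_neurons;                       % units in the layer itself
epsilon_init = sqrt(6) / sqrt(L_in + L_out);
InW = rand(L_out, L_in) * 2 * epsilon_init - epsilon_init;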

In newer MATLAB versions, a warning may be thrown complaining that "newff()" uses an obsolete calling format. In that case we need to refer to the latest documentation of the Neural Network Toolbox.
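
For reference, a rough (untested) equivalent on newer MATLAB releases uses patternnet, the classification-oriented feed-forward network. The parameters below just mirror the script above; inputs and targets are placeholder names for the 400 x N input matrix and the 10 x N one-hot target matrix.

% Sketch for newer MATLAB releases: patternnet replaces the old newff call.
% inputs:  400 x N matrix of digit images   (placeholder name)
% targets: 10 x N one-hot matrix of labels  (placeholder name)
net = patternnet(16);                      % one hidden layer with 16 neurons
net.trainParam.epochs      = 200;
net.divideParam.trainRatio = 0.70;         % let the toolbox split the data itself
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;
[net, tr] = train(net, inputs, targets);
[val, pred] = max(net(inputs));            % predicted class per column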

An ANN's cost function is neither convex nor concave, so the converged solution may not be the global minimum; there can be local minima, and the accuracy can vary between runs.
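
Because of these local minima, one cheap mitigation is to train a few networks from different random initial weights and keep the best one. Here is a hedged sketch reusing the variables from the Octave script above; it is not part of the original experiment.

% Sketch: a few random restarts, keeping the network with the best accuracy
% on the test slice. Reuses the variables defined in the Octave script above.
best_acc = -Inf;
for k = 1:5
  net_k = newff(R, S, {'logsig', 'purelin'}, 'trainlm', 'learngdm', 'mse');
  net_k.IW{1,1} = rand(hidden_neurons, size(R,1)) * 2 * epsilon_init - epsilon_init;
  net_k.LW{2,1} = rand(output_neurons, hidden_neurons) * 2 * epsilon_init - epsilon_init;
  net_k.trainParam.epochs = epochs;
  net_k = train(net_k, X_train(1:idx_in,:), X_train(idx_out:end,:), [], [], VV);
  [val, pred] = max(sim(net_k, X_test(1:idx_in,:)));
  acc = mean(double(pred' == y_test)) * 100;
  if acc > best_acc
    best_acc = acc;
    best_net = net_k;     % keep the best-performing network
  end
end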

Result: 
It needs a much larger memory allocation, and a much slower execution (training) time, compared to our course's code.

The training set accuracy is around 63% with the small sample (120) and the small hidden layer (16 neurons).
The course's code yields 74% accuracy with the same small sample (120) and the same 16 hidden neurons.
A larger sample (5000) and more hidden neurons (25) yield 95% for 50 epochs and 99% for 500 epochs.
---

R implementation: 

The R implementation with the RSNNS library is very memory-efficient and fast.
It can handle 5000 samples with 25 hidden neurons using only around 100 MB of memory, in a few seconds of execution time. 

library(RSNNS)
library(R.matlab)

# import matlab handwritten data sample, df is list of matrices
df <- readMat('../mlclass-ex4/ex4data1.mat')

# how many sample? 
#num_sample <- 200
# nodes at hidden layer
hidden_neurons <- 25
# iteration / epochs
epochs <- 50

# shuffle the sample matrices (input & target) by row
dfx <- transform(df$X, y=df$y)
dfx <- dfx[ sample(1:nrow(dfx), nrow(dfx)), ]
# get back y as target
dfy <- matrix(dfx$y, ncol=1)
# remove y from X
dfx$y <- NULL

# decode 1 column y to 10 columns
dfy <- decodeClassLabels(dfy)

# split sample for training and test (15%)
df <- splitForTrainingAndTest(dfx, dfy, ratio=0.15)

# normalize (no need in this case, we have uniform sample)
# df <- normTrainingAndTestSet(df)

# remove unused vars from environment
remove(dfx)
remove(dfy)

# create & train feed-forward neural network, 
#    the mlp (multilayer perceptrons)
# learnFunc: Std_Backpropagation, BackpropBatch, BackpropChunk, 
#    BackpropMomentum, BackpropWeightDecay, Rprop, Quickprop, SCG
model <- mlp(x=df$inputsTrain, y=df$targetsTrain, 
             size=hidden_neurons, maxit=epochs, 
             initFunc="Randomize_Weights", 
             initFuncParams=c(-0.3, 0.3),
             learnFunc="Std_Backpropagation", 
             learnFuncParams=c(0.2, 0),
             updateFunc="Topological_Order", 
             updateFuncParams=c(0),
             hiddenActFunc="Act_Logistic", 
             shufflePatterns=TRUE, linOut=FALSE,
             inputsTest = df$inputsTest, targetsTest = df$targetsTest)
             
par(mfrow=c(2,2))
png(filename="plotIterativeError.png", height=480, width=480)
plotIterativeError(model)
dev.off()

# prediction
predictions <- predict(model, df$inputsTest)

# decode 10-columns-numeric ~ 1-column ~ 10-columns-binary
pred_output <- max.col(predictions, "last")
pred_output <- decodeClassLabels(pred_output)
print( paste("Training Set Accuracy: ", 
       mean(as.numeric(pred_output == df$targetsTest)) * 100) )
# [1] "Training Set Accuracy:  98.4"
# [1] "Training Set Accuracy:  99.0666666666667"

png(filename="plotRegressionError.png", height=480, width=480)
plotRegressionError(df$targetsTest[,2], predictions[,2])
legend(x="bottomright", legend=c("optimal", "linear fit"), 
       col=c("black", "red"), lwd=c(2, 2))
dev.off()

confusionMatrix(df$targetsTrain, fitted.values(model))

confusionMatrix(df$targetsTest, predictions)

png(filename="plotROCCurveTrain.png", height=480, width=480)
plotROC(fitted.values(model)[,2], df$targetsTrain[,2])
dev.off()

png(filename="plotROCCurveTest.png", height=480, width=480)
plotROC(predictions[,2], df$targetsTest[,2])
dev.off()

Result: 
[1] "Training Set Accuracy:  98.4"
[1] "Training Set Accuracy:  99.0666666666667"

Very good accuracy of 98 - 99%.

Confusion matrix for the test sample:
       predictions
targets  1  2  3  4  5  6  7  8  9 10
     1  75  0  1  0  0  0  0  0  0  0
     2   1 71  3  1  3  1  1  3  0  0
     3   0  2 78  0  2  0  1  2  0  0
     4   2  0  1 73  0  0  0  0  1  0
     5   0  1  2  0 53  0  0  1  0  1
     6   1  4  0  0  2 62  0  0  0  2
     7   2  0  0  0  0  0 68  0  0  1
     8   1  2  1  0  3  1  0 62  2  1
     9   0  0  0  1  0  0  1  1 76  0
     10  0  0  0  0  1  0  0  0  0 75


We expect a matrix where only the diagonal has non-zero values; the off-diagonal values show the misclassifications.
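
For instance, the overall accuracy and per-class recall can be read directly off such a confusion matrix. In Octave/MATLAB notation (kept for consistency with the earlier code), with C standing for the matrix above:

% C: confusion matrix with rows = true classes, columns = predictions.
accuracy = sum(diag(C)) / sum(C(:)) * 100;   % share of samples on the diagonal
recall   = diag(C) ./ sum(C, 2);             % per class: correct / total true count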

[Figure: plot of iterative error]
[Figure: plot of regression error]
(I use RStudio.)
----

See also some other implementations of handwritten digit ANNs in R and Python:

----

Comparison of neural network simulators: 
https://grey.colorado.edu/emergent/index.php/Comparison_of_Neural_Network_Simulators

 ----