Everything should be made as simple as possible, but not simpler. (Albert Einstein)

Wednesday, March 4, 2015

Finding Good Lambda for Handwritten Digits Recognition (Neural Network) with Cross Validation Set

Finding a good lambda ( λ ) for regularization in a machine learning model is important to avoid under-fitting (high bias) or over-fitting (high variance).

If lambda is too large, all theta ( θ ) values are penalized heavily and pushed toward zero, so the hypothesis ( h ) becomes overly simple, nearly constant. (High bias, under-fitting.)
If lambda is too small, there is effectively almost no regularization, so the model can fit noise in the training data. (High variance, over-fitting.)
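
For intuition, here is roughly where lambda enters a regularized cost. This is only a minimal sketch using a linear hypothesis with squared error (not the neural network's nnCostFunction), just to show that the penalty term scales with lambda and skips the bias parameter:

% Minimal sketch, not the course's nnCostFunction: regularized squared-error
% cost for a linear hypothesis h = X * theta; the bias theta(1) is not penalized.
function J = regularizedCost(theta, X, y, lambda)
    m = length(y);
    h = X * theta;                                            % hypothesis
    data_term = (1 / (2 * m)) * sum((h - y) .^ 2);            % fit to the data
    reg_term  = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);  % regularization penalty
    J = data_term + reg_term;
endfunction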

The cross-validation set can be used to select a good lambda, based on a plot of error versus lambda for both the training data and the validation data.

In this case, the data is split into two parts, training data and validation data (or into three parts, with test data as well).
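
As an illustration, a three-way split could look like the sketch below (a hypothetical 60/20/20 ratio of my own; the code further down only does an 80/20 train/validation split). X and y are assumed to be the already-loaded data matrix and label vector:

% Sketch: 60/20/20 train/validation/test split on already-loaded X, y
m = size(X, 1);
idx = randperm(m);                              % shuffle row indices once
n_train = floor(0.6 * m);
n_val   = floor(0.2 * m);
X_train = X(idx(1:n_train), :);
X_val   = X(idx(n_train+1:n_train+n_val), :);
X_test  = X(idx(n_train+n_val+1:end), :);
y_train = y(idx(1:n_train));
y_val   = y(idx(n_train+1:n_train+n_val));
y_test  = y(idx(n_train+n_val+1:end));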

This is an example of such plots for an adaptation of the MNIST handwritten digits database, with 5000 samples and images scaled down to 20x20 pixels. The model has 1 hidden layer of 25 hidden units, with 10 labels at the output (see Stanford's Machine Learning course by Prof. Andrew Ng).
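
As a quick sanity check of those dimensions (my own back-of-the-envelope count, not from the assignment text), the number of parameters implied by this architecture is:

% Dimensions implied by the architecture above (bias units included)
input_layer_size  = 400;                                    % 20x20 pixels
hidden_layer_size = 25;
num_labels        = 10;
numel_Theta1 = hidden_layer_size * (input_layer_size + 1);  % 25 x 401 = 10025
numel_Theta2 = num_labels * (hidden_layer_size + 1);        % 10 x 26  = 260
total_params = numel_Theta1 + numel_Theta2;                 % 10285 parameters in total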

The code runs k=10 repetitions of the cross-validation procedure, each over the same vector of lambda values. In each repetition the model is trained with every value of lambda, and the training error and cross-validation error are then computed by calling the cost function with lambda=0.
The mean over the 10 repetitions is what gets plotted.

lambda_vec = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 2 3 4 5 6 7 8 9 10]';

Zoomed in,
lambda_vec = [0 0.001 0.004 0.007 0.01 0.03 0.05 0.07 0.1 0.15 0.2 0.25...
              0.3 0.35 0.4 0.45 0.5 0.6 0.7 0.8 0.9 1 2 3]';

A good lambda value for a good fit is around 0.3, which is where the cross-validation curve reaches its local minimum.
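
To pick that value programmatically instead of reading it off the plot, one can simply take the minimum of the mean validation error; this is a small addition of mine that uses the lambda_vec and mean_error_val variables computed in the code below:

% Pick the lambda with the lowest mean cross-validation error
[min_val_error, idx] = min(mean_error_val);
best_lambda = lambda_vec(idx);
fprintf('Best lambda: %g (validation error %g)\n', best_lambda, min_val_error);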

Octave code:

%% ./04/mlclass-ex4-008/mlclass-ex4/validationCurve.m
% number of repetitions of the cross-validation procedure
num_rep = 10;

% Selected values of lambda
%lambda_vec = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 2 3 4 5 6 7 8 9 10]';
lambda_vec = [0 0.001 0.004 0.007 0.01 0.03 0.05 0.07 0.1 0.15 0.2 0.25...
              0.3 0.35 0.4 0.45 0.5 0.6 0.7 0.8 0.9 1 2 3]';

all_error_train = zeros(length(lambda_vec), num_rep);
all_error_val = zeros(length(lambda_vec), num_rep);

for k = 1:num_rep
    fprintf('--- REP: %d ---\n', k);

    % (re)load the data (arrays X, y) each repetition, since X is shuffled and modified below
    load('ex4data1.mat'); 
    m = size(X, 1);

    % Load initial weights into variables Theta1 and Theta2 
    load('ex4weights.mat');
    % Unroll parameters 
    initial_nn_params = [Theta1(:) ; Theta2(:)];

    input_layer_size  = 400;  % 20x20 Input Images of Digits
    hidden_layer_size = 25;   % 25 hidden units
    output_neurons = numel(unique(y));   % 10 labels 

    % attach the target/output column y, so rows stay aligned while shuffling
    X = [X y];

    % randomize the order of the rows, to shuffle the data
    X = X(randperm(size(X,1)), :);

    % get back re-ordered y
    y = X(:, end);
    % remove y
    X(:, end) = [];

    % split into train (80%), validation (20%) 
    idx_mid = floor(0.8 * m);   % floor keeps the split index an integer
    X_train = X(1:idx_mid, :);
    y_train = y(1:idx_mid, :);
    X_vali = X(idx_mid+1:end, :);
    y_vali = y(idx_mid+1:end, :);

    error_train = zeros(length(lambda_vec), 1);
    error_val = zeros(length(lambda_vec), 1);

    % suppress the warning (alternatively, change all | and & in fmincg to || and &&)
    warning('off', 'Octave:possible-matlab-short-circuit-operator');

    for i = 1:length(lambda_vec)
        % Create "short hand" for the cost function to be minimized
        costFunction = @(p) nnCostFunction(p, ...
                                       input_layer_size, ...
                                       hidden_layer_size, ...
                                       output_neurons, X_train, y_train, lambda_vec(i));

        options = optimset('MaxIter', 50);   % around 95% accuracy
                                       
        % Now, costFunction is a function that takes in only one argument (the
        % neural network parameters)
        [nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

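        % Note: the errors are evaluated with lambda = 0, so the regularization
        % term does not inflate the reported training/validation error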
        [error_train(i), grad_train] =...
            nnCostFunction(nn_params, input_layer_size, hidden_layer_size,...
            output_neurons, X_train, y_train, 0); 
        [error_val(i), grad_val] =...
            nnCostFunction(nn_params, input_layer_size, hidden_layer_size,...
            output_neurons, X_vali, y_vali, 0); 
    endfor

    all_error_train(:, k) = error_train;
    all_error_val(:, k) = error_val;
endfor

mean_error_train = mean(all_error_train, 2);
mean_error_val = mean(all_error_val, 2);

plot(lambda_vec, mean_error_train, lambda_vec, mean_error_val);
legend('Train', 'Cross Validation');
xlabel('lambda');
ylabel('Error');

The nnCostFunction is the cost function of the neural network model (part of the week 5 assignment, Ex4); fmincg is an adaptation of Octave's fminunc (it solves an unconstrained optimization problem by finding a local minimum).
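
For comparison (an untested sketch of mine, not part of the original assignment), roughly the same minimization with the built-in fminunc should look like this, assuming the GradObj option is set so that the gradient returned by costFunction is used:

% Untested sketch: the equivalent call with the built-in fminunc,
% telling it that costFunction also returns the gradient
options = optimset('MaxIter', 50, 'GradObj', 'on');
[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);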
 
Basically, the same cross-validation principle can also be applied to other models, such as linear or logistic regression.
-------