\[\begin{equation} \tag{1} (\mathbf{A B})^{-1}=\mathbf{B}^{-1} \mathbf{A}^{-1} \end{equation}\]

\[\begin{equation} \tag{2} (\mathbf{A B C} \ldots)^{-1}=\ldots \mathbf{C}^{-1} \mathbf{B}^{-1} \mathbf{A}^{-1} \end{equation}\]

\[\begin{equation} \tag{3} \left(\mathbf{A}^{T}\right)^{-1}=\left(\mathbf{A}^{-1}\right)^{T} \end{equation}\]

\[\begin{equation} \tag{4} (\mathbf{A}+\mathbf{B})^{T}=\mathbf{A}^{T}+\mathbf{B}^{T} \end{equation}\]

\[\begin{equation} \tag{5} (\mathbf{A B})^{T}=\mathbf{B}^{T} \mathbf{A}^{T} \end{equation}\]

\[\begin{equation} \tag{6} (\mathbf{A B C} \ldots)^{T}=\ldots \mathbf{C}^{T} \mathbf{B}^{T} \mathbf{A}^{T} \end{equation}\]

Notes

  • Eigen-decomposition of a symmetric matrix: \[S = \sum_{k=1}^N \rho_k u_k u_k^T,\] where \(\rho_k\) are the eigenvalues and \(u_k\) the corresponding orthonormal eigenvectors.
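As a quick sanity check, identities (1), (3) and (5) can be verified numerically in Octave on random matrices (a minimal sketch; the exact errors depend on the random draw, but all should be near machine precision):

% Numerical check of (AB)^{-1} = B^{-1}A^{-1}, (A^T)^{-1} = (A^{-1})^T, (AB)^T = B^T A^T
A = rand(4); B = rand(4);    % random 4x4 matrices are invertible with probability 1
err1 = norm(inv(A*B) - inv(B)*inv(A));   % identity (1)
err3 = norm(inv(A') - inv(A)');          % identity (3)
err5 = norm((A*B)' - B'*A');             % identity (5)
printf('errors: %g %g %g\n', err1, err3, err5);   % all roughly 1e-15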

1. Neural Networks

Feedforward and cost function

The cost function for the neural network (without regularization) is

\[ J(\theta) = \frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K \bigg[ -y_k^{(i)}\log\big((h_{\theta}(x^{(i)}))_k\big) - (1-y_k^{(i)})\log\big(1-(h_{\theta}(x^{(i)}))_k\big) \bigg], \]

where \(h_{\theta}(x^{(i)})\) is computed as shown in Figure 2 and \(K = 10\) is the total number of possible labels. Note that \(h_{\theta}(x^{(i)})_k = a^{(3)}_k\) is the activation (output value) of the \(k\)-th output unit.

Implementation: nnCostFunction.m

% Feedforward: input layer -> hidden layer -> output layer
a1 = [ones(m, 1) X];                        % add bias column to the inputs
z2 = a1*Theta1';
a2 = [ones(size(z2, 1), 1) sigmoid(z2)];    % hidden activations, plus bias
z3 = a2*Theta2';
a3 = sigmoid(z3);                           % output activations h_theta(x)

% Recode the labels 1..num_labels as one-hot row vectors
yd = eye(num_labels);
y = yd(y,:);

% Cross-entropy terms for every example and class, then average
log_dif = -log(a3).*y - log(1-a3).*(1-y);
J = sum(log_dif(:))/m;
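A minimal sketch of calling this (the file names ex4data1.mat and ex4weights.mat and the nnCostFunction signature are the ones the exercise provides; the exercise quotes a cost of about 0.287629 for the given weights):

% Hypothetical driver script, assuming the exercise's data and weight files
load('ex4data1.mat');                  % provides X, y
load('ex4weights.mat');                % provides Theta1, Theta2
nn_params = [Theta1(:); Theta2(:)];    % unroll the parameters into one vector

input_layer_size = 400;    % 20x20 pixel images
hidden_layer_size = 25;
num_labels = 10;

J = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, ...
                   num_labels, X, y, 0);    % lambda = 0: no regularization
fprintf('Cost: %f (expected approx. 0.287629)\n', J);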

Regularized cost function

The cost function for neural networks with regularization is given by

\[ J(\theta) = \frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K \bigg[ -y_k^{(i)}\log\big((h_{\theta}(x^{(i)}))_k\big) - (1-y_k^{(i)})\log\big(1-(h_{\theta}(x^{(i)}))_k\big) \bigg] \]

\[ + \frac{\lambda}{2m}\bigg[\sum_{j=1}^{25}\sum_{k=1}^{400}(\theta_{j,k}^{(1)})^2+\sum_{j=1}^{10}\sum_{k=1}^{25}(\theta_{j,k}^{(2)})^2\bigg], \]

where 400, 25 and 10 are the layer sizes of this particular network. Note that the bias terms (the first column of each \(\Theta\)) are not regularized.

Implementation: nnCostFunction.m

% Feedforward (same as the unregularized case)
a1 = [ones(m, 1) X];
z2 = a1*Theta1';
a2 = [ones(size(z2, 1), 1) sigmoid(z2)];
z3 = a2*Theta2';
a3 = sigmoid(z3);

% One-hot encode the labels
yd = eye(num_labels);
y = yd(y,:);

log_dif = -log(a3).*y - log(1-a3).*(1-y);

% Drop the first column of each Theta: bias terms are not regularized
Theta1s = Theta1(:,2:end);
Theta2s = Theta2(:,2:end);

penalty = lambda/(2*m) * (sum(Theta1s(:).^2) + sum(Theta2s(:).^2));
J = sum(log_dif(:))/m + penalty;
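With regularization switched on, the same hypothetical driver as above can be rerun with \(\lambda = 1\); the exercise quotes a cost of about 0.383770:

% Same inputs as before, but with lambda = 1
J = nnCostFunction(nn_params, input_layer_size, hidden_layer_size, ...
                   num_labels, X, y, 1);
fprintf('Regularized cost: %f (expected approx. 0.383770)\n', J);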

2. Backpropagation

Install Octave

Download Octave from https://ftp.gnu.org/gnu/octave/windows/ and add its folder to the Path environment variable.
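To confirm that Path is set correctly, print the version from a fresh terminal:

octave --version

(The build system below invokes octave-gui, which ships in the same folder.)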

Create build system for Octave file in Sublime Text3

Octave.sublime-build

{
    "cmd": ["octave-gui", "$file"],
    "shell": true   // to show plots
}

Create a shortcut for cancelling a build

Add the following line to Preferences > Key Bindings:

{ "keys": ["ctrl+shift+b"], "command": "exec", "args": {"kill": true} },

The cost function and gradient of regularized logistic regression for multi-class classification are similar to those in Exercise 2.
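For reference, a minimal vectorized sketch of that cost and gradient (the name lrCostFunction and its signature match what oneVsAll.m below passes to fmincg; the bias parameter \(\theta_0\) is excluded from the penalty):

function [J, grad] = lrCostFunction(theta, X, y, lambda)
% Regularized logistic regression cost and gradient (vectorized sketch)
m = length(y);
h = sigmoid(X*theta);              % predictions for all m examples

theta_reg = [0; theta(2:end)];     % zero out the bias term in the penalty
J = (-y'*log(h) - (1-y)'*log(1-h))/m + lambda/(2*m)*sum(theta_reg.^2);
grad = (X'*(h - y))/m + (lambda/m)*theta_reg;
end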

This exercise implements one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the \(K\) classes in our dataset.

oneVsAll.m

function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta 
%corresponds to the classifier for label i
%   [all_theta] = ONEVSALL(X, y, num_labels, lambda) trains num_labels
%   logistic regression classifiers and returns each of these classifiers
%   in a matrix all_theta, where the i-th row of all_theta corresponds 
%   to the classifier for label i

% Some useful variables
m = size(X, 1);
n = size(X, 2);

% You need to return the following variables correctly 
all_theta = zeros(num_labels, n + 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the following code to train num_labels
%               logistic regression classifiers with regularization
%               parameter lambda. 
%
% Hint: theta(:) will return a column vector.
%
% Hint: You can use y == c to obtain a vector of 1's and 0's that tell us 
%       whether the ground truth is true/false for this class.
%
% Note: For this assignment, we recommend using fmincg to optimize the cost
%       function. It is okay to use a for-loop (for c = 1:num_labels) to
%       loop over the different classes.
%
%       fmincg works similarly to fminunc, but is more efficient when we
%       are dealing with large number of parameters.
%
% Example Code for fmincg:
%
%     % Set Initial theta
%     initial_theta = zeros(n + 1, 1);
%     
%     % Set options for fminunc
%     options = optimset('GradObj', 'on', 'MaxIter', 50);
% 
%     % Run fmincg to obtain the optimal theta
%     % This function will return theta and the cost 
%     [theta] = ...
%         fmincg (@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
%                 initial_theta, options);
%

% Train one regularized classifier per class, with (y == i) as binary labels
for i = 1:num_labels
    initial_theta = zeros(n+1, 1);
    options = optimset('GradObj', 'on', 'MaxIter', 50);
    [theta] = fmincg(@(t)(lrCostFunction(t, X, (y == i), lambda)), ...
        initial_theta, options);
    all_theta(i,:) = theta';
end

% =========================================================================

end
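A hypothetical training call, assuming the exercise's digit data in ex3data1.mat (labels 1..10, with 10 standing for the digit 0) and \(\lambda = 0.1\):

load('ex3data1.mat');    % provides X and y
lambda = 0.1;
num_labels = 10;
all_theta = oneVsAll(X, y, num_labels, lambda);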

predictOneVsAll.m

function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels 
%are in the range 1..K, where K = size(all_theta, 1). 
%  p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
%  for each example in the matrix X. Note that X contains the examples in
%  rows. all_theta is a matrix where the i-th row is a trained logistic
%  regression theta vector for the i-th class. You should set p to a vector
%  of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
%  for 4 examples) 

m = size(X, 1);
num_labels = size(all_theta, 1);

% You need to return the following variables correctly 
p = zeros(size(X, 1), 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters (one-vs-all).
%               You should set p to a vector of predictions (from 1 to
%               num_labels).
%
% Hint: This code can be done all vectorized using the max function.
%       In particular, the max function can also return the index of the 
%       max element, for more information see 'help max'. If your examples 
%       are in rows, then, you can use max(A, [], 2) to obtain the max 
%       for each row.
%       

% Class scores for every example; the prediction is the highest-scoring class
all_ps = sigmoid(X*all_theta');
[p_max, i_max] = max(all_ps, [], 2);   % row-wise max over the K columns
p = i_max;

% =========================================================================

end
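A quick accuracy check on the training set (the exercise quotes roughly 94.9% with \(\lambda = 0.1\)):

pred = predictOneVsAll(all_theta, X);
fprintf('Training accuracy: %f\n', mean(double(pred == y)) * 100);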