In the paper, the authors use Bayes’ rule to show that everything the first of two tasks contributes to learning is contained in the posterior distribution of the model parameters given the first dataset. This is important because it means we can estimate that posterior to get a sense of which model parameters were most important for that first task.
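Concretely, if the data $\mathcal{D}$ splits into two independent parts $\mathcal{D}_A$ and $\mathcal{D}_B$, one per task, Bayes’ rule rearranges to

$$\log p(\theta \mid \mathcal{D}) = \log p(\mathcal{D}_B \mid \theta) + \log p(\theta \mid \mathcal{D}_A) - \log p(\mathcal{D}_B),$$

so everything the first dataset has to say about the parameters $\theta$ sits inside the posterior $p(\theta \mid \mathcal{D}_A)$.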
In this paper, they perform that estimation using a multivariate Gaussian distribution. The means are the values of the model parameters after training on the first dataset, and the precisions (inverses of the variances) are the diagonal elements of the Fisher information matrix.
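Written out (with $\theta^*_A$ the parameter values after training on the first dataset and $F_i$ the $i$-th diagonal element of the Fisher information matrix), the approximation is

$$\log p(\theta \mid \mathcal{D}_A) \approx -\frac{1}{2}\sum_i F_i \left(\theta_i - \theta^*_{A,i}\right)^2 + \text{const},$$

a quadratic bowl around the task-A solution whose steepness in each direction $i$ is set by $F_i$.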
They created a new regularization method called elastic weight consolidation (EWC). Here it is:
1. Train on the first dataset as normal.
2. Compute the Fisher information matrix.
3. When training on a second dataset, use the following updated loss function (sketched in code after this list):

$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta^*_{A,i}\right)^2$$

Here $\lambda$ is a hyperparameter, $F_i$ is the $i$-th element on the diagonal of the Fisher information matrix, and $\theta^*_{A,i}$ is the $i$-th model parameter after training on dataset $A$.
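Here’s a minimal sketch of that loss in PyTorch. The function name, and the assumption that the Fisher diagonal and the post-task-A parameters are stored in dicts keyed by parameter name, are my own scaffolding, not code from the paper:

```python
import torch

def ewc_loss(model, task_b_loss, fisher_diag, params_after_a, lam=0.4):
    """Task-B loss plus the EWC quadratic penalty.

    fisher_diag and params_after_a map parameter names to tensors
    captured after training on dataset A; lam is the lambda
    hyperparameter (the default here is arbitrary).
    """
    penalty = torch.tensor(0.0)
    for name, theta in model.named_parameters():
        # (lambda / 2) * F_i * (theta_i - theta*_{A,i})^2, summed over i
        penalty = penalty + (lam / 2) * (
            fisher_diag[name] * (theta - params_after_a[name]) ** 2
        ).sum()
    return task_b_loss + penalty
```

During training on dataset B, you’d compute the ordinary loss on each batch (e.g. cross-entropy) and pass it in as `task_b_loss`.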
Why the Fisher information matrix?
This section borrows heavily from the Wikipedia article on Fisher information.
The score of the likelihood function $f(X;\theta)$ of a random variable $X$ with respect to a (possibly unknown) parameter $\theta$ is

$$s(\theta) = \frac{\partial}{\partial\theta} \log f(X;\theta).$$
The score has the property that its expected value is zero when $\theta$ is at its true value. The spread of the score around zero then tells us how sensitive the likelihood is to the parameter: the variance of the score measures how much information samples from the random variable $X$ give us about the parameter $\theta$. That variance is the Fisher information:

$$\mathcal{I}(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X;\theta)\right)^{2} \,\middle|\, \theta\right].$$
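As a quick worked example (the standard calculation, not from the article): for a Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma$,

$$s(\mu) = \frac{\partial}{\partial\mu}\log f(X;\mu) = \frac{X-\mu}{\sigma^2}, \qquad \mathcal{I}(\mu) = \mathbb{E}\left[\frac{(X-\mu)^2}{\sigma^4}\right] = \frac{1}{\sigma^2},$$

so the smaller the noise, the more each sample tells us about $\mu$.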
The Fisher information matrix is useful in the case where $\theta$ is actually a vector of parameters, as in our neural network. The matrix is like a covariance matrix of the score:

$$\left[\mathcal{I}(\theta)\right]_{i,j} = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta_i} \log f(X;\theta)\right)\left(\frac{\partial}{\partial\theta_j} \log f(X;\theta)\right) \,\middle|\, \theta\right].$$
It’s not so important to understand the full matrix to understand EWC, because only its diagonal is used in the algorithm. Each diagonal element is the ordinary scalar Fisher information of the random variable with respect to one parameter of the neural network. The likelihood $f(X;\theta)$ is approximated using the model’s predictions over the first dataset.
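Here’s a minimal sketch of how that diagonal could be estimated in PyTorch, assuming a classifier and an iterable `dataset_a` of `(input, label)` pairs with a known length. It uses a Monte Carlo estimate: for each example, draw a label from the model’s own predictive distribution and average the squared gradient of that label’s log-probability. The function name and the one-sample-at-a-time loop are my assumptions, not the paper’s code:

```python
import torch
import torch.nn.functional as F

def fisher_diagonal(model, dataset_a):
    """Estimate the diagonal of the Fisher information matrix:
    the average over dataset A of the squared gradient of the
    log-probability of a label drawn from the model itself."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, _ in dataset_a:  # the dataset's own labels are not needed
        log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=1)
        # Sample a class from the model's predictive distribution.
        y = torch.multinomial(log_probs.exp(), 1).item()
        model.zero_grad()
        log_probs[0, y].backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(dataset_a) for n, f in fisher.items()}
```

The returned dict plugs directly into the `fisher_diag` argument of the `ewc_loss` sketch above.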