A brief overview of why forgetting happens and strategies to combat it
Neural networks are typically trained under the useful assumption of an i.i.d. setting, which stands in contrast to the sequential problem setting of continual learning. As a result, networks trained on a sequence of tasks suffer from catastrophic interference: they forget how to perform previously learned tasks when they encounter new ones. This article digs into the reasons for forgetting, how to measure the problem, and some of the approaches proposed to reduce the loss of prior knowledge.
[Updates]
06-05-2021: Initial article publication
How humans learn is both fascinating and mysterious, especially our capability to continuously acquire new knowledge and skills without forgetting past experiences. For example, after we observe physical phenomena such as gravitation and later acquire new knowledge about how chemistry works, we can still remember what gravitation is about and explain it effortlessly. In contrast, from the machine intelligence perspective, deep learning researchers still struggle to build this lifelong learning ability into architectures such as neural networks.
Catastrophic forgetting, alternatively called catastrophic interference, was initially observed by McCloskey and Cohen
It was later framed as a broader instance of the 'plasticity-stability dilemma'
Cognitive science, in contrast, approaches the question from a different angle, studying whether knowledge acquired earlier in life is retained better than knowledge acquired later, a phenomenon known as 'The Entrenchment Effect'
When a neural network adapts too flexibly to newly incoming knowledge, it experiences catastrophic forgetting. Conversely, a network that is extremely stable becomes unable to discriminate new incoming inputs, a failure mode commonly known as catastrophic remembering
Contemporary deep learning is built on top of a weak but useful i.i.d. (independent and identically distributed) assumption: the data points are supposed to be mutually independent, meaning a single data point is unrelated to any other, and drawn from the same distribution, e.g. the training data is assumed to follow the same distribution as the test data. The common training setting therefore takes a batch of samples and updates the model parameters with respect to the loss value on that batch. However, this assumption does not hold for real-time applications with sequential data streams, such as continual learning, and the mismatch leads to catastrophic forgetting.
In short, catastrophic forgetting is the radical performance drop of a model $f(X;\theta)$, parameterized by $\theta$ with input $X$, on previously learned tasks after training on new ones — most neural networks exhibit distributed representations
As an illustration, consider figure 1 above: our neural network is trained to discriminate between two classes, cat and dog. The network is therefore trained for some epochs on a dataset containing many variants of cats and dogs. Afterwards, in task 2, we want the model to recognize two additional classes, tiger and elephant, so we train it on the task 2 dataset holding batches of tiger and elephant samples. In the continual learning setting, we are not allowed to train the model on both task datasets at once and only have access to the current dataset — the collection of tiger and elephant images in this case. As a result, the model updates its parameters to perform optimally on the present task, task 2, and forgets how to predict the task 1 classes given task 1 data; this performance drop on task 1 is what we call catastrophic forgetting.
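To make the setting concrete, here is a minimal sketch of this naive sequential training in PyTorch; the model, the data loaders, and the two-task split are hypothetical placeholders rather than the article's actual experimental setup.

```python
import torch
import torch.nn as nn

def train_task(model, loader, epochs=5, lr=1e-3):
    """Plain SGD training on a single task's data (no access to past tasks)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

@torch.no_grad()
def accuracy(model, loader):
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical loaders: task 1 = {cat, dog}, task 2 = {tiger, elephant}.
# train_task(model, task1_train_loader)
# acc_before = accuracy(model, task1_test_loader)
# train_task(model, task2_train_loader)            # only task 2 data is available
# acc_after = accuracy(model, task1_test_loader)   # typically drops sharply
```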
The standard approach for training neural network models is backpropagation with gradient-based optimization, in particular stochastic gradient descent (SGD), whose update rule
\[\theta \leftarrow \theta - \eta\frac{\partial\mathcal{L}}{\partial\theta}\]
requires a learning rate $\eta$ that tunes the magnitude of the update along the parameter gradient $\frac{\partial\mathcal{L}}{\partial\theta}$. However, networks trained by gradient-based optimization algorithms are prone to catastrophic forgetting. The primary factor is parameter drift: while training on task $t$, the network keeps taking steps that update the parameters so as to minimize the loss on that task. Thanks to Masana et al., the contributing causes can be broken down as follows.
While the network is being trained on the current task, its parameters are tuned with respect to the loss on the current task's training dataset. In other words, the network is optimized to perform maximally well on the current task by changing its parameters. The optimization and parameter updates do not take the previous tasks' distributions into account, which leads to forgetting how to perform the preceding tasks.
A direct ramification of this parameter shift is a deviation in the output distribution: the logits produced for a given input, e.g., an image from a previous task, drift away from their earlier values. A common remedy to alleviate this detriment is to distill the knowledge of the previous model into the current one
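As a rough illustration of how a distillation term can anchor the logits, here is a minimal sketch in the style of Learning without Forgetting; the temperature and the loss weighting are illustrative assumptions, not values from a specific paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, T=2.0):
    """KL-style penalty keeping the new model's soft outputs close to the outputs
    recorded from the model as it was before training on the new task."""
    old_probs = F.softmax(old_logits / T, dim=1)
    new_log_probs = F.log_softmax(new_logits / T, dim=1)
    return F.kl_div(new_log_probs, old_probs, reduction="batchmean") * (T * T)

# Combined objective on a new-task batch (alpha balances stability vs. plasticity):
# loss = F.cross_entropy(new_logits, targets) + alpha * distillation_loss(new_logits, old_logits)
```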
Another cause is decision boundary adjustment, which leads to inter-task or inter-domain misclassification due to the sequential learning setting of continual learning.
Take the example of a binary classification task — predicting whether a given input X maps to the discrete label 0 or 1 — as illustrated in figure 3
Measuring catastrophic forgetting can be separated into two perspectives: one quantifies to what extent the network is able to acquire new knowledge without forgetting, and the other examines how fast the network re-adapts to past knowledge when relearning a previous task after training on the present one. These measurements are called retention and relearning, respectively
Retention is the most commonly used measurement for continual learning, including incremental class learning and task-incremental learning, in the machine learning community today. Simply training the network until it masters task 1, moving on to task 2 and letting the network master it, and then measuring the accuracy on tasks 1 and 2 independently is one of the retention measurements
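A simple aggregate form of this measurement, reconstructed here from the symbol definitions that follow (the exact formula used by the cited work is an assumption of this write-up), is the average accuracy over all tasks seen so far:
\[\bar{A}_{T} = \frac{1}{T}\sum_{t = 1}^{T}A_{t},\]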
where $T$ is the number of tasks encountered so far and $A_{t}$ denotes the accuracy on task $t$.
However, a more elaborate set of metrics, $\Omega_{\text{base}}$, $\Omega_{\text{new}}$, and $\Omega_{\text{all}}$, has been proposed by
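These three metrics are usually written as follows (reconstructed from the symbol definitions below; the averaging over sessions $i = 2,\dots,T$ and the normalization by $\alpha_{\text{ideal}}$ should be treated as assumptions of this reconstruction):
\[\Omega_{\text{base}} = \frac{1}{T - 1}\sum_{i = 2}^{T}\frac{\alpha_{\text{base},i}}{\alpha_{\text{ideal}}},\qquad \Omega_{\text{new}} = \frac{1}{T - 1}\sum_{i = 2}^{T}\alpha_{\text{new},i},\qquad \Omega_{\text{all}} = \frac{1}{T - 1}\sum_{i = 2}^{T}\frac{\alpha_{\text{all},i}}{\alpha_{\text{ideal}}}.\]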
Here, $T$ is the total number of tasks/sessions trained so far, $\alpha_{new,i}$ denotes the accuracy on the test set for session $i$ directly after learning it, $\alpha_{base,i}$ is the accuracy on the base classes/first session after learning session $i$, $\alpha_{all,i}$ is the accuracy on all sessions given the model trained up to session $i$, and $\alpha_{\text{ideal}}$ indicates the accuracy of an offline model on the base set, which represents the ideal performance — many continual learning experiments use the multi-task (joint) learning setting as this upper bound. Dividing by $\alpha_{\text{ideal}}$ normalizes the metrics, making them easier to compare across datasets.
$\Omega_{\text{base}}$ indicates the model's retention of the first session as it is trained on later sessions. $\Omega_{\text{new}}$ measures the accuracy on session $i$ immediately after the model is trained on session $i$, i.e., the model's ability to acquire new tasks. Finally, $\Omega_{\text{all}}$ measures how well the model retains all sessions after being trained on session $i$.
Frequently overlooked in recent experiments, relearning is another essential measure of catastrophic forgetting. It was originally proposed in the psychological studies of Hermann Ebbinghaus, where it is known as 'savings', and was adopted as a metric for catastrophic forgetting by Hetherington
In practice it is measured by training the network on task 1 and task 2 sequentially, then retraining the network on the task 1 dataset and comparing the time required to learn task 1 the first time against the second time. A reduction in the time required to relearn task 1 indicates that the network still retains some of the past information.
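One way to express this as a single number, in the spirit of Ebbinghaus' savings score (the exact normalization here is an assumption for illustration, not a formula from the article), is the relative reduction in relearning time:
\[\text{savings} = \frac{t_{\text{first}} - t_{\text{relearn}}}{t_{\text{first}}},\]
where $t_{\text{first}}$ is the training time (or number of epochs) needed to reach a criterion accuracy on task 1 initially, and $t_{\text{relearn}}$ is the time needed to reach the same criterion when task 1 is revisited.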
Activation overlap was initially proposed by French
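The idea is to quantify how much two inputs activate the same hidden units. One common way to write this, reconstructed here under the assumption that overlap is taken as the element-wise minimum of the two activation vectors averaged over the $n$ units of the layer, is:
\[\text{overlap}\left( x_{1},x_{2} \right) = \frac{1}{n}\sum_{j = 1}^{n}\min\left( g_{\text{hi}}\left( x_{1} \right)_{j},\ g_{\text{hi}}\left( x_{2} \right)_{j} \right),\]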
where $g_{\text{hi}}$ indicates the parameters of hidden layer $i$ of the network and $g_{\text{hi}}\left( x \right)$ denotes the activation output for input $x$ given parameters $g_{\text{hi}}$.
Initially proposed by
given a sample $a$ and a sample $b$, pairwise interference measures how much training on sample $b$ interferes with a model previously trained on sample $a$. It can be defined as follows:
\[\text{PI}\left( \theta_{t};a,b \right) = J\left( \theta_{t + 1};a \right) - J\left( \theta_{t};a \right),\]
where $\theta_{t + 1}$ is the model obtained after training on sample $b$, and $J(\cdot)$ indicates the objective function.
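A rough sketch of how this quantity might be computed in PyTorch follows; the choice of a single SGD step on sample $b$ and cross-entropy as the objective $J$ are assumptions made for illustration.

```python
import copy
import torch
import torch.nn.functional as F

def pairwise_interference(model, a, b, lr=0.01):
    """PI(theta_t; a, b) = J(theta_{t+1}; a) - J(theta_t; a),
    where theta_{t+1} is obtained by one SGD step on sample b."""
    x_a, y_a = a
    x_b, y_b = b

    with torch.no_grad():
        loss_a_before = F.cross_entropy(model(x_a), y_a).item()

    # One gradient step on sample b, taken on a copy so the original model is untouched.
    updated = copy.deepcopy(model)
    opt = torch.optim.SGD(updated.parameters(), lr=lr)
    opt.zero_grad()
    F.cross_entropy(updated(x_b), y_b).backward()
    opt.step()

    with torch.no_grad():
        loss_a_after = F.cross_entropy(updated(x_a), y_a).item()

    # Positive values mean training on b increased the loss on a (interference).
    return loss_a_after - loss_a_before
```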
Mitigating catastrophic forgetting is nowadays largely studied within the subfield of machine learning called continual learning. Recent approaches to the problem encompass exemplar/prototype/experience rehearsal (replay buffers), parameter regularization, and architectural modification, otherwise known as the modular approach. In addition, over the past year some researchers have extended the study of catastrophic forgetting, a.k.a. continual learning, toward its connection with multi-task learning
The rehearsal/replay approach deals with catastrophic forgetting in a modest way: it replays a small memory of past knowledge, the so-called "episodic memory" (e.g., samples of images), into the current training steps while the model learns new knowledge (e.g., new classes). Catastrophic interference is thus diminished because the parameter updates are computed on batches combining the current dataset with a small buffer of replayed episodic memory; a minimal sketch of such a training step is given below. Thanks to its simplicity and effectiveness as a baseline for continual learning experiments, this technique has been widely explored and extended over the past five years.
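The sketch below assumes a simple list-based memory buffer; the buffer size, the random sampling, and the mixing ratio are illustrative choices rather than a specific published method.

```python
import random
import torch
import torch.nn.functional as F

def replay_training_step(model, optimizer, batch, memory, replay_size=32):
    """One update on the current batch mixed with samples replayed from episodic memory."""
    x_new, y_new = batch
    if memory:
        replayed = random.sample(memory, min(replay_size, len(memory)))
        x_old = torch.stack([x for x, _ in replayed])
        y_old = torch.stack([y for _, y in replayed])
        x = torch.cat([x_new, x_old])
        y = torch.cat([y_new, y_old])
    else:
        x, y = x_new, y_new

    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```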
A similar mechanism also occurs in our brain during sleep: the brain periodically reactivates and rehearses freshly acquired knowledge held in the hippocampus into more permanent memory in the neocortex. This is suggested by the theory of Complementary Learning Systems (CLS)
However, the most challenging parts of the rehearsal approach are how to sample the most significant examples and what kind of representations from the dataset should be rehearsed in future learning phases so that catastrophic interference is minimized. Much of the latest research is concerned with this issue and proposes novel sampling techniques, including random sampling, uniform sampling, and reservoir sampling (a standard version of the latter is sketched below)
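For illustration, here is the classic reservoir-sampling update for maintaining a fixed-size episodic memory over a data stream; this is the textbook algorithm rather than code from any specific continual learning paper.

```python
import random

def reservoir_update(memory, item, capacity, n_seen):
    """Keep a uniform random sample of `capacity` items from the first `n_seen + 1`
    stream items. `item` is the (n_seen + 1)-th example, e.g. an (image, label) pair."""
    if len(memory) < capacity:
        memory.append(item)
    else:
        j = random.randint(0, n_seen)  # inclusive bounds
        if j < capacity:
            memory[j] = item
    return memory
```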
Measuring the importance of past information, including parameters, with respect to the loss value and accuracy on past tasks, and restricting extreme updates to that important information while learning, is another strategy, named the regularization approach. This can be seen as a mechanism for controlling the plasticity-stability dilemma of a neural network subject to continual updates. Consequently, the restraint applied to the information of interest, such as parameters, minimizes interference with information essential for the prior tasks.
Up till now, according to
The earliest method to propose this idea was Elastic Weight Consolidation (EWC)
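For reference, the EWC objective penalizes movement of the parameters that are important, according to the Fisher information, for a previously learned task A while training on a new task B:
\[\mathcal{L}\left( \theta \right) = \mathcal{L}_{B}\left( \theta \right) + \sum_{i}\frac{\lambda}{2}F_{i}\left( \theta_{i} - \theta_{A,i}^{*} \right)^{2},\]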
where $\mathcal{L}_{B}$ is the loss on task B, $\lambda$ sets how important the old task is relative to the new one, $i$ indexes each parameter, $\theta$ denotes the current parameters, and $\theta_{A}^{*}$ the parameters learned on the old task A. As the loss equation shows, the new parameters are encouraged to stay close to the old parameters in order to alleviate forgetting, with the per-parameter strength of that constraint controlled by the Fisher information matrix $F$.
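A compact sketch of how that penalty might be added to a training loss in PyTorch, assuming a diagonal Fisher estimate `fisher` and a snapshot `old_params` keyed by parameter name have already been computed; these names are placeholders for illustration.

```python
import torch

def ewc_penalty(model, fisher, old_params):
    """Quadratic penalty sum_i F_i * (theta_i - theta*_A,i)^2 over all parameters."""
    penalty = torch.zeros(1)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return penalty

# During training on task B (lambda_ewc balances old vs. new task):
# loss = task_b_loss + (lambda_ewc / 2) * ewc_penalty(model, fisher, old_params)
```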
Architectural-based approaches, by contrast, are mainly concerned with growing the neural network progressively while learning novel tasks or knowledge, either by growing task-specific architecture
Among these is Progressive Neural Networks, proposed in 2016 and depicted in figure 6 above. The progressive networks framework addresses catastrophic forgetting by instantiating a task-specific network "column" for each task being solved. As new tasks arrive, a new column network is introduced, and previously learned features can be transferred to the new network via lateral connections. The network for the latest task is therefore able to exploit all the features learned so far; a simplified sketch of this idea is given below.
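The following is a much-simplified sketch of the column-and-lateral-connection idea for fully connected layers, not the exact architecture from the paper; the layer sizes and the way lateral features are concatenated into the new head are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ProgressiveColumns(nn.Module):
    """Toy progressive network: one hidden layer per column, old columns frozen,
    lateral connections feeding old columns' hidden features into the new head."""

    def __init__(self, in_dim=784, hidden=128):
        super().__init__()
        self.in_dim, self.hidden = in_dim, hidden
        self.columns = nn.ModuleList()   # one hidden layer per task
        self.heads = nn.ModuleList()     # one classifier head per task

    def add_column(self, num_classes):
        # Freeze all previously trained columns and heads.
        for module in list(self.columns) + list(self.heads):
            for p in module.parameters():
                p.requires_grad = False
        self.columns.append(nn.Linear(self.in_dim, self.hidden))
        # The new head sees its own column plus lateral features from all old columns.
        self.heads.append(nn.Linear(self.hidden * len(self.columns), num_classes))

    def forward(self, x, task_id=-1):
        task_id = task_id % len(self.columns)
        feats = [torch.relu(col(x)) for col in self.columns[: task_id + 1]]
        return self.heads[task_id](torch.cat(feats, dim=1))

# Usage sketch: net.add_column(num_classes=2)  -> train on task 1
#               net.add_column(num_classes=2)  -> train on task 2; task-1 column is frozen
```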
[Notes]
If you have any disagreement, correction, or critique regarding this article, feel free to email me; I will happily adjust and modify the published content according to the corrections.