Transfer learning is a process through which a machine learning model trained for one task is fine-tuned for a new task. Through this process, the knowledge that the earlier model learned is transferred to the new task, which can take advantage of the learned patterns. In human learning, transfer is defined as the ability to apply skills learned in one subject to a different context, such as applying knowledge of physics to economics.
Neural networks are functions that can, in principle, approximate a very wide class of functions to any desired precision, given enough capacity, data, and training time. They are a combination of a mathematical function (the architecture) and a set of parameters. For similar application areas, the mathematical function is the same but the parameters vary depending on which task the model has been trained on. Training involves feeding data into the function to create a prediction (called the forward pass). The prediction is then compared to the expected output, and the loss (also called the cost) is calculated. The loss measures how much error the model has made. Using the chain rule from calculus, a derivative (the gradient) is calculated that indicates how nudges to the parameters increase or decrease the loss. Once this is calculated, the parameters are adjusted slightly in the direction that reduces the loss. This process is repeated until the loss is acceptable or a certain time limit is reached.
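To make the training loop concrete, here is a minimal sketch in plain Python. It assumes a toy model with just two parameters (`w` and `b`) and a mean squared error loss; the function name `train`, the data, and the learning rate are illustrative choices, not part of any particular library.

```python
# Toy model: prediction = w * x + b, trained with plain gradient descent.
def train(xs, ys, lr=0.05, steps=1000):
    w, b = 0.0, 0.0  # the parameters of the model
    n = len(xs)
    for _ in range(steps):
        # Forward pass: compute predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Gradients of the mean squared error loss with respect to w and b,
        # obtained with the chain rule.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Nudge each parameter in the direction that reduces the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b, loss = train(xs, ys)
```

A real network has millions of parameters and relies on a framework to compute the gradients automatically, but the loop has the same shape: forward pass, loss, gradient, small update.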
Deep learning uses neural networks that have multiple layers. On some benchmarks it has matched or exceeded human performance in computer vision tasks such as image classification (telling what objects are in an image) and object detection (telling where objects are in an image), and it has done well in many other application areas. To train deep neural networks, millions of examples are fed through the network and, over multiple iterations on GPUs (graphics processing units), the networks learn to perform their set tasks with impressive accuracy. This process is both energy- and time-intensive.
What Neural Networks Learn
Researchers (Yosinski et al., 2014) showed that neural networks learn in an interesting pattern. For computer vision tasks, the first layers learn to detect color blobs, Gabor filters, and edges, the middle layers learn to detect shapes, and the later layers learn the specifics of the task at hand. In other words, the first layers learn general features of images while the later layers specialize.
This was also explored (Zeiler and Fergus, 2013), and it was shown that in image classification tasks we can clearly see how the network learns general features in the earlier layers and specializes as the network deepens.
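[Table omitted: example features detected at successive layers, ending with high-level features such as a dog's face.]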
In normal computer programs, programmers tell the computer the exact steps it needs to accomplish a task. However, in neural networks, we create a set of parameters called weights (Samuel, 1962), feed labeled examples to the model, and train it to adjust the weights to match the task at hand. At the end of the training, the parameters of a model encode the knowledge that it has learned. In the above example, the parameters for layer 1 encode the knowledge to detect color blobs and filters.
The same happens for models trained on different tasks such as natural language processing and tabular data. When trained on tabular data, the parameters encode similar knowledge about the dataset in the same hierarchy where the earlier layers learn general features and later layers specialize. This means that instead of starting from scratch when we train a model, we can share knowledge between machine learning models by sharing the learned parameters.
This technique allows models to be pre-trained once on extremely large datasets, after which the parameters are used as seeds for new models in similar domains. New models trained in this fashion retain the knowledge learned from the bigger dataset and apply it to the task at hand. This sounds too good to be true, and it does have its gotchas, which we look at below.
How to transfer knowledge between models
The process of transferring knowledge between a pre-trained model and a new model is called fine-tuning. This is the process by which we take the larger, general pre-trained parameters and specialize them for the specific task at hand. It is like taking a pre-made dough and adding your special ingredient to make a new delicious recipe.
Fine-tuning is normally done by chopping off the last few layers of a network, which have specialized to the previously trained task, and adding new, randomly initialized layers in their place. Training then happens in two stages. First, we freeze the earlier layers (to preserve what they have learned) and train only the newly added layers, so that their random parameters are adjusted to work well with the frozen layers before them. Then the network is unfrozen and the whole model is further fine-tuned to the new task.
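The freeze-then-unfreeze procedure can be sketched with a deliberately tiny model. This is only an illustration under stated assumptions: the "network" is `y = head * (body * x)`, where `body` stands in for the pre-trained layers and `head` for the newly added layer. The helper `fine_tune`, the data, and the learning rates are all hypothetical.

```python
import random


def fine_tune(body, head, xs, ys, lr, steps, freeze_body):
    """Train y = head * (body * x); skip body updates when it is frozen."""
    n = len(xs)
    for _ in range(steps):
        grad_body = grad_head = 0.0
        for x, y in zip(xs, ys):
            feature = body * x        # output of the pre-trained layer
            err = head * feature - y  # prediction error
            grad_head += 2 * err * feature / n
            grad_body += 2 * err * head * x / n
        head -= lr * grad_head
        if not freeze_body:
            body -= lr * grad_body
    return body, head


xs = [0.5, 1.0, 1.5]
ys = [3.0 * x for x in xs]  # the new task: y = 3x

pretrained_body = 2.0  # stands in for weights learned on the old task
new_head = random.Random(0).uniform(-0.1, 0.1)  # randomly initialized layer

# Stage 1: freeze the body and train only the new head.
body, head = fine_tune(pretrained_body, new_head, xs, ys,
                       lr=0.05, steps=100, freeze_body=True)
stage1_body = body  # identical to pretrained_body, since it was frozen

# Stage 2: unfreeze everything and fine-tune with a smaller learning rate.
body, head = fine_tune(body, head, xs, ys,
                       lr=0.01, steps=100, freeze_body=False)
```

In stage 1 the pre-trained value is left untouched while the random head catches up. By the time the body is unfrozen, the gradients are small, so the pre-trained value moves very little.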
While retraining, special care has to be taken to make sure that we don’t ruin the pre-trained parameters while training for the new task. This is done with a technique called discriminative learning rates, where the earlier layers are adjusted more slowly than the newly added, randomly initialized layers. This helps preserve the color blob, filter, and edge detection skills learned earlier on the larger dataset.
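One simple way to picture discriminative learning rates is to give each group of layers its own learning rate, shrinking it for earlier groups. The sketch below assumes geometric spacing by a fixed factor; `discriminative_lrs`, `sgd_step`, and the default values are hypothetical names and numbers chosen for illustration, not the API of any library.

```python
def discriminative_lrs(n_groups, max_lr=1e-3, factor=10.0):
    """Return one learning rate per layer group, smallest for the earliest."""
    return [max_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


def sgd_step(params, grads, lrs):
    """Update each layer group with its own learning rate."""
    return [p - lr * g for p, g, lr in zip(params, grads, lrs)]


# Three layer groups: early (general), middle, and the newly added head.
lrs = discriminative_lrs(3)

# With identical gradients, the earliest group moves the least.
new_params = sgd_step([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], lrs)
```

Deep learning frameworks generally support the same idea by letting the optimizer take a separate learning rate for each group of parameters.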
Not all applications of deep learning have large pre-trained models available. When this is the case, another technique called progressive resizing comes to the rescue. It works by training the model on smaller versions of the images in our dataset, and then progressively increasing the size of the images used in training until we hit the limit of our available image size. This works in much the same fashion as transfer learning, since the model learns skills on the small images that it can apply to the larger images. Because the model sees each image at several resolutions, this also acts as a form of data augmentation, which can improve accuracy and generalization.
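As a rough sketch of progressive resizing, the code below builds a schedule of image sizes and resizes an image at each stage. Everything here is an assumption for illustration: `size_schedule` and `downsample` are hypothetical helpers, the image is a plain list of rows, and the actual training step is left as a comment.

```python
def size_schedule(start, final, stages):
    """Evenly spaced image sizes from start up to final (inclusive)."""
    if stages < 2:
        return [final]
    step = (final - start) / (stages - 1)
    return [round(start + i * step) for i in range(stages)]


def downsample(image, size):
    """Nearest-neighbour resize of a square image given as a list of rows."""
    src = len(image)
    return [[image[r * src // size][c * src // size] for c in range(size)]
            for r in range(size)]


# A stand-in 8x8 "image" whose pixel values are just their positions.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

sizes_seen = []
for size in size_schedule(2, 8, 3):
    small = downsample(image, size)
    sizes_seen.append(len(small))
    # Train for a few epochs on images of this size here, keeping the
    # weights from the previous stage as the starting point.
```

The key point is that the model is not reset between stages: the weights learned at one size seed the training at the next.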
Model zoos contain freely available large pre-trained models for various tasks. When starting on a deep learning project, it helps to look for available models to transfer knowledge from. It is important to pay attention to how the model was trained and to retrieve specific configurations, such as the mean and standard deviation of the training data, since these must also be used when training our new model.
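For example, if we assume the pre-trained model was trained on ImageNet with the commonly published per-channel statistics, the new data should be normalized with those same numbers rather than with statistics computed from our own dataset. The helper `normalize_pixel` is a hypothetical name, and the right statistics for a given model should always be checked against its documentation.

```python
# Commonly published ImageNet channel statistics (RGB, on a 0-1 scale).
# Assumption: the pre-trained model was trained with these values.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale 0-255 channel values to [0, 1], then standardize per channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, mean, std))


black = normalize_pixel((0, 0, 0))
# A pixel equal to the training mean maps to roughly zero in every channel.
mean_pixel = normalize_pixel(tuple(m * 255.0 for m in IMAGENET_MEAN))
```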
Transfer learning clearly has its benefits, but it does have an Achilles heel. The main problem is catastrophic forgetting. It has been shown that transfer learning does not work well when the new task differs too much from the original one. Unlike humans, who can easily transfer what they have learned to a different context, neural networks struggle to generalize beyond the domain of the task they were trained on. This leads to some or most of the knowledge being lost in the fine-tuning phase. Discriminative learning rates try to address this issue.
Furthermore, when using progressive resizing with a model that was pre-trained on a dataset similar to our task, care must be taken not to use lower-resolution images. Training on images that are similar to, but smaller than, those the model was pre-trained on can damage the pre-learned weights, so we have to be careful when fine-tuning with similar datasets.
Another issue to pay attention to is domain shift. The type of data a company collects changes over time: the world keeps evolving, human behavior changes, and decisions change depending on context. What the pre-trained model learned may no longer be relevant to the current use case, and the transfer would result in a model that no longer represents the ground truth.
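A very simple way to watch for domain shift is to compare a feature's recent values against the values seen at training time. The sketch below is only one of many possible checks: `drift_score`, the sample numbers, and the threshold of 2.0 are hypothetical and would need tuning for a real dataset.

```python
import statistics


def drift_score(train_values, new_values):
    """Shift of the new mean from the training mean, in training std units."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(new_values) - mu) / sigma


train_values = [10, 12, 11, 9, 10, 11, 12, 9]  # feature seen during training
recent_same = [10, 11, 12, 9]                  # new data, same distribution
recent_shifted = [15, 16, 17, 14]              # new data after a shift

THRESHOLD = 2.0  # hypothetical cut-off for raising a warning
same_flag = drift_score(train_values, recent_same) > THRESHOLD
shifted_flag = drift_score(train_values, recent_shifted) > THRESHOLD
```

A flag like this does not fix the problem, but it can signal when a pre-trained or previously fine-tuned model should be re-examined.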
Apart from domain shift, the data that models are trained on normally contains different types of bias. Even if the data sampling were perfect, humans are biased and that bias is encoded in the data. Historical biases such as racial and gender inequity could be present in the pre-trained model. When we transfer, the same bias may be carried over into our new model.
Writing this, a few ideas popped into my mind that I would like to research and cover in future posts. First, since deep neural networks can contain many layers, how can we know where the generalization stops and the specialization starts? How do the skills learned by the layers develop as we go deeper into the network? In computer vision applications, deconvolutional networks can be used to see what each layer learns, and I wonder if we can apply them to help practitioners using transfer learning decide where to start freezing the network, given different pre-trained models and tasks to be fine-tuned.
I am also excited about the prospect of combining knowledge graphs, which represent knowledge in a human and machine-readable format, and graph neural networks to address the shortcomings of transfer learning. Even though we can see what the layers are learning, it is hard to see what was in the training data for the sake of investigating bias and domain shift. Knowledge graphs would enable humans in the loop to apply expert knowledge and data engineering to further enable successful transfer.
Are you interested in any of these topics? I would love to chat!
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27 (NIPS ’14), NIPS Foundation, 2014.
Zeiler, M. D. and Fergus, R. Visualizing and Understanding Convolutional Networks, 2013.
Samuel, A. L. Artificial Intelligence: A frontier for automation, 1962.