Slimming Down AI Models: the Power of Distillation
Jordan Zeiger, Tensility Intern and Undergraduate Student at Cornell University College of Engineering
Armando Pauker, Managing Director at Tensility Venture Partners
Wayne Boulais, Managing Director at Tensility Venture Partners
Introduction
The big headlines in the AI world seem to showcase only the release and development of ever larger models with capabilities across many disciplines. OpenAI’s GPT-4 is estimated to have an astonishing 1.76 trillion parameters [1] - a huge step up from GPT-3, which had 175 billion parameters just three years earlier. However, as impressive as these models are, they come with significant drawbacks in cost and performance when used in high-volume, low-latency applications. Big companies and startups alike have started to realize this, and the trend in AI is shifting toward smaller, lighter models [2].
This blog highlights a way to achieve performance goals without the hefty price tag: model distillation. Model distillation, also known as knowledge distillation, works by transferring knowledge from a large, complex model to a smaller, more efficient one, significantly reducing costs while maintaining low latency. This approach not only lowers the cost of using Large Language Models (LLMs) but also enables tailored solutions for specific tasks, making model distillation a compelling strategy for a wide range of applications. Model distillation is one method to instantiate Small Language Models (SLMs), whose rise and potential applications we discussed in previous blogs [3, 4]. This approach becomes more compelling as companies contemplate how to move from proprietary models like OpenAI’s GPT and Google’s Gemini to open-source models like Llama and Mistral.
Model Distillation Background
Model distillation is a method where knowledge from a large, sophisticated model, known as the teacher, is transferred to a smaller, more efficient model, known as the student. This technique, introduced in a breakthrough 2015 paper by Hinton et al. titled “Distilling the Knowledge in a Neural Network” [5], allows the student model to replicate much of the teacher's performance with far fewer parameters. The general method is illustrated in Figure 1 below. In effect, the student learns to mimic the teacher, capturing its essential features and patterns in a much lighter package. This process ensures that the distilled model remains powerful yet lean.
Figure 1: The generic teacher-student framework for knowledge distillation. (Source: https://arxiv.org/abs/2006.05525)
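To make the teacher-student idea concrete, the sketch below shows the classic soft-target objective from Hinton et al. [5] in PyTorch. It is a minimal illustration rather than a production recipe: the logits, labels, and hyperparameters such as temperature and alpha are placeholders that would come from your own models and tuning.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-target loss that pushes
    the student toward the teacher's softened output distribution."""
    # Soft targets: compare temperature-softened distributions with KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    soft_loss = soft_loss * (temperature ** 2)  # gradient scaling as in Hinton et al.

    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a training loop, each batch would be run through both the frozen teacher and the trainable student, and only the student's weights would be updated against this combined loss.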
Why Model Distillation?
Model distillation offers several significant advantages that make it an attractive alternative to using an off-the-shelf LLM. While the latest LLMs are impressive in their range of capabilities across applications from poetry to medicine to math problems, SLMs excel when built for specific implementations. Recent research has shown that distillation can reduce a model's size by 40% while retaining 97% of its language understanding capabilities and running 60% faster [6]. Applications can use a distilled model for their needs without incurring the prohibitive latency and costs of operating a full-scale LLM.
Another enticing reason to use model distillation is vendor independence, or risk reduction - in essence, having control over one's own technology. When relying on an LLM trained and operated by another company, that control is lacking: changes to the LLM's weights or training data can be made at any moment, potentially disrupting operations. By distilling into a model it owns, a company maintains complete control, ensuring stability and reliability while preserving the capabilities it needs from the original model.
One important caveat is that while a distilled LLM can be an expert at a specific task, it will lose some accuracy and capability on the vast variety of tasks where the full-size LLM may have excelled. This is a direct consequence of the parameter reduction involved in moving from a large model to a small one. For example, an SLM trained to perform well only on sales for an e-commerce application may not be adept at SAT questions or physics problems. This tradeoff must be a conscious decision when moving to a specifically trained, smaller model.
Types of Model Distillation
Understanding the different types of model distillation helps in choosing the right approach for specific needs. Under the umbrella of model distillation there are numerous types and variations of methods, but broadly all of them can be described as either white box or black box distillation [7]. White box distillation requires full access to the internal workings of the teacher model, meaning the ability to inspect and use its intermediate representations and feature maps during the distillation process. The primary advantage of white box distillation is that this insight into the teacher model's behavior allows for a more comprehensive transfer of knowledge, potentially leading to better performance of the student model. However, the process can be resource-intensive due to the need to process and transfer intermediate data. Further, white box distillation may not always be possible, as full access to the internal workings of the teacher model may not be feasible - especially when using a proprietary LLM as the teacher.
Figure 2 below illustrates how distillation can draw on knowledge from many different parts of the teacher model. The white box approach presumes complete knowledge of and access to the workings of the teacher model, and is therefore able to utilize all types of knowledge labeled in Figure 2.
Figure 2: White box and black box approaches to model knowledge distillation, image created by Tensility Venture Partners.
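As an illustration of what white box access enables, the hedged sketch below combines the soft-target loss shown earlier with a feature-matching term on an intermediate hidden state. The class name, the learned projection layer, and the loss weights are illustrative assumptions; which layers are matched and how they are weighted varies by architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class WhiteBoxDistillLoss(nn.Module):
    """Illustrative white box objective: match the teacher's output
    distribution and one of its intermediate hidden states."""

    def __init__(self, student_dim, teacher_dim, temperature=2.0,
                 logit_weight=1.0, hidden_weight=0.5):
        super().__init__()
        # Learned projection reconciles the student's hidden size with the teacher's.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.temperature = temperature
        self.logit_weight = logit_weight
        self.hidden_weight = hidden_weight

    def forward(self, student_logits, teacher_logits,
                student_hidden, teacher_hidden):
        t = self.temperature
        # Response-based term: softened KL between output distributions.
        logit_loss = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean",
        ) * (t ** 2)

        # Feature-based term: only possible with white box access to the
        # teacher's intermediate representations.
        hidden_loss = F.mse_loss(self.proj(student_hidden), teacher_hidden)

        return self.logit_weight * logit_loss + self.hidden_weight * hidden_loss
```

The hidden-state term is exactly what becomes unavailable in the black box setting discussed next.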
If white box access is unavailable, black box distillation is an alternative method. This scenario is increasingly common and especially important as companies consider how to move from initial, proof-of-concept implementations with proprietary models like OpenAI’s GPT and Google’s Gemini to full-volume production with open-source models - a transition that is attractive for the reasons described above. As shown in Figure 2 above, this approach has no access to the hidden layers: black box distillation needs only the inputs and outputs of the teacher model, with its internal workings remaining hidden. The student model is trained to mimic the teacher's output responses. This is a much simpler and less resource-intensive approach, since it does not require access to the teacher's internal architecture. Because so little access is required, it is possible to distill even the most advanced and restricted models whose internals are unknown, which makes black box distillation especially attractive for the move to small, open-source models.
The drawback of this method is that the student model may not capture the nuanced behaviors or structures of the teacher model, since it learns solely from inputs and final outputs with no additional information. The main way to mitigate this is to train on an abundance of data covering as many edge cases as possible, with limited reliance on synthetic data. What matters most about this data is that it is varied: because black box distillation provides no access to the teacher's internal structure, a varied dataset that captures the outliers is the only way to expose the student to the full range of the teacher's nuances and intricacies.
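A minimal sketch of the black box workflow appears below, assuming the teacher is reachable only through an API and the student is a Hugging Face-style causal language model. Here `query_teacher`, `student`, `tokenizer`, and `optimizer` are placeholders for whatever teacher endpoint and open-source model a team actually uses.

```python
def build_distillation_set(prompts, query_teacher):
    """Collect (prompt, response) pairs from the black box teacher.
    Variety in the prompts, including edge cases, matters more than sheer volume."""
    return [(prompt, query_teacher(prompt)) for prompt in prompts]

def distill_step(student, tokenizer, prompt, response, optimizer, device="cpu"):
    """One sequence-level distillation step: ordinary next-token
    cross-entropy on the teacher's response text."""
    batch = tokenizer(prompt + response, return_tensors="pt").to(device)
    # Hugging Face causal LMs compute the shifted cross-entropy loss
    # internally when labels are supplied.
    outputs = student(**batch, labels=batch["input_ids"])
    loss = outputs.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that this amounts to supervised fine-tuning on teacher outputs (sometimes called sequence-level distillation); no logits, hidden states, or temperature-scaled soft targets are involved, which is precisely why it works against a closed API.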
Conclusion
Model distillation represents a powerful strategy in the AI toolkit, addressing challenges such as transitioning from proprietary models to open-source alternatives and optimizing for edge computing. As AI continues to advance, the move from proprietary LLMs such as GPT to open-source models like Llama and Mistral will be an important transition, one that can dramatically reduce costs and latency. The growing interest in deploying AI on edge devices with limited resources further expands the range of applications that model distillation can enhance. The overarching goal is to increase the likelihood that AI applications will meet their success criteria in production environments.
By creating their own distilled models, companies can also achieve vendor independence, gaining control over their own futures and mitigating risks associated with uncontrolled changes in third-party LLMs. This control ensures stability and reliability, safeguarding operations from potential disruptions. Businesses are encouraged to explore model distillation and other optimization methods to enhance their AI strategies and remain competitive in an increasingly AI-driven world.