The foundational AI language models that can be used as teachers for transfer learning include:
GPT-3 (Generative Pre-trained Transformer 3): GPT-3 is a state-of-the-art language model developed by OpenAI. It is known for its ability to perform a wide range of natural language processing tasks and has been used for transfer learning in various applications
BERT (Bidirectional Encoder Representations from Transformers): BERT, developed by Google, is another influential language model that has been widely used for transfer learning in natural language processing tasks. It has been employed for tasks such as text classification, named entity recognition, and question answering
T5 (Text-to-Text Transfer Transformer): T5 is a versatile language model developed by Google that has shown effectiveness in transfer learning for various NLP tasks, including translation, summarization, and question answering
These models have been pre-trained on large datasets and can be fine-tuned for specific tasks using transfer learning techniques, making them valuable resources for developing custom AI language models
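As a toy illustration of that fine-tuning idea, the sketch below freezes a stand-in "pre-trained" backbone and trains only a small new classification head on labeled task data. Everything here is hypothetical (the backbone, dataset, and learning rate are made up for demonstration); a real fine-tune would load actual pre-trained weights.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for a pre-trained backbone: its weights are frozen,
# so fine-tuning only trains the small task-specific head on top of it.
def frozen_backbone(x):
    # Fixed non-linear feature map "learned" during pre-training.
    return [math.tanh(x), math.tanh(2.0 * x - 1.0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy labeled data for the downstream task: label is 1 when x > 0.5.
data = [(x / 10.0, 1.0 if x / 10.0 > 0.5 else 0.0) for x in range(-10, 11)]

# New classification head: the only trainable parameters.
w = [0.0, 0.0]
b = 0.0
lr = 0.5

for _ in range(200):
    for x, y in data:
        feats = frozen_backbone(x)      # forward pass through frozen layers
        p = sigmoid(w[0] * feats[0] + w[1] * feats[1] + b)
        grad = p - y                    # dLoss/dlogit for cross-entropy
        w[0] -= lr * grad * feats[0]
        w[1] -= lr * grad * feats[1]
        b -= lr * grad

def predict(x):
    feats = frozen_backbone(x)
    return sigmoid(w[0] * feats[0] + w[1] * feats[1] + b) > 0.5

accuracy = sum(predict(x) == (y == 1.0) for x, y in data) / len(data)
print(f"fine-tuned head accuracy: {accuracy:.2f}")
```

The same pattern applies at scale: freeze (or lightly train) the pre-trained layers and attach a fresh output layer for the new task.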
Two other notable language models are Claude and LLaMA.
Claude: Claude is a cutting-edge language model developed by Anthropic, designed to advance natural language processing (NLP) research. It uses constitutional AI, guiding outputs with a set of principles so that the assistant is helpful, harmless, and accurate. Claude has been trained on internet data, code, instructions, and human feedback to ensure the quality of the model
LLaMA (Large Language Model Meta AI): LLaMA is a new open-source large language model developed by Meta AI. It is designed to be a versatile and powerful model that can be used for various tasks, including query resolution, natural language comprehension, and reading comprehension. LLaMA is still under development and is intended for educational applications, making it suitable for AI assistance in EdTech platforms
Your project will be to create your own generative AI language model using the Student-Teacher training pattern.
We will demonstrate this using Google Colab as our build platform.
How do you deploy a big fat deep learning model in production?
One of the best techniques is knowledge distillation.
To deploy a big deep learning model in production, knowledge distillation is currently the accepted best-practice technique: train a smaller model to mimic a larger one's generalization power by using its output probability distribution as a soft target, cutting latency roughly in half with minimal accuracy loss
Knowledge distillation trains a smaller model, the student, to mimic the generalization power of a larger model, the teacher.
How is this different from training a model from scratch?
With more complex models, the theoretical search space is larger than that of a smaller network.
If we assume that the same (or even similar) convergence can be achieved using a smaller network, then the convergence space of the Teacher Network should overlap with the solution space of the student network.
Unfortunately, that alone does not guarantee convergence of the student network at the same location. The student network may converge to a solution far from that of the teacher network.
However, if the student network is guided to replicate the behavior of the teacher network (which has already searched through a bigger solution space), it is expected to have its convergence space overlapping with the original Teacher Network convergence space.
Teacher-Student networks: how exactly do they work?
1. Train the Teacher Network : The highly complex teacher network is first trained separately on the complete dataset. This step requires high computational performance and thus can only be done offline (on high-performance GPUs).
An example of a highly complex and deep network which can be used as a teacher network : GoogLeNet
2. Establish Correspondence : While designing a student network, a correspondence needs to be established between intermediate outputs of the student network and the teacher network. This correspondence can involve directly passing the output of a layer in the teacher network to the student network, or performing some data augmentation before passing it to the student network.
An example of establishing correspondence
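One simple form of correspondence can be sketched as follows (a hypothetical toy example, in the spirit of hint-based training: the teacher's hidden layer is wider than the student's, so a small linear projection maps the student's activations into the teacher's space, where an L2 loss compares them):

```python
import random

random.seed(0)

def teacher_hidden(x):
    # Stand-in for a frozen intermediate layer of the teacher (4 units).
    return [x, 2.0 * x, -x, 0.5 * x]

def student_hidden(x, w):
    # The student's trainable, narrower intermediate layer (2 units).
    return [w[0] * x, w[1] * x]

def hint_loss(w, proj, xs):
    # Total squared error between projected student and teacher activations.
    total = 0.0
    for x in xs:
        s = student_hidden(x, w)
        for i in range(4):
            pred = sum(proj[i][j] * s[j] for j in range(2))
            total += (pred - teacher_hidden(x)[i]) ** 2
    return total

xs = [x / 5.0 for x in range(-5, 6)]
w = [random.uniform(0.5, 1.0) for _ in range(2)]
# Projection matrix (2 -> 4) establishing the correspondence; trained jointly.
proj = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
lr = 0.01

loss_before = hint_loss(w, proj, xs)
for _ in range(500):
    for x in xs:
        s = student_hidden(x, w)
        t = teacher_hidden(x)
        err = [sum(proj[i][j] * s[j] for j in range(2)) - t[i] for i in range(4)]
        for i in range(4):            # gradient step on the projection...
            for j in range(2):
                proj[i][j] -= lr * err[i] * s[j]
        for j in range(2):            # ...and on the student layer itself
            w[j] -= lr * sum(err[i] * proj[i][j] for i in range(4)) * x
loss_after = hint_loss(w, proj, xs)

print(f"hint loss: {loss_before:.3f} -> {loss_after:.3f}")
```

The projection layer here plays the role of the "data augmentation or transformation" step: it adapts the teacher's intermediate output so it can be compared against a differently shaped student layer.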
3. Forward Pass through the Teacher network : Pass the data through the teacher network to get all intermediate outputs and then apply data augmentation (if any) to the same.
4. Backpropagation through the Student Network : Now use the outputs from the teacher network and the correspondence relation to backpropagate error in the student network, so that the student network can learn to replicate the behavior of the teacher network.
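The four steps above can be sketched end-to-end on a toy problem. This is a minimal, hypothetical example: the "teacher" is just a fixed logistic function standing in for an already-trained network, and the student is trained against the teacher's soft outputs rather than hard labels.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 1: a teacher that has already been trained (here, fixed weights).
def teacher_prob(x):
    return sigmoid(5.0 * x - 1.0)

xs = [x / 10.0 for x in range(-10, 11)]

# Steps 3-4: forward-pass the data through the teacher, then backpropagate
# the student against the teacher's soft outputs instead of hard labels.
w, b = 0.0, 0.0          # student parameters (a smaller model in practice)
lr = 0.5

for _ in range(300):
    for x in xs:
        target = teacher_prob(x)      # soft target from the teacher
        p = sigmoid(w * x + b)
        grad = p - target             # cross-entropy gradient w.r.t. the logit
        w -= lr * grad * x
        b -= lr * grad

# The student should now closely track the teacher's output distribution.
max_gap = max(abs(sigmoid(w * x + b) - teacher_prob(x)) for x in xs)
print(f"max |student - teacher| gap: {max_gap:.3f}")
```

In a real setting the student would be a genuinely smaller network and the targets full probability distributions over many classes, but the training loop has the same shape.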
There have been many modifications suggested to the traditional student-teacher setup described above, such as introducing multiple teachers (i.e., converting an ensemble into a single network) or introducing a teaching assistant (the teacher first teaches the TA, who in turn teaches the student). However, the field is still pretty young and quite unexplored in many dimensions.
The key idea is that, instead of training the student with the same labeled data as the teacher, we use the output probability distribution of the teacher as a soft target to train the small model.
In a standard training process, the teacher learns to discriminate between many classes and maximizes the probability of the correct label.
However, a side-effect is that the model assigns smaller probabilities to other classes, which can give us a lot of knowledge about how the model generalizes.
For example, an image of a cat can have a low probability of being mistaken for a tiger, but that mistake is still many times more probable than mistaking it for a chair. We can take advantage of this knowledge to improve the student.
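The cat/tiger/chair intuition can be made concrete with made-up teacher logits (the numbers below are purely illustrative): the softmax assigns both wrong classes small probabilities, but of very different magnitudes, and that ratio is exactly the "dark knowledge" the student can learn from.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for an image of a cat.
logits = {"cat": 9.0, "tiger": 4.0, "chair": -3.0}
probs = dict(zip(logits, softmax(list(logits.values()))))

for label, p in probs.items():
    print(f"{label}: {p:.6f}")

# Tiger is unlikely, but still orders of magnitude likelier than chair.
print(f"tiger/chair ratio: {probs['tiger'] / probs['chair']:.0f}")
```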
The student is a network with fewer parameters than the teacher.
It is recommended to use a similar structure; for example, if we want BERT as a teacher, we can use DistilBERT, which has 40% fewer parameters.
The student training loss is a combination of the original training loss of the teacher and a distillation loss.
The distillation loss softens the probability distribution over classes by scaling the logits with a temperature parameter before the softmax.
The effect of this temperature is to reduce the higher probabilities and increase the smaller ones, producing a softer distribution that carries more knowledge.
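A quick sketch of the temperature effect, using the same kind of hypothetical logits as before: dividing the logits by a temperature T > 1 before the softmax shrinks the top probability and lifts the small ones.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [9.0, 4.0, -3.0]   # hypothetical teacher logits

hard = softmax(logits, temperature=1.0)
soft = softmax(logits, temperature=4.0)

print("T=1:", [f"{p:.4f}" for p in hard])
print("T=4:", [f"{p:.4f}" for p in soft])
```

During distillation both teacher and student use the same temperature on their logits, and the student's loss mixes this softened term with the ordinary hard-label loss.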
Knowledge distillation can cut the latency of a machine learning model by half at the cost of reducing the model's accuracy minimally.