
w24 AML3304 Project: Build the Generative AI Language Model

Project Purpose and Scope:
Create a chat bot: write your own Python code; you do not need to purchase any APIs.

Optional: if you want to do more (an image generator, or something based on the OpenAI Assistant API).

To purchase an API key from OpenAI, you can follow these general steps:

Go to OpenAI's Platform website at platform.openai.com and sign in with an OpenAI account.
Click your profile icon at the top-right corner of the page and select "View API Keys."
Click "Create New Secret Key" to generate a new API key.
You can create an OpenAI API key for free, and new free-trial users receive $5 (USD) worth of credit, which expires after three months. Once your credit has been used up or has expired, you can enter billing information to continue using the API of your choice.
There are different models available, such as GPT-3 and GPT-4, and the process for obtaining an API key may vary based on the specific model you want to access.
Keep your API key secure: it provides access to OpenAI's powerful AI models and should not be shared or stored as plain text.

Resources:


Work in a team of up to 4 people.
Hand-in format (depends on how you do the project):
Make one TEXT file per team, named TeamName.txt.
Into this text file, put:
names and student IDs of all members
URL of your hand-in, depending on the format you chose:
Colab workbook: share link (with editor access)
GitHub URL of your code: make the instructor a member of your repository
Options for how to do this work:
Google Colab Notebook

The deliverable for the project is a Python generative text AI model.

Trained on any topic of interest to you.
It is OK to have just a text console interface. I will ask questions and evaluate the responses.


Objective:

By the end of this lab, students will learn how to create embeddings using Hugging Face's transformers library and Spaces API.
Learning Outcomes:
Using transformers with embeddings:
- what embeddings are
- why they are important
- how they can be applied in various machine learning tasks, such as building the AI language model

Prerequisites:

Basic understanding of Python programming
Familiarity with machine learning concepts {application of Bayesian training and the Inference API}
An account on Hugging Face (signup at https://huggingface.co/join)

Tools:

Jupyter Notebook or any Python environment {Google Colab Notebook}
Hugging Face transformers library

Lab Overview:

Peter’s Demo Notebook:
Understanding Embeddings and Transformers, and the relationship between them
Setting Up Environment [pip install our libraries]
Using Hugging Face Transformers Library for Embeddings
Exploring Hugging Face Spaces
Creating an Embedding Using the Spaces API

1. Understanding Embeddings and Transformers

Embeddings are dense vector representations of text where words that have similar meaning have a similar representation. They are instrumental in NLP tasks as they capture semantic information and contextual cues. We will use the transformers library from Hugging Face, which provides pre-trained models like BERT, GPT-2, etc., that can be used to generate embeddings.
Transformers are deep learning models that use self-attention mechanisms to understand the context of words in a sentence. The transformers library provides a unified API for using these models.

2. Setting Up Environment

First, ensure that you have Python installed, preferably within an Anaconda environment or a virtual environment. You will need to install the transformers library from Hugging Face. Run the following command to install it:
pip install transformers
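The embedding examples later in this lab also use PyTorch. Google Colab usually ships with it preinstalled; if your local environment does not, you can install it alongside transformers (a minimal sketch, assuming a pip-based environment):
pip install torch transformers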

3. Using Hugging Face Transformers Library for Embeddings

Start by importing the required classes from the transformers library:
from transformers import AutoTokenizer, AutoModel

In your Trello board or Google Colab text-cell documentation:
Provide documentation and an explanation of what these classes are and how they work.
{After the course is over: post your Colab notebook as a LinkedIn blog article, make a YouTube video, and put the URL into your LinkedIn blog.}
Load a pre-trained (TEACHER) model and its tokenizer:
The purpose of this is to get access to the tokens and weights in the TEACHER model.

# For instance, using 'bert-base-uncased' model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
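As a quick sanity check for the statement above, you can inspect the teacher's tokens and weights directly. This is a minimal sketch, assuming the tokenizer and model loaded in the previous cell:

# Inspect the teacher's tokenizer and weights (assumes tokenizer/model from the cell above)
tokens = tokenizer.tokenize("Star Trek explores the final frontier")
print(tokens)          # sub-word tokens produced by the BERT tokenizer
print(len(tokenizer))  # vocabulary size (about 30,522 entries for bert-base-uncased)

num_params = sum(p.numel() for p in model.parameters())
print(f"Teacher parameters: {num_params:,}")  # roughly 110 million for bert-base-uncased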
I encourage you to research the other models available on the Hugging Face Hub to find the best model for your assignment and project's use cases.

Here are some of the teacher models available on Hugging Face that you can use to train your own student AI machine learning model:

BERT: a base uncased version trained on SQuAD v2. It can be loaded on the Inference API on demand.
DistilBERT: trained by distillation of the pretrained BERT model, meaning it has been trained to predict the same probabilities as the larger model.
GPT-4: used in the GPT4Tools project to generate an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts.
ChatGPT: used as a "teacher" to generate instructional data for other large language models (LLMs).
ViT: used in the task-specific knowledge-distillation guide provided by Hugging Face as a teacher model to distill knowledge to a MobileNet student model.
VisualBERT: a multimodal model for vision-language tasks. It combines BERT with a pretrained object-detection system to extract image features into visual embeddings.
Remember, the specific teacher model you choose should align with the task you want your student model to learn and deliver. For example, if you're interested in text generation, GPT-4 or ChatGPT might be suitable. If you're working on image classification, the ViT model could be a good choice.

Now, let's generate embeddings for a sentence:

import torch
from transformers import AutoTokenizer, AutoModel

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Encode some text
input_text = "Star Trek is an iconic American science fiction media franchise that paints a universe set in the future, primarily from the mid-22nd to late 24th century, where diverse crews of starships and space stations explore the cosmos, engage in political and cultural allegories, and strive for a progressive vision of humanity's future"

encoded_input = tokenizer(input_text, return_tensors='pt')

# Generate embeddings (no gradients needed for inference)
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = model_output.last_hidden_state.squeeze()  # Get the embeddings: one 768-dimensional vector per token

In your documentation, explain what these methods are, what they are doing, and why they work.

print(embeddings)

Output:
tensor([[-0.0824,  0.0667, -0.2880,  ..., -0.3566,  0.1960,  0.5381],
        [ 0.0310, -0.1448,  0.0952,  ..., -0.1560,  1.0151,  0.0947],
        [-0.8935,  0.3240,  0.4184,  ..., -0.5498,  0.2853,  0.1149],
        ...,
        [-0.2812, -0.8531,  0.6912,  ..., -0.5051,  0.4716, -0.6854],
        [-0.4429, -0.7820, -0.8055,  ...,  0.1949,  0.1081,  0.0130],
        [ 0.5570, -0.1080, -0.2412,  ...,  0.2817, -0.3996, -0.1882]])

You now have embeddings for your input text.

Next steps: use the embeddings for various NLP tasks, for example:

similarity calculation (see the sketch below)

clustering
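Here is a minimal sketch of a similarity calculation. It mean-pools the token embeddings from the model above into one sentence vector and compares two sentences with cosine similarity; the example sentences and the mean-pooling choice are illustrative assumptions, not a prescribed part of the lab.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def sentence_embedding(text):
    # Mean-pool the token embeddings into a single sentence vector
    encoded = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        output = model(**encoded)
    return output.last_hidden_state.mean(dim=1).squeeze()

emb_a = sentence_embedding("The Enterprise explores strange new worlds.")
emb_b = sentence_embedding("A starship travels the galaxy seeking new civilizations.")

similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")  # closer to 1.0 means more similar

Clustering works the same way: compute one vector per document and feed the vectors to any clustering algorithm (for example, k-means).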


4. Exploring Hugging Face Spaces

Hugging Face Spaces is a platform where you can share and experiment with machine learning models in the browser, powered by Hugging Face's transformers and datasets libraries.
Visit the Hugging Face Spaces at https://huggingface.co/spaces to explore different models and their applications.

5. Creating an Embedding Using the Spaces API

Hugging Face Spaces allows you to use models through a REST API, which makes it straightforward to query a model from anywhere. Here's a simple example of how you might access a model's API on Spaces:

import requests

API_URL = "https://api-inference.huggingface.co/models/bert-base-uncased"
headers = {"Authorization": f"Bearer {your_hf_api_token}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Hello, how are you?"
})
Replace {your_hf_api_token} with your actual Hugging Face API token.
Note: The above code will only work if the corresponding model is deployed on Hugging Face Spaces and an API is set up. Ensure you check out how to deploy a model on Spaces at https://huggingface.co/docs.
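If your goal is to get embeddings back from the hosted API rather than from a local model, one common pattern is to query a model that serves the feature-extraction task. This is a minimal sketch under that assumption; the model name sentence-transformers/all-MiniLM-L6-v2 and the expected 384-dimensional output are illustrative, so check the model card for the task it actually exposes.

import requests

# Illustrative model choice; any hosted model that serves feature extraction should behave similarly
API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2"
headers = {"Authorization": f"Bearer {your_hf_api_token}"}  # replace with your real token, as above

def embed(texts):
    # For feature-extraction models, the Inference API returns one embedding per input string
    response = requests.post(API_URL, headers=headers, json={"inputs": texts})
    response.raise_for_status()  # surface HTTP errors (e.g., 401 bad token, 503 model still loading)
    return response.json()

vectors = embed(["Live long and prosper", "Resistance is futile"])
print(len(vectors), len(vectors[0]))  # expect 2 vectors, 384 dimensions each for this model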

Conclusion

In this lab, you learned about embeddings, transformers, and how to use the Hugging Face transformers library to create embeddings in Python. You were also introduced to Hugging Face Spaces and its API for accessing models in a deployed environment. Practice with different models and try deploying your own model to Hugging Face Spaces.
For additional exercises, consider experimenting with different pre-trained models and compare how the embeddings vary for the same text.

Troubleshooting

If you encounter issues, check the following:
Ensure you have the latest version of the transformers library.
Check your API token validity and permissions.
Go through the documentation for any updates or changes in the API.
Always consult the official documentation at https://huggingface.co/transformers and https://huggingface.co/docs for the most up-to-date instructions.

Does this lab use the Teacher-Student approach to training our own AI model?
The lab workbook I provided above does not utilize the Teacher-Student approach to training an AI model. It focuses on teaching students how to generate embeddings using pre-trained models from Hugging Face's Transformers library and accessing models through Hugging Face Spaces API.
The Teacher-Student approach, also known as knowledge distillation, is a technique where a smaller model (the student) is trained to emulate the behavior of a larger, more complex model (the teacher). In this process, the student learns from the output distributions (usually the logits or softened probability distributions) of the teacher model.
To implement a Teacher-Student training approach, the lab can be modified to include this advanced technique. Here's an outline of how that could look:

Modified Lab: Implementing Teacher-Student Approach for Model Training

Additional Objective:

To learn and apply the knowledge distillation technique where a smaller student model is trained to mimic a pre-trained teacher model.

Additional Prerequisites:

Deeper understanding of neural network architecture and training
PyTorch or TensorFlow framework experience

Additional Lab Steps:

Introduction to Knowledge Distillation: Students will learn about the Teacher-Student mechanism and how it can be used to create smaller, more efficient models that retain a significant portion of the original model's accuracy.
Selection of Teacher Model: Students will select an appropriate large, pre-trained model as the teacher. For example, using BERT as a teacher model.
Selection of Student Model: Students will choose or design a smaller architecture for the student model that will learn from the teacher.
Training with Knowledge Distillation:
Setting Temperature: Students will understand and apply the concept of temperature in knowledge distillation to soften the probability distributions of the teacher's output.
Loss Function: Students will learn to use a distillation loss function that typically combines cross-entropy for the correct labels and a divergence measure between the teacher's and student's predicted probabilities (a short sketch of this combined loss follows this list).
Training Loop: Students will implement the training loop using PyTorch or TensorFlow, fetching data, feeding it through both teacher and student models, and updating the student model's parameters based on the combined loss function.
Evaluation and Analysis: The student model's performance will be evaluated against the teacher model to understand the trade-offs and effectiveness of knowledge distillation.
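As a reference for the Loss Function step above, here is a minimal sketch of a combined distillation loss, assuming you already have the true labels plus teacher and student logits for a batch; the weighting alpha and temperature T are illustrative defaults.

import torch
import torch.nn.functional as F

def combined_distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Standard cross-entropy against the true (hard) labels
    ce_loss = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened teacher and student distributions
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean'
    ) * (T * T)
    # Weighted combination: alpha balances distillation vs. hard-label learning
    return alpha * kd_loss + (1 - alpha) * ce_loss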

Suggested Tools and Libraries:

transformers for pre-trained teacher models
PyTorch's torch.nn or TensorFlow's tf.keras for student model implementation
torch.optim or tf.train for optimization

Additional Conclusion:

Students will have developed an understanding of and practical experience with the knowledge distillation process and will have completed a hands-on project where they have distilled knowledge from a large teacher model to a smaller student model.
This approach requires a more advanced understanding of machine learning models, optimization, and training techniques. It also involves in-depth coding and debugging sessions to ensure the student model successfully learns from the teacher. The instructions above provide a general framework, and more detailed step-by-step code would be necessary to ensure a seamless learning experience.

Teacher-Student Training Model Lecture and Lab for Google Colab [This is the model to do your Project]

Lecture Notes

What is Knowledge Distillation?

Knowledge distillation is a technique where a smaller, simpler neural network (the student) is trained to reproduce the behavior of a larger, more powerful network (the teacher). The objective is to transfer the knowledge from the teacher to the student so that the student can make accurate predictions with less computational cost.

Why Use Knowledge Distillation?

Knowledge distillation is beneficial when you need a model that is efficient enough to be deployed on devices with limited computational resources, such as mobile phones or embedded systems, without significantly compromising accuracy.

How Does Knowledge Distillation Work?

In knowledge distillation:
The teacher model's output (softened logits) is used as a target for training the student model.
A temperature parameter T is applied to soften the probabilities, which helps transfer the "dark knowledge" (i.e., the information contained in the output distribution beyond the highest-probability class) from the teacher to the student. A small numeric illustration follows this list.
The student learns by minimizing a loss function that is usually a combination of the standard prediction loss (e.g., cross-entropy with the true labels) and a distillation loss that measures the divergence between the student's outputs and the softened teacher outputs.
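To make the temperature idea concrete, here is a minimal sketch that softens a made-up logit vector at a few temperatures; the logits are illustrative, not from any real model.

import torch
import torch.nn.functional as F

logits = torch.tensor([4.0, 1.0, 0.5])  # made-up teacher logits for a 3-class problem

for T in (1.0, 2.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

# As T grows, the distribution flattens, exposing how the teacher ranks the non-top
# classes (the "dark knowledge" the student can learn from).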

Lab Code Instructions

Below is an example outline for a knowledge distillation lab for Google Colab. Make sure to provide additional explanations and resources to help students understand each component of the lab.
Step 1: Setup Environment in Google Colab
Students should start a new notebook in Google Colab and install the necessary libraries by executing the following code in the notebook:
!pip install torch torchvision transformers
Step 2: Import Libraries

import torch
from torch import nn
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import logging

logging.set_verbosity_error() # To reduce unnecessary output

# Check if we have a GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Step 3: Load Teacher Model
# Load pre-trained DistilBert as the teacher model
teacher_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
teacher_model.to(device).eval() # Use eval mode for inference purposes
Step 4: Define Student Model
Explain to the students that the student model can be a smaller transformer or any architecture that they believe can learn from the teacher. For simplicity, below is an example of a small neural network as a student model.

class StudentModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(StudentModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Define student model
input_size = 768   # For BERT, this is the size of the hidden state
hidden_size = 256  # Smaller hidden size to reduce complexity
num_classes = 2    # Binary classification

student_model = StudentModel(input_size, hidden_size, num_classes).to(device)
Step 5: Load Dataset
Here, explain how to load a dataset that is relevant to the task. For NLP tasks, students should tokenize and encode the dataset properly using the tokenizer that matches the teacher model; one way to build a dataloader that the distillation loop can consume is sketched below.
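Below is a minimal sketch of one way to prepare such a dataloader for the training loop in Step 6. The CSV path, column names, batch size, and labels are illustrative assumptions; adapt them to your actual dataset.

import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Illustrative CSV with 'text' and 'label' columns; replace with your own dataset
df = pd.read_csv('my_dataset.csv')

encodings = tokenizer(
    df['text'].astype(str).tolist(),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors='pt'
)
labels = torch.tensor(df['label'].tolist())

dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'], labels)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)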
Step 6: Knowledge Distillation
Walk students through the knowledge distillation training loop. Students should use the teacher model to generate labels for training the student.

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def distillation_loss(student_logits, teacher_logits, T):
    teacher_probs = nn.functional.softmax(teacher_logits / T, dim=-1).to(device)
    student_log_probs = nn.functional.log_softmax(student_logits / T, dim=-1).to(device)
    distillation = nn.functional.kl_div(student_log_probs, teacher_probs, reduction='sum') * (T * T) / student_logits.shape[0]
    return distillation

optimizer = Adam(student_model.parameters(), lr=2e-5)

# num_epochs and dataloader are assumed to be defined earlier (see Step 5)
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass through the teacher model (fill in the inputs for your batch)
        teacher_logits = teacher_model(...)

        # Forward pass through the student model (fill in the inputs for your batch)
        student_logits = student_model(...)

        # Compute and backpropagate distillation loss
        loss = distillation_loss(student_logits, teacher_logits, T=2)  # Experiment with temperature
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Explain each part of the code to students, including why and how temperature T is used, and the reason for using KL divergence for the distillation loss.
Step 7: Evaluation
Guide students to evaluate the student model's performance by comparing it to the teacher's performance and a baseline accuracy; a minimal sketch of such a comparison follows.
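Here is a minimal sketch of such a comparison. It assumes the dataloader layout from the Step 5 sketch (input_ids, attention_mask, labels) and that the student consumes mean-pooled hidden states from the teacher's encoder; adapt it to however your student is actually fed, and evaluate on a held-out test split rather than the training data.

def accuracy(model_fn, dataloader):
    correct, total = 0, 0
    with torch.no_grad():
        for input_ids, attention_mask, labels in dataloader:
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)
            logits = model_fn(input_ids, attention_mask)
            preds = logits.argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Teacher: a DistilBERT sequence classifier returns an object with a .logits attribute
teacher_acc = accuracy(lambda ids, mask: teacher_model(input_ids=ids, attention_mask=mask).logits, dataloader)

# Student: here assumed to consume mean-pooled hidden states from the teacher's encoder
def student_fn(ids, mask):
    hidden = teacher_model.distilbert(input_ids=ids, attention_mask=mask).last_hidden_state
    return student_model(hidden.mean(dim=1))

student_acc = accuracy(student_fn, dataloader)
print(f"Teacher accuracy: {teacher_acc:.3f}, Student accuracy: {student_acc:.3f}")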
Conclusion
Summarize the learning outcomes, encourage experimentation with different architectures and temperatures, and provide guidance on how to further improve the student model's performance.
Please test and validate the instructions and code in Google Colab before sharing them with students, to guarantee a seamless first-run experience.


To load the Star Trek scripts dataset from Kaggle into a Google Colab notebook, you need to authenticate with Kaggle, download the dataset, and load it for use in your machine learning models. The steps below outline this process with code examples; test them in your own Google Colab environment.

Step 0: Prerequisites
Before you start, ensure that you have a Kaggle account and an API token (kaggle.json) obtained from the Kaggle website. You will need this to interact with the Kaggle API.
Step 1: Install the Kaggle API client
!pip install kaggle
Step 2: Upload the Kaggle API Token
In Google Colab, upload your kaggle.json file containing your API credentials. You can do this using the Colab file upload feature in the left sidebar.
Step 3: Set up Kaggle API Credentials

import os

# Make a directory for Kaggle API Token
os.makedirs('/root/.kaggle', exist_ok=True)

# Copy the kaggle.json to the folder expected by the Kaggle API client
!cp /content/kaggle.json /root/.kaggle/

# Secure the file
!chmod 600 ~/.kaggle/kaggle.json
Step 4: Download the Star Trek Scripts Dataset
You'll need to know the exact slug from Kaggle for the dataset you wish to download.
!kaggle datasets download -d gpreda/star-trek-scripts -p /content
Step 5: Unzip the Dataset Files
!unzip /content/star-trek-scripts.zip -d /content/star-trek-scripts
Step 6: Load and Process the Dataset
Now, depending on the format of the Star Trek scripts dataset (which is usually in CSV or JSON format), you can load the data using Pandas. Let's assume it's a CSV for this example.

import pandas as pd

# Load the dataset into a pandas dataframe
data_path = '/content/star-trek-scripts/all_series_lines.csv'
df = pd.read_csv(data_path)

# Check the first few entries in the dataframe
print(df.head())
Step 7: Preprocess and Tokenize the Data
The next step is to preprocess and tokenize the data, which is necessary for working with text data in NLP models. Here's an example using DistilBertTokenizer to tokenize script lines for a classification task:

from transformers import DistilBertTokenizer

# Initialize the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Let's assume we want to consider only the 'line' column for tokenization
text_data = df['line'].astype(str).tolist()  # Convert the column to a list of strings
max_length = 128  # or any other value suitable for your task

# Tokenize the text data
encoded_data = tokenizer(
    text_data,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors='pt'
)

# Now, `encoded_data` contains input_ids, attention_mask, and other components
# required by transformers models.
Now you have the Star Trek scripts dataset loaded and tokenized in your Google Colab notebook. You can proceed with splitting the data into training and testing sets (a minimal sketch follows) and then use them with your teacher-student model setup. Remember to test and adapt each code snippet according to the details of the dataset, such as file names and the structure of the text.
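A minimal sketch of the split, assuming the df and encoded_data built above (index-based splitting keeps the rows aligned across the different tensors):

from sklearn.model_selection import train_test_split

# Split row indices so the same rows end up together across tensors
indices = list(range(len(df)))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

train_inputs = encoded_data['input_ids'][train_idx]
train_masks = encoded_data['attention_mask'][train_idx]
test_inputs = encoded_data['input_ids'][test_idx]
test_masks = encoded_data['attention_mask'][test_idx]

print(train_inputs.shape, test_inputs.shape)  # e.g. (N_train, 128) and (N_test, 128)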