
Lab Notebook: Creating a Generative Language Model Using the Gutenberg Corpus: Python Code Lab

This notebook can serve as a template for your project.

Objective

In this lab, we will create a generative language model using the Hugging Face Transformers library and train it on the Gutenberg Corpus.
We will then learn how to use the model to generate text from user queries. (In simplified form, this is what the chat interface of ChatGPT does.)
Hugging Face Transformers is a popular open-source library for natural language processing (NLP) tasks such as text classification, question answering, and language translation. It provides pre-trained models for various Natural Language Processing tasks, as well as tools for fine-tuning and training custom models.
Here's an example of how to use the Hugging Face Transformers library in Python to perform language translation in just a few lines of code.

from transformers import pipeline
translator = pipeline("translation_en_to_fr")
result = translator("Hello, world!")
print(result)
In this example, we first import the pipeline function from the transformers module.
We then create a translator pipeline using the pipeline function and specifying the translation direction as English to French.
Finally, we use the translator pipeline to translate the input text "Hello, world!" and print the result.
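The pipeline returns a list of dictionaries. The exact wording depends on the model version that the pipeline downloads, so treat this output as illustrative rather than verbatim:

[{'translation_text': 'Bonjour, le monde!'}]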
As an aside, the ability of chat systems to speak in various personalities (e.g., Isaac Newton) suggests that persona behavior is produced by conditioning a generative pipeline on a persona prompt, or by purpose-built fine-tuned models. Essay topic: investigate and discuss; a minimal sketch appears at the end of this section.

The same pipeline API supports many other NLP tasks with just a few lines of code. Here's an example of how to use it to perform text classification:

from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This movie was terrible!")
print(result)
In this example, we first create a text classification pipeline by specifying the task ("text-classification") and the model (distilbert-base-uncased-finetuned-sst-2-english). We then use the classifier pipeline to classify the input text "This movie was terrible!" and print the result.
Overall, the Hugging Face Transformers library provides a powerful and flexible set of tools for NLP tasks, with a wide range of pre-trained models and a user-friendly API for fine-tuning and training custom models.
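On the persona question raised above, here is a minimal sketch. It assumes nothing beyond the freely available gpt2 checkpoint; real chat systems use much larger, instruction-tuned models, but the basic mechanism of prefixing the prompt with persona instructions is the same idea:

from transformers import pipeline

# A hedged sketch: the persona lives in the prompt, not in a special model.
# "gpt2" is only an illustrative stand-in for a stronger generative model.
generator = pipeline("text-generation", model="gpt2")
prompt = "You are Isaac Newton. Explain gravity in your own words: "
result = generator(prompt, max_length=80, num_return_sequences=1)
print(result[0]["generated_text"])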

What is a transformer library?

A transformer library is a type of library used for natural language processing (NLP) tasks such as text classification, question answering, and language translation.
It is based on the transformer architecture, introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.).
The paper is freely available from several sources:
• The NeurIPS proceedings page: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
• The arXiv preprint: https://arxiv.org/abs/1706.03762
• A review of the paper with a link to the original is available on GitHub: https://gogl3.github.io/attention-is-all-you-need/
The transformer architecture is a type of sequence-to-sequence (Seq2Seq) model that uses self-attention mechanisms to process input sequences and generate output sequences.
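To make "self-attention" concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the paper. The shapes and weight matrices are illustrative assumptions; this is a single head with no masking, not the full multi-head architecture:

import torch
import torch.nn.functional as F

# Minimal single-head self-attention sketch.
def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv               # project tokens to queries/keys/values
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of every token pair
    weights = F.softmax(scores, dim=-1)            # each token attends to all tokens
    return weights @ V                             # weighted mix of value vectors

x = torch.randn(5, 16)   # a toy sequence: 5 tokens with 16-dim embeddings
Wq, Wk, Wv = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # torch.Size([5, 16])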
Transformer libraries provide pre-trained models for various NLP tasks, as well as tools for fine-tuning and training custom models.
A prominent example is the Hugging Face Transformers library (whose earlier releases were published under the name pytorch-transformers). Such libraries provide a wide range of pre-trained models such as BERT, RoBERTa, and GPT, which can be used to solve many NLP tasks.
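As a concrete taste of what "pre-trained models" means, here is a short sketch that loads a BERT checkpoint by name and encodes a sentence; bert-base-uncased is a standard public checkpoint on the Hugging Face Hub:

from transformers import AutoTokenizer, AutoModel

# Load a pre-trained checkpoint by its Hub identifier.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers use self-attention.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_size)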

Prerequisites

Python 3.6 or higher
PyTorch
Hugging Face Transformers
Datasets library (Hugging Face)
tqdm

Table of Contents

Setup
Preprocessing the Gutenberg Corpus
Training the Model
Querying the Model
Conclusion

1. Setup


First, let's install the required libraries.

pip install torch transformers datasets tqdm


Now, let's import the necessary modules.

import torch
from torch.utils.data import Dataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
from tqdm import tqdm


2. Preprocessing the Gutenberg Corpus

Download the Gutenberg dataset using the load_dataset function. (Note: this assumes a dataset with the identifier "gutenberg" is available on the Hugging Face Hub; if the name differs in your environment, substitute the identifier of any Project Gutenberg text dataset.)
gutenberg_dataset = load_dataset("gutenberg")

Next, let's tokenize the text using the GPT-2 tokenizer. GPT-2 ships without a padding token, so we reuse its end-of-sequence token for padding. We'll also create a custom dataset class to handle the tokenized text.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

class GutenbergDataset(Dataset):
    def __init__(self, texts, tokenizer, block_size=512):
        self.tokenizer = tokenizer
        self.inputs = []
        for text in tqdm(texts, desc="Tokenizing"):
            ids = tokenizer(text)["input_ids"]
            # Split the token stream into fixed-length blocks so that each
            # item is a full training sequence rather than a single token id.
            for i in range(0, len(ids) - block_size + 1, block_size):
                self.inputs.append(ids[i:i + block_size])

    def __getitem__(self, idx):
        return self.inputs[idx]

    def __len__(self):
        return len(self.inputs)


Now, let's create the dataset for training. (The column name "text" depends on the dataset you loaded; adjust it if yours differs. The Trainer used below handles batching internally, so we don't need to build a DataLoader ourselves.)
train_dataset = GutenbergDataset(gutenberg_dataset["train"]["text"], tokenizer)
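As an optional sanity check, you can confirm the blocks look sensible by decoding the first one back to text (the output will vary with the dataset):

print(len(train_dataset), "training blocks")
print(tokenizer.decode(train_dataset[0][:50]))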

3. Training the Model

First, let's configure our GPT-2 model and set up the training arguments.
config = GPT2Config.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", config=config)

training_args = TrainingArguments(
    output_dir="./gpt2_gutenberg",    # where checkpoints are written
    overwrite_output_dir=True,
    num_train_epochs=1,               # one pass over the corpus
    per_device_train_batch_size=8,
    save_steps=10_000,                # checkpoint every 10k steps
    save_total_limit=2,               # keep only the two newest checkpoints
    logging_steps=500
)



Next, let's define the data collator, which batches examples and builds the training labels. Setting mlm=False selects causal (left-to-right) language modeling rather than BERT-style masked language modeling.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
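If you want to see what the collator produces, you can call it directly on a couple of examples; for causal LM the labels mirror the input ids. This is an optional check, not part of training:

batch = data_collator([train_dataset[0], train_dataset[1]])
print(batch["input_ids"].shape, batch["labels"].shape)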


Now, we can create the trainer and train the model.

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset
)

trainer.train()



Finally, let's save the trained model.
trainer.save_model("./gpt2_gutenberg")
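Saving the tokenizer alongside the model is optional here (we never modified it), but it makes the output directory self-contained:

tokenizer.save_pretrained("./gpt2_gutenberg")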

4. Querying the Model

First, let's load the saved model and tokenizer. (Loading the tokenizer from "gpt2" would work equally well, since we never changed it.)

model = GPT2LMHeadModel.from_pretrained("./gpt2_gutenberg")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2_gutenberg")

Now, let's define a function to generate text from a prompt. We enable sampling (do_sample=True) so that requesting more than one sequence produces distinct continuations.
def generate_text(prompt, model, tokenizer, max_length=50, num_return_sequences=1):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        do_sample=True,              # sample instead of greedy decoding
        no_repeat_ngram_size=2,      # avoid verbatim short loops
        pad_token_id=tokenizer.eos_token_id
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in output]

Finally, let's present a query to the model. (Keep in mind that this model continues text in the style of its training data; it is not instruction-tuned, so expect prose continuation rather than a direct answer.)
query = "What is the importance of literature in society?"
generated_text = generate_text(query, model, tokenizer)
print(generated_text[0])
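Because generation is sampled, you can also request several diverse continuations at once:

for text in generate_text("It was a dark and stormy night", model, tokenizer, num_return_sequences=3):
    print(text)
    print("---")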

5. Conclusion


In this lab, we created a generative language model by fine-tuning GPT-2 on the Gutenberg Corpus with the Hugging Face Transformers library.
We also learned how to generate text from user prompts.
The model can be further fine-tuned and optimized for the desired application and dataset; turning it into a true chat assistant, as in ChatGPT, would additionally require instruction tuning.
