Google Colab: Build your own generative AI language model trained on your own data from a text file.

Learning Outcome:

Use Google Colab to build your own ChatGPT-style model trained on your own data (stored in Google Drive)

Here is a workbook on how to create a custom generative AI language model by fine-tuning GPT-2 with the Hugging Face Transformers library.
For the purpose of this guide, I will assume you are using Google Colab and a text file stored in your Google Drive, but you can easily modify the steps to work in a local environment or with data from other sources.

If you get a "file not found" error when opening your file, the cause is usually an incorrect file path. In Google Colab, when you mount your drive, file paths should start with '/content/drive/My Drive/'. A common mistake is leaving out the 'My Drive' part of the path.

Moreover, Google Drive file paths should include any folders that the file is in. If the file is directly in your drive, without any folders, the correct path would be '/content/drive/My Drive/traininga.gdoc'.
Here's the correct code to read a file:
with open('/content/drive/My Drive/traininga.gdoc', 'r') as f:
    text = f.read()

For a plain-text file, the complete snippet, including mounting your Drive, looks like this:
from google.colab import drive
drive.mount('/content/drive')

with open('/content/drive/My Drive/traininga.txt', 'r') as f:
    text = f.read()
Google Colab might not be able to read Google Docs files directly, as they are not plain text files. If 'traininga.gdoc' is a Google Docs file, you should first export it as a .txt file, upload that to your Drive, and then read it in your Colab notebook.
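
A small guard before reading can turn the two failure modes described above (a wrong path, or a Google Docs file instead of plain text) into clearer error messages. The sketch below is one possible way to do that, not part of Colab itself: the helper name `read_training_text` is hypothetical, and the Google Docs check relies only on the `.gdoc` file extension.

```python
from pathlib import Path


def read_training_text(path):
    """Read a plain-text training file, failing early with a clear message.

    Hypothetical helper for illustration; not part of Colab or any library.
    """
    p = Path(path)
    # A .gdoc entry is a Google Docs document, not a plain-text file.
    if p.suffix.lower() == ".gdoc":
        raise ValueError(f"{p.name} is a Google Docs file; export it as .txt first.")
    # A missing file usually means a folder name is wrong or left out of the path.
    if not p.is_file():
        raise FileNotFoundError(f"No file at {p}; check the folder names in the path.")
    return p.read_text(encoding="utf-8")
```

In a Colab notebook, after mounting your Drive, you could call it as `text = read_training_text('/content/drive/My Drive/traininga.txt')`.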
Here's how to use Google Colab with a text file stored in your Google Drive:

Upload the text file to Google Drive:
First, upload your text file to Google Drive. Open Google Drive in your browser, click the "+ New" button on the left, and choose "File upload". Navigate to the file on your computer, select it, and click "Open" to upload it.
Open a notebook in Google Colab:
Open Google Colab in your browser, then click "File" -> "New notebook".
Mount Google Drive in your Colab environment:
In a new cell in your Colab notebook, enter the following code and run the cell:
from google.colab import drive
drive.mount('/content/drive')
When you run the cell, Colab asks for permission to access your Google Drive. Follow the prompts to sign in and authorize access. Depending on the Colab version, you may instead be given a URL to visit and a code to paste back into your notebook.
Access your text file:
Now that your Google Drive is mounted, you can access your files as if they were local.
Suppose you uploaded a file named 'myfile.txt' directly to your Google Drive (not inside any folder). You could load it into a Python string with:
with open('/content/drive/My Drive/myfile.txt', 'r') as f:
    text = f.read()
If your file is inside a folder, you should include the folder's name in the path. For instance, if 'myfile.txt' is inside a folder named 'myfolder', you would open it with:

with open('/content/drive/My Drive/myfolder/myfile.txt', 'r') as f:
    text = f.read()
Proceed with training your model:
You can now use this data to train your model, as detailed in the steps below.
Remember, the time it takes to read the file from Google Drive can be a bit long depending on the file's size. Be patient if it seems like the operation has stalled, especially for large files.
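
For large files, it can help to know how big the file is and to read it one line at a time rather than all at once. The following is a minimal sketch under that assumption; the function names `load_lines` and `describe_file` are hypothetical and chosen only for this illustration.

```python
import os


def load_lines(path, encoding="utf-8"):
    """Return the non-empty lines of a text file, reading it line by line."""
    lines = []
    with open(path, "r", encoding=encoding) as f:
        for line in f:  # iterates lazily instead of loading the whole file at once
            line = line.strip()
            if line:
                lines.append(line)
    return lines


def describe_file(path):
    """Report the file size in megabytes and the number of non-empty lines."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return {"size_mb": round(size_mb, 3), "lines": len(load_lines(path))}
```

Calling `describe_file` on your Drive path before training gives a rough idea of how long loading and training are likely to take.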


Step 1: Setup
First, you will need to install the necessary libraries. This includes the Transformers library for the model and the Datasets library for handling data.
!pip install transformers
!pip install datasets

Step 2: Prepare Your Data
We assume you have a text file with your own data. The text file should be in a format where each line is a separate sentence or paragraph.
Load this text file and split it into a training and validation set.
We will use 90% of the data for training and 10% for validation.
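from datasets import load_dataset, DatasetDict

# Load the whole file as a single dataset (one example per line)
raw_dataset = load_dataset('text', data_files='path_to_your_text_file.txt', split='train')

# Split the data: 90% for training, 10% held out
split = raw_dataset.train_test_split(test_size=0.1, seed=42)

# Expose the held-out "test" split as "validation" so later steps can use dataset["validation"]
dataset = DatasetDict({'train': split['train'], 'validation': split['test']})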
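
To make the 90/10 split concrete, here is a minimal plain-Python sketch of the same idea applied to a list of lines. It does not use the Datasets library; the function name `split_lines` and the fixed seed are assumptions made for this illustration.

```python
import random


def split_lines(lines, validation_fraction=0.1, seed=42):
    """Shuffle lines and split them into (train, validation) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


lines = [f"sentence {i}" for i in range(100)]
train, validation = split_lines(lines)
print(len(train), len(validation))  # 90 10
```

Shuffling before splitting keeps the validation set from consisting only of the last lines of the file.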

Step 3: Tokenization
Next, we need to tokenize the data so it can be used by the model. This involves converting the text into a format the model can understand.
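from transformers import GPT2Tokenizer

# Initialize the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Tokenize the data, truncating lines longer than the model's maximum input length
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

# Apply tokenization to the data and drop the raw text column
dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Drop blank lines, which produce no tokens
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 0)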
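
To see what "a format the model can understand" means, the toy sketch below maps words to integer ids and back. It is only an illustration: the real GPT-2 tokenizer splits text into sub-word pieces using a fixed, pre-built vocabulary, whereas this example assumes a simple whitespace split, and the function names are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct whitespace-separated word."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn a string into a list of token ids."""
    return [vocab[word] for word in text.split()]


def decode(ids, vocab):
    """Turn token ids back into a string."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


vocab = build_vocab(["the cat sat", "the dog sat"])
ids = encode("the dog sat", vocab)
print(ids)                 # [0, 3, 2]
print(decode(ids, vocab))  # the dog sat
```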

Step 4: Model Preparation
Now that our data is prepared, we can initialize the model. We will use a GPT-2 model, but you can substitute any other Transformer model that is designed for language generation.
from transformers import GPT2LMHeadModel

# Initialize the model
model = GPT2LMHeadModel.from_pretrained('gpt2')

Step 5: Training
We're ready to train our model! Let's initialize a Trainer and start the training.
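from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# GPT-2 has no padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# Pad each batch and copy the input ids into labels for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Specify the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)

# Train the model
trainer.train()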
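
Before starting a long run, it can be useful to estimate how many optimization steps the arguments above imply and whether `save_steps` will ever be reached. The sketch below is a rough, single-device estimate in plain Python; the function name `estimate_training` is hypothetical, and the exact step count reported by the Trainer may differ slightly.

```python
import math


def estimate_training(num_examples, epochs=3, batch_size=1,
                      gradient_accumulation_steps=1, save_steps=10_000):
    """Estimate optimizer steps and intermediate checkpoints for one device."""
    examples_per_step = batch_size * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / examples_per_step)
    total_steps = steps_per_epoch * epochs
    checkpoints = total_steps // save_steps  # checkpoints written during training
    return {"total_steps": total_steps, "checkpoints": checkpoints}


print(estimate_training(num_examples=900))
# {'total_steps': 2700, 'checkpoints': 0}
```

With a small text file, the total number of steps can stay below `save_steps`, in which case no intermediate checkpoint is written during training; a smaller `save_steps` value may be worth considering.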

Step 6: Testing
After training, you can use your model to generate text. Here's an example:
from transformers import pipeline

# Initialize a pipeline with your trained model (the tokenizer must be passed explicitly
# when the model is given as an object rather than a name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Generate a text
result = generator('My custom model says,')[0]
print(result['generated_text'])
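
Text generation typically works by repeatedly choosing a next token from the scores the model assigns to each candidate. The sketch below illustrates that single step with temperature-scaled sampling in plain Python. It is a simplified illustration, not the pipeline's actual implementation; the tokens, scores, and function names are made up for the example.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(tokens, scores, temperature=1.0, rng=random):
    """Pick one token at random, weighted by its softmax probability."""
    probs = softmax(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["hello", "world", "!"]
scores = [2.0, 1.0, 0.1]  # made-up scores, for illustration only
print(softmax(scores))
print(sample_next_token(tokens, scores, temperature=0.7, rng=random.Random(0)))
```

A lower temperature concentrates probability on the highest-scoring token, which makes the output more predictable; a higher temperature makes it more varied.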

That's it! You now have a custom generative AI language model trained on your own data.
Please note that training large language models like GPT-2 can require a significant amount of computational resources. If you experience memory errors or other issues during training, consider using a smaller model, reducing the batch size, or increasing the gradient accumulation steps.
Also, always remember that the quality of your model largely depends on the quality and the size of the data you train it on. A larger and well-prepared dataset often leads to a better model.

Below is the full code that you can copy and paste into a single cell in Google Colab:

# Step 1: Setup
!pip install transformers
!pip install datasets

# Step 2: Mount Google Drive to access your data
from google.colab import drive
drive.mount('/content/drive')

# Step 3: Prepare Your Data
from datasets import load_dataset, DatasetDict
# Replace 'path_to_your_text_file.txt' with the actual path of your text file in Google Drive
data_file = '/content/drive/My Drive/path_to_your_text_file.txt'
raw_dataset = load_dataset('text', data_files=data_file, split='train')
# Hold out 10% of the lines and expose them under the name "validation"
split = raw_dataset.train_test_split(test_size=0.1, seed=42)
dataset = DatasetDict({'train': split['train'], 'validation': split['test']})

# Step 4: Tokenization
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# GPT-2 has no padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Truncate lines longer than the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# Drop blank lines, which produce no tokens
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 0)

# Step 5: Model Preparation
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Step 6: Training
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
# Pad each batch and copy the input ids into labels for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)

trainer.train()

# Step 7: Testing
from transformers import pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = generator('My custom model says,')[0]
print(result['generated_text'])

Remember to replace 'path_to_your_text_file.txt' with the actual path to your text file in Google Drive. And note, the time to execute this code will depend on the size of your text file and the training configuration you choose.