Building the Simplest LLM with Jupyter Notebook: A Student's Guide

This lab book is the starting point of your assignment.
Tooling setup:
We will use Anaconda Python, which includes Jupyter Notebook.

Learning outcomes:

How to build the simplest LLM with Jupyter Notebook. You will build a toy version: a prototype, proof of concept, or minimum viable product (MVP).


In this lab notebook, you will learn how to build the simplest language model using Jupyter Notebook. We call it an LLM throughout, although it is a toy stand-in for a large language model.
We will use Python and the nltk library to create a basic language model. This is a minimum viable product (MVP), designed to be as simple as possible while still providing a complete implementation template and set of recipes.

Table of Contents

Introduction to Language Models
Setting up Jupyter Notebook
Importing Libraries
Preparing the Dataset
Tokenization
N-Gram Model
Generating Text
Conclusion

1. Introduction to Language Models

A language model is a probabilistic model that assigns a likelihood to a sequence of words, or equivalently predicts which word is likely to come next given the words before it.
It is commonly used in natural language processing (NLP) tasks such as speech recognition, machine translation, and text generation.
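
To make the idea concrete, here is a minimal sketch of how a model could score a short word sequence by multiplying conditional probabilities, using only the previous word as context. The table cond_prob and the helper sequence_probability are hypothetical, and the numbers are invented for illustration rather than estimated from data.

```python
# Hypothetical conditional probabilities P(next_word | previous_word).
# These numbers are invented for illustration only.
cond_prob = {
    ("<s>", "the"): 0.5,
    ("the", "princess"): 0.2,
    ("princess", "was"): 0.4,
    ("was", "kind"): 0.25,
}

def sequence_probability(words, table):
    """Multiply P(word | previous word) over the sequence (bigram approximation)."""
    prob = 1.0
    previous = "<s>"  # start-of-sequence marker
    for word in words:
        prob *= table.get((previous, word), 0.0)  # unseen pairs get probability 0
        previous = word
    return prob

p = sequence_probability(["the", "princess", "was", "kind"], cond_prob)
print(p)
```

A sequence containing any word pair missing from the table scores zero, which is one reason practical models need a lot of text.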

2. Setting up Jupyter Notebook

To get started, you need Jupyter Notebook on your computer. Follow these steps:
Install Anaconda Python: Download and install the latest version of Anaconda Python from the Anaconda website.
Install Jupyter Notebook: Anaconda already includes Jupyter Notebook. If it is missing from your environment, open a terminal/command prompt and run the following command:
pip install jupyter
Launch Jupyter Notebook: Type jupyter notebook in your terminal/command prompt, and a new browser window should open with the Jupyter Notebook interface.
Create a new Python notebook by clicking on the "New" button and selecting "Python 3".

3. Importing Libraries

In this lab, we will use the Natural Language Toolkit (nltk) library. To install it, open a new cell in your Jupyter Notebook and run the following:
!pip install nltk
Now, import the necessary libraries:
import nltk
import random
from nltk.util import ngrams
from collections import defaultdict, Counter

4. Preparing the Dataset

For our simple LLM, we will use a sample text. You can replace this with your own dataset if desired. Paste the following code in a new cell:
sample_text = """
Once upon a time, in a land far, far away, there lived a king and queen who had a beautiful daughter. The princess was kind and gentle, and everyone loved her.
"""

5. Tokenization

Tokenization is the process of breaking a text into individual words or tokens.
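We will use the nltk.word_tokenize() function to tokenize our sample text. It relies on the Punkt tokenizer data, which must be downloaded once. Run the following code:
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
tokens = nltk.word_tokenize(sample_text.lower())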
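
To see roughly what tokenization does, here is a minimal sketch that uses only a regular expression. The helper simple_tokenize is hypothetical and is only a rough stand-in: nltk.word_tokenize handles more cases (contractions, for example) and may split some text differently.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text.lower())

example_tokens = simple_tokenize("Once upon a time, in a land far, far away.")
print(example_tokens)
```

Each punctuation mark becomes its own token, so the model can learn, for instance, which words tend to follow a comma.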

6. N-Gram Model

An N-gram is a contiguous sequence of n items from a given sample of text.
We will create a simple bigram model (n=2) for our LLM. Run the following code in a new cell:
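bigrams = list(ngrams(tokens, 2))
bigram_freq = defaultdict(Counter)

# Count how often each word w2 follows each word w1
for w1, w2 in bigrams:
    bigram_freq[w1][w2] += 1

This code builds a dictionary that maps each word to a Counter of the words that follow it, together with their frequencies.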
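
The counts become a probabilistic model once they are normalized. The following is a minimal sketch, assuming a tiny hand-made token list; toy_tokens, toy_bigram_freq, and next_word_probabilities are hypothetical names used only for this illustration.

```python
from collections import Counter, defaultdict

# A tiny hand-made token list (an assumption for illustration)
toy_tokens = ["the", "king", "and", "the", "queen", "and", "the", "king"]

toy_bigram_freq = defaultdict(Counter)
for w1, w2 in zip(toy_tokens, toy_tokens[1:]):
    toy_bigram_freq[w1][w2] += 1

def next_word_probabilities(word, freq):
    """Turn the follower counts for `word` into relative frequencies."""
    followers = freq.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

probs = next_word_probabilities("the", toy_bigram_freq)
print(probs)
```

Here "the" is followed by "king" twice and by "queen" once, so the estimated probabilities are 2/3 and 1/3.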

7. Generating Text

Now that we have our bigram model, we can use it to generate text. Run the following code in a new cell:
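def generate_text(seed, n_words):
    result = [seed]
    for _ in range(n_words):
        # .get() avoids adding empty entries to the defaultdict for unseen words
        next_word_options = bigram_freq.get(result[-1])
        if not next_word_options:
            break  # dead end: no word ever followed this one in the sample text
        # Sample the next word in proportion to how often it followed the current word
        next_word = random.choices(list(next_word_options.keys()), list(next_word_options.values()))[0]
        result.append(next_word)
    return ' '.join(result)

generated_text = generate_text('princess', 5)
print(generated_text)

This code defines a function generate_text() that accepts a seed word and generates a sequence of words by repeatedly sampling the next word from the bigram model.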
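
Random sampling gives a different result on each run. For comparison, here is a minimal sketch of a greedy variant that always picks the most frequent follower, which makes the output repeatable. The toy corpus and the helper generate_greedy are hypothetical and are not part of the lab code.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration), tokenized by whitespace
toy_tokens = "the princess was kind and the princess was gentle and the king was kind".split()

toy_bigram_freq = defaultdict(Counter)
for w1, w2 in zip(toy_tokens, toy_tokens[1:]):
    toy_bigram_freq[w1][w2] += 1

def generate_greedy(seed, n_words, freq):
    """Always append the most frequent follower; stop early at a dead end."""
    result = [seed]
    for _ in range(n_words):
        followers = freq.get(result[-1])
        if not followers:
            break
        result.append(followers.most_common(1)[0][0])
    return " ".join(result)

greedy_text = generate_greedy("princess", 4, toy_bigram_freq)
print(greedy_text)
```

Greedy generation tends to loop through the same high-frequency phrases, which is why the lab version samples in proportion to the counts instead.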

8. Conclusion

Congratulations! You have successfully built the simplest LLM using Jupyter Notebook. This basic language model demonstrates the core concepts of NLP, including tokenization and n-grams.
Although simple, it can be expanded and improved for more complex applications. Keep experimenting and learning to enhance your NLP skills!

Expanding Your Simplest LLM with Jupyter Notebook
In this tutorial, we will build upon the simplest LLM we created previously. We will show you how to add more text to your model, train it, and prompt it with a wider range of questions. We'll cover the following steps:
Set up Jupyter Notebook
Import necessary libraries
Prepare the dataset
Tokenize the text
Create a trigram model
Train the model with more text
Generate text with various questions

1. Set up Jupyter Notebook

Follow the same steps as in the previous tutorial to set up Jupyter Notebook.

2. Import necessary libraries

import nltk
import random
from nltk import word_tokenize, sent_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

3. Prepare the dataset

Load your dataset and combine it with new text data. Make sure the new text is clean and well-formatted.
old_text = "your_previous_text_data"
new_text = "your_new_text_data"
combined_text = old_text + " " + new_text
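
What counts as clean text depends on your data. As one possible starting point, here is a minimal sketch that normalizes whitespace before the two texts are combined. The helper clean_text and the sample strings are hypothetical.

```python
import re

def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()

# Sample strings invented for illustration
old_sample = "  Once upon a time,\n\nthere lived a king.  "
new_sample = "The king\thad a daughter.\n"
combined_sample = clean_text(old_sample) + " " + clean_text(new_sample)
print(combined_sample)
```

Real datasets often need further steps, such as removing markup or fixing encoding problems, which this sketch does not attempt.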

4. Tokenize the text

Tokenize the combined text into sentences and words.
sent_tokens = sent_tokenize(combined_text)
word_tokens = [word_tokenize(t) for t in sent_tokens]

5. Create a trigram model

We'll use a trigram model this time. It predicts each word from the two words before it, so it captures more context than the bigram model.
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, word_tokens)
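
The pipeline pads each sentence with boundary symbols so that the first and last words also appear with a full context. The following plain-Python sketch shows the idea for trigrams only; the helper padded_trigrams is hypothetical, and the nltk pipeline additionally produces the shorter n-grams (unigrams and bigrams).

```python
def padded_trigrams(sentence_tokens, n=3):
    """Pad a tokenized sentence with n-1 boundary symbols per side and list its n-grams."""
    # "<s>" and "</s>" are the padding symbols nltk uses by default
    padded = ["<s>"] * (n - 1) + list(sentence_tokens) + ["</s>"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

trigrams = padded_trigrams(["the", "king", "slept"])
for gram in trigrams:
    print(gram)
```

The first trigram pairs "the" with two start symbols, which is how the model learns which words tend to open a sentence.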

6. Train the model with more text

Instantiate the MLE (maximum likelihood estimation) model and fit it with the training data.
model = MLE(n)
model.fit(train_data, padded_sents)
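
Once fitted, the model can be queried for the probability of a word given its context. Here is a minimal sketch on a three-sentence toy corpus, assuming nltk is installed; the corpus and the variable names are invented for illustration, and no tokenizer data download is needed because the sentences are already tokenized.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus of pre-tokenized sentences (an assumption for illustration)
sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]
toy_train, toy_vocab = padded_everygram_pipeline(3, sentences)
toy_model = MLE(3)
toy_model.fit(toy_train, toy_vocab)

# "the cat" is followed once by "sat" and once by "ran"
p_sat = toy_model.score("sat", ["the", "cat"])
# "the" is followed twice by "cat" and once by "dog"
p_cat = toy_model.score("cat", ["the"])
print(p_sat, p_cat)
```

An MLE model uses plain relative frequencies, so any word never seen after a given context gets probability zero.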

7. Generate text with various questions

Now, you can ask more questions and generate text based on different input words or phrases.
def generate_text(prompt, num_words, model):
    # text_seed gives the model the prompt words as preceding context
    word_list = model.generate(num_words, text_seed=prompt.split())
    response = ' '.join(word_list)
    return response

# Example questions
questions = [
    "What is the importance",
    "How does it work",
    "What are the benefits",
    "How can I improve",
    "What should I consider",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {generate_text(question, 20, model)}")

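Trained on a larger dataset, this expanded LLM has more contexts to draw on, so its output is usually more varied and more coherent. Keep in mind that an n-gram model continues the prompt with likely words; it does not look up an answer to the question. Continue experimenting with different datasets, model architectures, and training techniques to further enhance your NLP skills.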
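
A model trained on padded sentences can emit the padding symbols themselves in its output. The following minimal sketch tidies a generated word list before printing; the helper tidy_generated and the sample list are hypothetical.

```python
def tidy_generated(words):
    """Drop start-padding tokens and cut the sequence at the first end-of-sentence token."""
    cleaned = []
    for word in words:
        if word == "<s>":
            continue  # start padding carries no content
        if word == "</s>":
            break  # end of sentence reached
        cleaned.append(word)
    return " ".join(cleaned)

# Sample output invented for illustration
raw_output = ["<s>", "the", "king", "slept", "</s>", "</s>", "the"]
print(tidy_generated(raw_output))
```

Cutting at the first end-of-sentence token keeps a single sentence; whether to keep later sentences instead is a design choice for your application.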

Now let's provide a sample Jupyter Notebook tutorial on creating a simple LLM (language model) using the nltk library.
This tutorial covers setting up the environment, importing the necessary libraries, preparing the dataset, tokenizing the text, creating and training a bigram model, and generating text based on user inputs.
Set up the environment:
Install Jupyter Notebook if you haven't already.
Create a new Jupyter Notebook in your desired directory.
Import necessary libraries:
import nltk
from nltk import bigrams, FreqDist
from nltk.util import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from random import choice
Prepare the dataset:
# Sample text data
text = "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence \
concerned with the interactions between computers and human language. In particular, it focuses on programming \
computers to process and analyze large amounts of natural language data."

# Tokenize the text
tokens = nltk.word_tokenize(text)
Create a bigram model:
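# Generate bigrams and their frequency distribution
bigram_list = list(ngrams(tokens, 2))  # named bigram_list so it does not shadow nltk.bigrams
bigram_freq_dist = FreqDist(bigram_list)

# Prepare the dataset for training
# The pipeline expects a list of tokenized sentences, so wrap the flat token list
train_data, padded_sents = padded_everygram_pipeline(2, [tokens])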
Train the model:
# Train the bigram model
model = MLE(2)
model.fit(train_data, padded_sents)
Generate text based on user inputs:
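
A minimal sketch of this step, assuming nltk is installed. It trains its own small bigram model so that it runs on its own; the corpus, the name demo_model, and the helper generate_from_input are hypothetical, and whitespace splitting stands in for nltk.word_tokenize.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Tiny whitespace-tokenized corpus (an assumption; the tutorial uses nltk.word_tokenize)
corpus = [
    "natural language processing is a subfield of computer science".split(),
    "natural language data is analyzed by computers".split(),
]
train_data, vocab_stream = padded_everygram_pipeline(2, corpus)
demo_model = MLE(2)
demo_model.fit(train_data, vocab_stream)

def generate_from_input(model, user_text, num_words=5, random_seed=None):
    """Continue the user's text with words sampled from the model."""
    seed_words = user_text.lower().split()
    new_words = model.generate(num_words, text_seed=seed_words, random_seed=random_seed)
    if isinstance(new_words, str):
        new_words = [new_words]  # generate() returns a single string when num_words is 1
    return " ".join(seed_words + list(new_words))

# In a notebook, user_text could come from input("Enter a prompt: ")
user_text = "natural language"
result = generate_from_input(demo_model, user_text, num_words=5, random_seed=42)
print(result)
```

Passing random_seed makes a run repeatable; leave it as None to get a different continuation each time.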