Building the Simplest LLM with Jupyter Notebook: A Student's Guide

This lab book is the starting point for your assignment.

Tooling Setup:

We will be using Anaconda Python, which includes Jupyter Notebook.

Learning outcomes:

How to build the simplest LLM with Jupyter Notebook. You will build a toy version: a prototype or proof of concept that serves as a minimum viable product (MVP).

Introduction:

In this lab notebook, you will learn how to build the simplest language model, a toy counterpart to a large language model (LLM), using Jupyter Notebook.

We will use Python and the nltk library to create a basic language model. This is a minimum viable product (MVP) designed to be as simple as possible while providing a complete and detailed implementation template and set of recipes.

Table of Contents

1. Introduction to Language Models

2. Setting up Jupyter Notebook

3. Importing Libraries

4. Preparing the Dataset

5. Tokenization

6. N-Gram Model

7. Generating Text

8. Conclusion

1. Introduction to Language Models

A language model is a probabilistic model that is used to predict the likelihood of a sequence of words appearing in a given context.

It is commonly used in natural language processing (NLP) tasks such as speech recognition, machine translation, and text generation.
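To make "the likelihood of a sequence of words" concrete, here is a minimal sketch that scores a short sentence under a bigram assumption, where each word depends only on the word before it. The table `toy_probs`, the `<s>` start marker, and the helper `sequence_probability` are hypothetical and hand-picked for illustration; they are not estimated from real data.

```python
# Hypothetical, hand-picked conditional probabilities P(next | previous).
toy_probs = {
    ("<s>", "the"): 0.5,
    ("the", "princess"): 0.2,
    ("princess", "smiled"): 0.1,
}

def sequence_probability(words, probs):
    """Multiply P(word | previous word) along the sequence (bigram assumption)."""
    probability = 1.0
    previous = "<s>"  # assumed start-of-sentence marker
    for word in words:
        # Unseen pairs get probability 0 in this unsmoothed sketch.
        probability *= probs.get((previous, word), 0.0)
        previous = word
    return probability

likely = sequence_probability(["the", "princess", "smiled"], toy_probs)
unlikely = sequence_probability(["princess", "the", "smiled"], toy_probs)
print(likely, unlikely)
```

The first word order gets a small but non-zero probability, while the scrambled order gets zero because its word pairs never appear in the table.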

2. Setting up Jupyter Notebook

To get started, you need to install Jupyter Notebook on your computer. Follow these steps:

Install Anaconda Python: Download and install the latest version of Anaconda Python from www.anaconda.com.

Install Jupyter Notebook: Anaconda already includes Jupyter Notebook. If you are not using Anaconda, open a terminal/command prompt and run the following command:

pip install jupyter

Launch Jupyter Notebook: Type jupyter notebook in your terminal/command prompt, and a new browser window should open with the Jupyter Notebook interface.

Create a new Python notebook by clicking on the "New" button and selecting "Python 3".

3. Importing Libraries

In this lab, we will use the Natural Language Toolkit (nltk) library. To install it, open a new cell in your Jupyter Notebook and run the following:

```python
!pip install nltk
```

Now, import the necessary libraries:

```python
import nltk
import random
from nltk.util import ngrams
from collections import defaultdict, Counter
```

4. Preparing the Dataset

For our simple LLM, we will use a sample text. You can replace this with your own dataset if desired. Paste the following code in a new cell:

```python
sample_text = """
Once upon a time, in a land far, far away, there lived a king and queen who had a beautiful daughter. The princess was kind and gentle, and everyone loved her.
"""
```

5. Tokenization

Tokenization is the process of breaking a text into individual words or tokens.

We will use the nltk.word_tokenize() function to tokenize our sample text. Run the following code:

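```python
# Download the tokenizer data (only needed once)
nltk.download('punkt')
# Newer NLTK releases look for 'punkt_tab' instead; downloading both is harmless
nltk.download('punkt_tab')

tokens = nltk.word_tokenize(sample_text.lower())
print(tokens)
```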
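To see what tokenization does without any downloaded data, here is a rough sketch using only the standard library. The helper `simple_tokenize` is hypothetical and much cruder than nltk.word_tokenize(), which applies extra rules (for example for contractions), so the two can give different results on real text.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text.lower())

example_tokens = simple_tokenize("Once upon a time, there lived a king.")
print(example_tokens)
```

Punctuation marks become tokens of their own, which is why the comma and full stop also show up as "words" in the bigram model later on.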

6. N-Gram Model

An N-gram is a contiguous sequence of n items from a given sample of text.

We will create a simple bigram model (n=2) for our LLM. Run the following code in a new cell:

```python
# All pairs of consecutive tokens
bigrams = list(ngrams(tokens, 2))

# For each word, count which words follow it and how often
bigram_freq = defaultdict(Counter)
for w1, w2 in bigrams:
    bigram_freq[w1][w2] += 1

print(bigram_freq)
```

This code builds a dictionary that maps each word to a Counter of the words that follow it, together with how often each one follows.
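Raw counts become a probabilistic model once each word's follower counts are divided by their total. The sketch below shows that step on a tiny hand-written token list so it runs on its own; `demo_tokens`, `demo_freq`, and `to_probabilities` are hypothetical names used only for this illustration.

```python
from collections import Counter, defaultdict

# A tiny hand-written token list so the sketch runs on its own.
demo_tokens = ["the", "king", "and", "the", "queen", "and", "the", "king"]

# Count how often each word follows each other word.
demo_freq = defaultdict(Counter)
for w1, w2 in zip(demo_tokens, demo_tokens[1:]):
    demo_freq[w1][w2] += 1

def to_probabilities(freq):
    """Normalize each word's follower counts so they sum to 1."""
    probs = {}
    for w1, followers in freq.items():
        total = sum(followers.values())
        probs[w1] = {w2: count / total for w2, count in followers.items()}
    return probs

demo_probs = to_probabilities(demo_freq)
print(demo_probs["the"])
```

Here "the" is followed by "king" twice and by "queen" once, so the estimated probabilities are two thirds and one third.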

7. Generating Text

Now that we have our bigram model, we can use it to generate text. Run the following code in a new cell:

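```python
def generate_text(seed, n_words):
    result = [seed]
    for _ in range(n_words):
        # .get() avoids adding an empty entry for an unseen word
        next_word_options = bigram_freq.get(result[-1])
        if not next_word_options:
            break  # no known follower, so stop early
        # Sample in proportion to how often each word followed the last one
        next_word = random.choices(list(next_word_options.keys()), list(next_word_options.values()))[0]
        result.append(next_word)
    return ' '.join(result)

generated_text = generate_text('princess', 5)
print(generated_text)
```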

This code defines a function generate_text() that accepts a seed word and generates a sequence of words using the bigram model.
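Because the next word is sampled at random, every run can produce different text. A minimal sketch of the sampling step on its own, with a fixed seed so the draws are repeatable: the counts in `demo_freq` are hypothetical stand-ins for the bigram counts, and `sample_next` is an illustrative helper.

```python
import random
from collections import Counter

# Hypothetical follower counts standing in for the bigram counts.
demo_freq = {
    "the": Counter({"princess": 3, "king": 1}),
    "princess": Counter({"was": 1}),
    "king": Counter({"was": 1}),
    "was": Counter({"kind": 1}),
}

def sample_next(word, freq, rng):
    """Pick a follower of `word` with probability proportional to its count."""
    followers = freq.get(word)
    if not followers:
        return None  # dead end: the word was never seen with a follower
    return rng.choices(list(followers), weights=list(followers.values()))[0]

rng = random.Random(42)  # fixed seed so repeated runs give the same draws
draws = Counter(sample_next("the", demo_freq, rng) for _ in range(1000))
print(draws)
```

Over many draws, "princess" should come up roughly three times as often as "king", matching the 3:1 ratio in the counts.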

8. Conclusion

Congratulations! You have successfully built the simplest LLM using Jupyter Notebook. This basic language model demonstrates the core concepts of NLP, including tokenization and n-grams.

Although simple, it can be expanded and improved for more complex applications. Keep experimenting and learning to enhance your NLP skills!

Expanding Your Simplest LLM with Jupyter Notebook

In this tutorial, we will build upon the simplest LLM we created previously. We will show you how to add more text to your model, train it, and ask more questions to get better answers. We'll cover the following steps:

1. Set up Jupyter Notebook

2. Import necessary libraries

3. Prepare the dataset

4. Tokenize the text

5. Create a trigram model

6. Train the model with more text

7. Generate text with various questions

1. Set up Jupyter Notebook

Follow the same steps as in the previous tutorial to set up Jupyter Notebook.

2. Import necessary libraries

```python
import nltk
import random
from nltk import word_tokenize, sent_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
```

3. Prepare the dataset

Load your dataset and combine it with new text data. Make sure the new text is clean and well-formatted.

```python
old_text = "your_previous_text_data"
new_text = "your_new_text_data"
combined_text = old_text + " " + new_text
```

4. Tokenize the text

Tokenize the combined text into sentences and words.

```python
sent_tokens = sent_tokenize(combined_text)
word_tokens = [word_tokenize(t) for t in sent_tokens]
```

5. Create a trigram model

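We'll use a trigram model this time (n = 3), which predicts each word from the two words before it. The longer context usually gives more coherent output than a bigram model, provided there is enough training text.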

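```python
n = 3
# Pads every sentence and builds all n-grams up to length n.
# Both return values are lazy generators and can be consumed only once.
train_data, padded_sents = padded_everygram_pipeline(n, word_tokens)
```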
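What the pipeline does to each sentence can be sketched in plain Python. The helpers `pad` and `trigrams_of` below are hypothetical, not nltk functions; they assume the `<s>` and `</s>` padding symbols that nltk uses by default, with n - 1 copies added on each side. The real pipeline also yields the shorter unigrams and bigrams, which this sketch leaves out.

```python
def pad(sentence, n):
    """Add n - 1 start and end markers around a tokenized sentence."""
    return ["<s>"] * (n - 1) + list(sentence) + ["</s>"] * (n - 1)

def trigrams_of(tokens):
    """Return every run of three consecutive tokens."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

padded = pad(["the", "princess", "smiled"], 3)
print(padded)
print(trigrams_of(padded))
```

The padding lets the model learn which words tend to start and end a sentence, since the first real word now has a two-token context of start markers.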

6. Train the model with more text

Instantiate the MLE model and fit it with the training data.

```python
model = MLE(n)
model.fit(train_data, padded_sents)
```
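After fitting, it can help to inspect what the model has learned. The sketch below trains a separate bigram model on two hand-written sentences and then looks up a count and a conditional probability. The data and the names `demo_sentences` and `demo_model` are hypothetical and chosen so the numbers are easy to check by hand.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Two tiny, already-tokenized sentences (hypothetical data).
demo_sentences = [
    ["the", "king", "loved", "the", "queen"],
    ["the", "queen", "loved", "the", "king"],
]

demo_train, demo_vocab = padded_everygram_pipeline(2, demo_sentences)
demo_model = MLE(2)
demo_model.fit(demo_train, demo_vocab)

# How many times "king" directly followed "the" in the training data.
count_the_king = demo_model.counts[["the"]]["king"]
# Maximum likelihood estimate of P("king" | "the").
p_king_given_the = demo_model.score("king", ["the"])
print(count_the_king, p_king_given_the)
```

In this data "the" is followed by "king" twice and by "queen" twice, so the estimate is one half. A pair that never occurs, such as "queen" after "king", scores zero because MLE applies no smoothing.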

7. Generate text with various questions

Now, you can ask more questions and generate text based on different input words or phrases.

```python
def generate_text(prompt, num_words, model):
    word_list = model.generate(num_words, text_seed=prompt.split())
    response = ' '.join(word_list)
    return response

# Example questions
questions = [
    "What is the importance",
    "How does it work",
    "What are the benefits",
    "How can I improve",
    "What should I consider"
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {generate_text(question, 20, model)}")
    print("\n")
```

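With more training text, this expanded model has more contexts to draw on, so its output is usually more varied and more coherent. Keep in mind that an n-gram model does not answer questions: it continues the prompt with words that tend to follow it in the training text. Continue experimenting with different datasets, model architectures, and training techniques to further enhance your NLP skills.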
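Text sampled from a model trained on padded sentences can contain the padding symbols `<s>` and `</s>`. One possible way to tidy the output is sketched below. The helper `clean_generated` and the sample list `raw` are hypothetical, and stopping at the first end marker is a design choice, not something nltk does for you.

```python
def clean_generated(words):
    """Drop start markers and stop at the first end-of-sentence marker."""
    cleaned = []
    for word in words:
        if word == "</s>":
            break  # the model signalled the end of a sentence
        if word != "<s>":
            cleaned.append(word)
    return " ".join(cleaned)

# A hand-written example of what raw generated output might look like.
raw = ["<s>", "the", "princess", "was", "kind", "</s>", "</s>", "the"]
print(clean_generated(raw))
```

If you would rather keep generating across sentence boundaries, skip both markers and leave out the early stop.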

Now let's walk through a sample Jupyter Notebook tutorial on creating a simple language model using the nltk library.

This tutorial includes setting up the environment, importing necessary libraries, preparing the dataset, tokenizing the text, creating a bigram model, training the model with more text, and generating text based on user inputs.

Set up the environment:

Install Jupyter Notebook if you haven't already.

Create a new Jupyter Notebook in your desired directory.

Import necessary libraries:

```python
import nltk
from nltk import bigrams, FreqDist
from nltk.util import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from random import choice
```

Prepare the dataset:

```python
# Sample text data
text = "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence \
concerned with the interactions between computers and human language. In particular, it focuses on programming \
computers to process and analyze large amounts of natural language data."

# Tokenize the text
tokens = nltk.word_tokenize(text)
```

Create a bigram model:

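```python
# Generate bigrams and their frequency distribution
# (named bigram_list so it does not shadow the imported bigrams function)
bigram_list = list(ngrams(tokens, 2))
bigram_freq_dist = FreqDist(bigram_list)

# Prepare the dataset for training.
# The pipeline expects a list of tokenized sentences, so wrap the flat token list.
train_data, padded_sents = padded_everygram_pipeline(2, [tokens])
```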
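padded_everygram_pipeline expects an iterable of tokenized sentences (a list of lists), not a flat list of tokens. The plain-Python sketch below shows why the difference matters: anything that walks its input one "sentence" at a time will split a bare string into characters. The helper `iterate_as_sentences` is hypothetical and only mimics that iteration pattern.

```python
def iterate_as_sentences(data):
    """Treat each element of `data` as one sentence and list its items."""
    return [list(sentence) for sentence in data]

flat_tokens = ["natural", "language"]

# A flat list: each string is treated as a sentence and split into characters.
print(iterate_as_sentences(flat_tokens))
# A nested list: the single inner list is treated as one sentence of words.
print(iterate_as_sentences([flat_tokens]))
```

With a flat list the model would end up learning character sequences instead of word sequences.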

Train the model:

```python
# Train the bigram model
model = MLE(2)
model.fit(train_data, padded_sents)
```

Generate text based on user inputs:

```python
def generate_sentence(model, num_words, seed_word):
    sentence = [seed_word]
    for _ in range(num_words - 1):
        # generate(1, ...) returns a single word (a string), not a list
        next_word = model.generate(1, text_seed=sentence)
        sentence.append(next_word)
    return ' '.join(sentence)

# Example questions to the model
questions = [
    "What is natural language processing?",
    "How does artificial intelligence relate to linguistics?",
    "Can computers understand human language?",
]

# Generate answers for the questions
for question in questions:
    # Use a separate name so the training tokens are not overwritten
    question_tokens = nltk.word_tokenize(question)
    seed_word = choice(question_tokens)
    generated_sentence = generate_sentence(model, 10, seed_word)
    print(f"Q: {question}\nA: {generated_sentence}\n")
```
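The seed word is drawn at random from the question, so it may be a punctuation mark or a word that never appears in the training text. One possible refinement, sketched here in plain Python, is to pick only from words the model has seen. The helper `pick_seed` and the set `known_words` are hypothetical; in practice you would build the set from your own training tokens.

```python
import random

def pick_seed(question_tokens, known_words, rng=random):
    """Choose a random question token that also occurs in the training data."""
    candidates = [token for token in question_tokens if token in known_words]
    # Return None when nothing matches so the caller can choose a fallback.
    return rng.choice(candidates) if candidates else None

known_words = {"natural", "language", "processing", "is", "computers"}
seed = pick_seed(["What", "is", "natural", "language", "processing", "?"], known_words)
print(seed)
```

Starting from a known word gives the bigram model a context it actually has counts for.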

Feel free to experiment with different datasets, model architectures, and training techniques to further enhance your NLP skills using Jupyter Notebook.
