Creating Word Embeddings and Building an AI Language Model in Google Colab
Part 1: Introduction
Welcome to the lab session where we will delve into the world of word embeddings and AI language models.
This lab is designed to give you a hands-on experience in creating word embeddings using popular techniques and utilizing these embeddings to build and train an AI language model.
We will use Google Colab for this lab to leverage its computational resources and ease of use.
Objective
By the end of this lab, you will be able to:
1. Understand the concept and importance of word embeddings.
2. Create word embeddings using Word2Vec.
3. Build a simple AI language model using the generated word embeddings.
4. Train the language model on a corpus of text (we will use the NLTK Gutenberg Corpus).
5. Evaluate and visualize the word embeddings.
Tools:
To achieve our objectives, we will use the following tools:
1. **Google Colab**: An online platform that provides free access to GPUs and TPUs, making it ideal for running machine learning tasks.
2. **Python**: The primary programming language for this lab.
3. **TensorFlow/Keras**: Popular libraries for building and training machine learning models.
4. **NLTK (Natural Language Toolkit)**: A library for working with human language data (text).
5. **Gensim**: A library for topic modeling and document similarity analysis, used here for creating word embeddings.
Step-by-Step Guide
1. Setting Up Google Colab
- Open Google Colab by navigating to [colab.research.google.com](https://colab.research.google.com).
- Create a new notebook by clicking on **File > New Notebook**.
2. Installing Necessary Libraries
In your Google Colab notebook, you need to install the required libraries. Run the following commands to install `nltk`, `gensim`, and `tensorflow`:
```python
!pip install nltk gensim tensorflow
```
3. Importing Libraries
Import the libraries that you will be using throughout the lab:
```python
import nltk
import gensim
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
```
4. Downloading and Preparing the Data
We will use a text corpus from the NLTK library for this lab.
Let's download and prepare the data:
```python
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in newer NLTK releases
nltk.download('stopwords')
# Load the corpus
corpus = gutenberg.raw('austen-emma.txt')
# Tokenize the text
words = word_tokenize(corpus.lower())
# Remove stopwords and non-alphanumeric tokens
stop_words = set(stopwords.words('english'))
words = [word for word in words if word.isalnum() and word not in stop_words]
```
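The filtering above can be illustrated on a toy sentence. This sketch hard-codes a small stop-word list and uses plain whitespace splitting so it runs without any NLTK downloads; in the lab itself, `word_tokenize` and `stopwords.words('english')` do this work:

```python
# Minimal sketch of the same cleaning steps on a toy sentence.
# The stop-word list is hard-coded for illustration; in the lab it
# comes from nltk.corpus.stopwords.
toy = "The weather, said Emma, is not so very bad!"
stop_words = {"the", "is", "not", "so"}

# Naive whitespace tokenization stands in for word_tokenize here
tokens = toy.lower().split()
# Strip punctuation, then apply the same isalnum/stop-word filter
tokens = [t.strip(".,!?") for t in tokens]
tokens = [t for t in tokens if t.isalnum() and t not in stop_words]
print(tokens)  # ['weather', 'said', 'emma', 'very', 'bad']
```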
5. Creating Word Embeddings Using Word2Vec
We will use the Word2Vec model from the Gensim library to create word embeddings:
```python
# Create the Word2Vec model.
# Passing [words] treats the whole corpus as one long "sentence" -- a
# simplification that keeps this lab short; splitting the text into real
# sentences first generally yields better embeddings.
word2vec_model = gensim.models.Word2Vec(sentences=[words], vector_size=100, window=5, min_count=1, workers=4)
# Save the model for later use
word2vec_model.save("word2vec.model")
```
6. Visualizing Word Embeddings
To better understand the word embeddings, we will visualize them using t-SNE:
```python
def plot_embeddings(model):
    labels = []
    tokens = []
    for word in model.wv.index_to_key:
        tokens.append(model.wv[word])
        labels.append(word)
    tokens = np.array(tokens)  # t-SNE expects an array, not a list of vectors
    # Note: newer scikit-learn versions rename n_iter to max_iter
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)
    x = [value[0] for value in new_values]
    y = [value[1] for value in new_values]
    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

plot_embeddings(word2vec_model)
```
This code ensures the Word2Vec model is properly trained and the embeddings are visualized using t-SNE.
Congratulations! You have now created word embeddings and used a Python plot to visualize them!
Explanation of Key Concepts
Word Embeddings
Word embeddings are dense vector representations of words in a continuous vector space where words that have similar meanings are positioned close to each other. This representation captures the semantic meaning of words, enabling machine learning models to understand and process text more effectively.
#### **Mathematical Background**
1. **Vectors**: Words are represented as vectors in a high-dimensional space.
2. **Dot Product**: The similarity between two words can be measured with the dot product of their vectors, usually normalized by the vectors' lengths (cosine similarity). A higher value indicates higher similarity.
3. **Optimization**: Word embeddings are typically learned by optimizing a loss function that aims to position similar words closer together in the vector space.
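As a concrete illustration of point 2, cosine similarity is the dot product of two vectors divided by the product of their norms, and it can be computed by hand. The three-dimensional "embeddings" below are made up for illustration only:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v, normalized by their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "embeddings": king/queen point in similar directions, car does not
king  = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
car   = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1
print(cosine_similarity(king, car))    # much smaller
```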
#### **Word2Vec**
Word2Vec is a popular algorithm for generating word embeddings. It comes in two flavors:
- **Skip-gram**: Given a word, predict the surrounding context words.
- **CBOW (Continuous Bag of Words)**: Given a context, predict the target word.
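The two objectives can be made concrete by generating training pairs from a sentence. The sketch below extracts skip-gram (center, context) pairs with a context window of 1; `skipgram_pairs` is a hypothetical helper for illustration, not part of Gensim's API:

```python
def skipgram_pairs(tokens, window=1):
    """Yield (center, context) training pairs, as used by skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["emma", "was", "happy"]))
# [('emma', 'was'), ('was', 'emma'), ('was', 'happy'), ('happy', 'was')]
```

CBOW inverts each pair: it gathers all context words in the window and predicts the center word from them.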
Gensim's `Word2Vec` defaults to CBOW; skip-gram can be selected by passing `sg=1`, and it generally provides better representations for smaller datasets.
---
This concludes the introduction part of the lab. In the next parts, we will build the AI language model using the generated word embeddings, train it on the prepared corpus, and evaluate its performance.