s24 AML3304 Assignment Build your own Word Embedding Due JULY 5

High-level Outline of Assignment Deliverables:

You can work in teams of up to 4
What to do:
Create an Embedding. In your Project, you will use this Embedding to build an AI Language Model.

How you will do this:
Use either a Google Colab Notebook (GCN) or a Hugging Face Space to write a Python program which creates an Embedding.

How to hand this in:
Create a text file: named as TeamName.txt
I will provide a Dropbox link for you to upload this text file to.
Into this file, put:
Team members’ names and student IDs
URL of your Trello Board.


As a preview to what will happen with the Project:

Your project will involve making a fully functional AI Language model.
Your project delivery mechanism will be to host your MODEL on a Model Delivery Platform.

The hand-in mechanism for this Assignment is the URL of your Google Colab Notebook
or HuggingFace Space, which you will put into your TRELLO Board.

Hand-in mechanism:
Make a text file:
Hand in URL
Team name and members’ information (name, student ID)
TRELLO Board for the Team: Provide the URL to your TRELLO Board: Your links to your deployment platform will go into the TRELLO board where I can access it.

I will provide a Dropbox Link to upload that Text File to.

You will send me the URL of that Notebook by putting it into a TEXT file and uploading that text file.

In addition to the GC Notebook/HuggingFace Space with the code, we will add one more layer to this Assignment which will put the spotlight on HOW Word embeddings function in the AI/ML Model to fine-tune our model on the Business Domain of the company.

Here is a Case Study in Deep Brew. We are presenting this now to start work on the PROJECT which will be to build your own Deep Brew-like product.


In a text cell in your Google Colab Notebook (up at the top), discuss your team’s research on these topics:

How does PyTorch facilitate the creation of neural network models in the context of personalized marketing initiatives such as Deep Brew?
Can you explain the role of PyTorch in training AI models to predict customer behavior, similar to Starbucks' "Deep Brew" initiative?
What are the fundamental steps involved in using PyTorch to build a basic neural network for personalized marketing applications (which is what Deep Brew is)?
How does the integration of PyTorch with Google Colab enhance the accessibility and usability of AI/ML tools for project development?
In what ways can PyTorch be leveraged to analyze and optimize real-time customer data for the implementation of personalized marketing strategies [which is our Starbucks Deep Brew to run the enterprise]?

These questions are the connector between your Assignment and the Project:

How will you hand in this Assignment:
This is your DROPBOX Upload Location for the Team Information Text File
Upload links will be provided

The work product of Assignment 1 is:

To make a Word Embedding. Remember that the Embedding is the data structure which is the center of your AI Model. It is the box into which we put the tokens and weightings of our Model.
The learning outcome of this assignment is to get a visual, hands-on look at this Embedding Data Structure in PYTHON CODE, in the Google Colab Notebook or HuggingFace Space lab.

For the assignment, it is enough to just train our embedding on a couple of sentences hand coded into a Cell in the Google Colab Notebook. {Later, for your PROJECT, I will show you how to train your EMBEDDING on some training data set.}
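To make the "box" idea concrete, here is a minimal sketch of an embedding as a lookup table built from two hand-coded sentences. The sentences, the embedding size of 4, and the helper name embed are assumptions made up for illustration, and the vectors are random, untrained starting values. The sketch uses NumPy only to stay lightweight; PyTorch's nn.Embedding layer is built on the same idea of one row of weights per token.

```python
import numpy as np

# Two hand-coded sentences (hypothetical examples, not course data)
sentences = [
    "the customer orders a latte",
    "the customer orders a mocha",
]

# Tokenize and build the vocabulary: each unique word gets an integer index
tokens = [word for s in sentences for word in s.lower().split()]
vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}

# The embedding itself: a matrix with one row of weights per vocabulary word
embedding_dim = 4  # assumed size, kept small so the rows are easy to read
rng = np.random.default_rng(seed=0)
embedding = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

def embed(word):
    """Look up the row of weights (the vector) for a word."""
    return embedding[word_to_index[word]]

print("Vocabulary:", vocab)
print("Embedding shape:", embedding.shape)
print("Vector for 'latte':", embed("latte"))
```

Training, which comes later, only changes the numbers inside this matrix; the lookup structure stays the same.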
You can work in teams or you can work by yourself:

Do this ONE per team: Send me a TEXT file. Name that text file as TeamName.txt
Into that text file, put the following details:
The names and student IDs of your team members. {1 to 4 members per team}
Let’s make a GOOGLE Colab Notebook:
{This is a Google product so log in with your standard Gmail email and password}


How to do the Assignment:
Assignment is going to evolve into the Project.
For the Assignment: You will make an Embedding and run a small scale AI language model on it.

Leibniz tried to create a mathematical system to express human thinking.
An AI Language Model is, in terms of Python and the NLTK, the PYTORCH TENSOR FILE.
When you make an AI MODEL, what you produce is the PYTORCH TENSOR FILE.

Read this and see which of your ideas you can present into your Project Presentation. [Due for January 23 class]
This talks about software architecture for the AI MODEL:
How Embeddings Work: (Making an Embedding is your assignment):

In this assignment you will create a word embedding which you will roll into your Project, due at the end of the term.
Class Outline: Creating Word Embeddings from Scratch
1. Introduction (10 minutes)
Lecture Notes:
1.1 Brief Overview of the Day's Objectives (3 minutes)
Learning Outcomes:
What word embeddings are
Why they're crucial in NLP (Natural Language Processing).
Hands-on experience creating them from scratch.
1.2 Relevance of Word Embeddings in the World of NLP (3 minutes)
Language is inherently complex.
When we communicate, we don't just exchange words; we exchange meanings, emotions, and intentions.
What are the goals of the generative AI Language Model?
To be contextually nuanced. We want to convey the impression that the Language Model is emotionally empathetic and concerned for the well-being of the respondent it is conversing with.

"For machines to understand language, they need a way to capture this richness. This is where word embeddings come in."
"Imagine trying to teach a computer the meaning of the word 'king'. You could provide a dictionary definition, but that's just more words. Instead, what if you could represent 'king' as a point in space, close to 'queen' but far from 'apple'? This spatial representation, where words with similar meanings are closer together, is the essence of word embeddings."
1.3 How Word Embeddings Serve as a Foundation for Generative AI Language Models (4 minutes)
"Word embeddings are more than just a neat trick. They're foundational for advanced NLP tasks. When we talk about AI models that can write essays, answer questions, or even generate poetry, we're talking about models that, at their core, understand the relationships between words. And this understanding starts with word embeddings."
"Think of word embeddings as the base layer of a skyscraper. On their own, they might seem like a flat expanse. But they provide the stability and foundation upon which we can build towering structures. In our case, these 'towers' are generative AI language models."
Python Code:
For this section, there isn't any direct Python code since it's an introductory segment. However, you can provide a visual representation of word embeddings to make the concept clearer. Here's a simple example using gensim, scikit-learn, and matplotlib to visualize pre-trained word embeddings:

# Ensure that you have the necessary libraries installed
!pip install gensim matplotlib scikit-learn

import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a pre-trained word embeddings model
word_vectors = api.load("glove-wiki-gigaword-100")

# Define some words to visualize
words = ["king", "queen", "man", "woman", "apple", "banana", "computer", "laptop"]

# Get the vectors for the defined words
vectors = [word_vectors[word] for word in words]

# Reduce dimensions for visualization
pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

# Plot the words in 2D space
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()

"Here's a simple visualization of some word embeddings. Notice how 'king' and 'queen' are closer to each other than, say, 'king' and 'apple'. This spatial closeness represents semantic similarity."
This introduction sets the stage for the rest of the session, ensuring students understand the importance and relevance of word embeddings in the broader context of NLP and AI.
2. Foundational Concepts (15 minutes)
Lecture Notes:
2.1 Quick Recap of Foundational NLP Concepts (5 minutes)
"Before we dive into word embeddings, let's revisit some foundational concepts in NLP that you're already familiar with."
"Tokenization is the process of breaking down a text into smaller chunks, typically words or subwords. It's our first step in converting human language into something a machine can understand."
"After tokenizing, we gather all unique tokens to form our vocabulary. This vocabulary serves as a reference, allowing us to represent any word from our dataset."
One-hot Encoding:
"Once we have our vocabulary, how do we represent our tokens numerically? One method is one-hot encoding. For each word, we create a vector of the size of our vocabulary, with a '1' at the index of our word and '0's elsewhere. While straightforward, this method can be inefficient for large vocabularies and doesn't capture word relationships."
Python Code: Tokenization and One-hot Encoding
import numpy as np

# Sample text
text = "Artificial intelligence is the future of computing."

# Tokenization
tokens = text.lower().split()
print(f"Tokens: {tokens}")

# Create Vocabulary
vocab = sorted(set(tokens))  # sorted so word indices are the same on every run
vocab_size = len(vocab)
print(f"Vocabulary: {vocab}")

# One-hot Encoding
word_to_index = {word: i for i, word in enumerate(vocab)}
one_hot_vectors = np.zeros((len(tokens), vocab_size))

for i, token in enumerate(tokens):
    one_hot_vectors[i, word_to_index[token]] = 1

print("\nOne-hot Vectors:")
print(one_hot_vectors)

2.2 Introduction to the Concept of Embeddings (4 minutes)
Lecture Notes:
"Let's take a moment to understand the limitations of one-hot encoding. Imagine a vocabulary of 10,000 words. Each word is represented as a vector with one '1' and 9,999 '0's. This representation is sparse, meaning it's mostly filled with zeros. Moreover, every word is equidistant from every other word, so there's no notion of similarity."
"This is where embeddings come into play. Instead of representing words in a high-dimensional space where each word is isolated, we represent them in a lower-dimensional space where semantically similar words are closer together."
"Imagine a space where the word 'king' is close to 'queen' but far from 'apple'. This spatial representation is the essence of word embeddings. They provide a dense, continuous representation where the position and distance between vectors capture semantic meaning."
Python Code: Visualization of Word Embeddings
For this section, we'll use the gensim library to load pre-trained word embeddings and visualize them.
# Ensure students have the necessary libraries installed
!pip install gensim matplotlib scikit-learn

import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a pre-trained word embeddings model
word_vectors = api.load("glove-wiki-gigaword-100")

# Define some words to visualize
words = ["king", "queen", "man", "woman", "apple", "banana"]

# Get the vectors for the defined words
vectors = [word_vectors[word] for word in words]

# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

# Plot the words in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(result[:, 0], result[:, 1], s=100, c='blue', edgecolors='k')
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]), fontsize=12)
plt.title("Visualization of Word Embeddings")
plt.show()

"As you can see in this visualization, words with similar meanings or contexts are closer together. This is the power of word embeddings. They capture semantic relationships in a dense vector space, making them invaluable for various NLP tasks."
This section provides students with a clear understanding of why word embeddings are preferred over one-hot encodings and how they capture semantic relationships between words. The combination of lecture notes and Python code offers both a theoretical foundation and a practical demonstration.
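The contrast in this section between one-hot vectors and dense embeddings can be checked numerically. The sketch below compares the two for three words; the dense vectors are made-up toy values chosen for illustration, not output from a trained model, and cosine_similarity is a helper defined here.

```python
import numpy as np

words = ["king", "queen", "apple"]

# One-hot vectors are the rows of an identity matrix
one_hot = np.eye(len(words))

# Toy dense vectors, invented for illustration (not from a trained model)
dense = np.array([
    [0.90, 0.80, 0.10],  # king
    [0.85, 0.90, 0.15],  # queen
    [0.10, 0.20, 0.95],  # apple
])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, vectors in [("one-hot", one_hot), ("dense", dense)]:
    print(name)
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            dist = np.linalg.norm(vectors[i] - vectors[j])
            sim = cosine_similarity(vectors[i], vectors[j])
            print(f"  {words[i]:>5} vs {words[j]:<5} distance={dist:.3f} cosine={sim:.3f}")
```

Every one-hot pair comes out at the same distance with a cosine of zero, while the toy dense vectors place 'king' nearer to 'queen' than to 'apple'.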
2.3 Dimensionality Reduction and Its Importance (6 minutes)
Lecture Notes:
"Dimensionality reduction is a cornerstone concept in machine learning and data science. In simple terms, it's about representing data in a reduced form without losing much information. But why is it so crucial?"
"Consider our earlier example: a vocabulary of 10,000 words. With one-hot encoding, we're dealing with vectors of size 10,000 for each word. Not only is this inefficient in terms of memory, but it also doesn't capture relationships between words. Every word is equally distant from every other word."
"Dimensionality reduction techniques, like PCA (Principal Component Analysis), allow us to compress this information. Instead of 10,000 dimensions, we might represent our words in just 100 or 300 dimensions. This compression brings words with similar meanings closer together."
"Word embeddings are essentially a result of this idea. They are dense vectors, typically of size 100, 300, or even 768 for some advanced models, representing words in a way that captures semantic relationships."
Python Code: Visualization of Dimensionality Reduction
# Ensure students have the necessary libraries installed
!pip install matplotlib numpy scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assuming we have some dummy embeddings for visualization
dummy_embeddings = {
    'king': [1.2, 2.1, 0.8, 0.5],
    'queen': [1.1, 2.0, 0.7, 0.55],
    'man': [0.5, 0.7, 0.1, 0.2],
    'woman': [0.45, 0.65, 0.05, 0.25],
}

# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(list(dummy_embeddings.values()))

# Plot the words in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], s=100, c='blue', edgecolors='k')
for i, word in enumerate(dummy_embeddings.keys()):
    plt.annotate(word, xy=(reduced[i, 0], reduced[i, 1]), fontsize=12)
plt.title("Dimensionality Reduction using PCA")
plt.show()

"As you can see, even with this simple example, words that we intuitively understand to be related, like 'king' and 'queen', are closer together in this reduced space. This proximity allows our models to understand and leverage the semantic relationships between words, making them more effective in tasks like text classification, sentiment analysis, and more."
This section provides students with a clear understanding of why dimensionality reduction is essential, especially in the context of NLP. The lecture notes and Python code together offer a comprehensive overview of the topic, balancing theoretical insights with practical demonstration.
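The memory argument in the 10,000-word example can be worked out with a few lines of arithmetic. This sketch assumes every vector is stored as a full row of 32-bit floats, which is a simplification (in practice one-hot vectors are usually stored as plain indices); table_size_mb is a helper name invented here.

```python
def table_size_mb(vocab_size, dim, bytes_per_float=4):
    """Size in megabytes of a vocab_size x dim table of floats."""
    return vocab_size * dim * bytes_per_float / 1_000_000

vocab_size = 10_000

# Compare one-hot rows (dim == vocab_size) with dense rows of 300 and 100 dims
for dim in (vocab_size, 300, 100):
    label = "one-hot" if dim == vocab_size else "dense"
    entries = vocab_size * dim
    size = table_size_mb(vocab_size, dim)
    print(f"{label:>7}, {dim:>6} dims: {entries:>11,} entries, about {size:,.0f} MB")
```

Under these assumptions the one-hot table needs 100 times the storage of the 100-dimensional dense table.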
3. Word Embeddings: Deep Dive (20 minutes)
What are word embeddings?
How do they capture semantic relationships?
Visualization: Showing pre-trained embeddings on a 2D plane using t-SNE.
4. Building Word Embeddings: Techniques (20 minutes)
Count-based methods: Co-occurrence matrix, Singular Value Decomposition (SVD).
Prediction-based methods: Introduction to Word2Vec (Skip-Gram and CBOW).
Brief mention of other methods: FastText, GloVe.
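As a rough illustration of the count-based route listed above, the sketch below builds a co-occurrence matrix from four hand-coded sentences and compresses it with SVD. The corpus, the window of 2, and keeping k = 2 dimensions are all assumptions chosen to keep the example small; a real count-based pipeline would use far more text and usually reweights the counts first.

```python
import numpy as np

# Tiny hand-coded corpus (hypothetical sentences)
corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the man eats an apple",
    "the woman eats a banana",
]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix: count neighbours within +/- 2 words
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for pos, word in enumerate(sent):
        start, stop = max(0, pos - window), min(len(sent), pos + window + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos != pos:
                cooc[index[word], index[sent[ctx_pos]]] += 1

# Truncated SVD: keep the top-k singular directions as the embedding
k = 2
U, S, _ = np.linalg.svd(cooc)
embeddings = U[:, :k] * S[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_kq = cosine(embeddings[index["king"]], embeddings[index["queen"]])
sim_ka = cosine(embeddings[index["king"]], embeddings[index["apple"]])
print("king vs queen:", round(sim_kq, 3))
print("king vs apple:", round(sim_ka, 3))
```

In this corpus 'king' and 'queen' appear in identical contexts, so their vectors coincide. With so little data the other similarities are not meaningful on their own.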
5. Hands-on Activity: Implementing Word2Vec from Scratch (40 minutes)
Setting up the environment: Necessary libraries and dataset.
Preprocessing: Tokenization, creating vocabulary.
Implementing Skip-Gram model: Defining the neural network architecture.
Training the model: Using a small corpus for demonstration.
Visualizing the learned embeddings.
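The steps above, minus the visualization, can be sketched in plain NumPy as follows. This is one minimal way to write a Skip-Gram model with a full softmax, not necessarily the version built in class: the corpus, embedding size, learning rate, and epoch count are assumptions picked for a quick demonstration. A PyTorch version would follow the same structure.

```python
import numpy as np

# Preprocessing: tokenize a small hand-coded corpus and build the vocabulary
corpus = ["the king rules the land", "the queen rules the land"]
sentences = [s.lower().split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
word_to_index = {w: i for i, w in enumerate(vocab)}

# Skip-Gram training pairs: (center word, one context word inside the window)
window = 2
pairs = []
for sent in sentences:
    for pos, center in enumerate(sent):
        for ctx in range(max(0, pos - window), min(len(sent), pos + window + 1)):
            if ctx != pos:
                pairs.append((word_to_index[center], word_to_index[sent[ctx]]))

# Architecture: W_in holds the embeddings, W_out scores every possible context word
vocab_size, dim = len(vocab), 8  # dim is an assumed embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(dim, vocab_size))

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Training: plain stochastic gradient descent on the cross-entropy loss
learning_rate = 0.05
losses = []
for epoch in range(200):
    total = 0.0
    for center, context in pairs:
        hidden = W_in[center].copy()       # embedding lookup
        probs = softmax(hidden @ W_out)    # predicted context distribution
        total += -np.log(probs[context])
        grad_scores = probs.copy()
        grad_scores[context] -= 1.0        # gradient of the loss w.r.t. the scores
        grad_hidden = W_out @ grad_scores  # computed before W_out is updated
        W_out -= learning_rate * np.outer(hidden, grad_scores)
        W_in[center] -= learning_rate * grad_hidden
    losses.append(total / len(pairs))

print(f"average loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
print("embedding for 'king':", np.round(W_in[word_to_index["king"]], 3))
```

After training, the rows of W_in are the learned word vectors, and they can be passed to PCA for plotting.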
6. Evaluation of Word Embeddings (15 minutes)