
AML3304 Assignment: Build your own Word Embedding

Integration of Git Sprints, Issues and Actions:

Submission:
- Instructions will be provided by the instructor
This assignment is group work. A portion of each student's grade will be based on interview questions with the instructor.
1. Create Slack Team (i.e., workspace) for your lab group
2. Create GitHub private repo for your group
3. [5%] Add ZenHub shell to your GitHub repo
4. [5%] Integrate ZenHub notifications into Slack and create #zenhub slack channel for the notifications
5. [5%] Integrate GitHub notifications into Slack and create #github slack channel for the notifications
6. [5%] Create two Epics with two issues/requirements in each in ZenHub
7. [5%] Add estimates to each issue (select estimate values at random)
8. [5%] Create
a. “Sprint 1” starting on xxx, and ending on yyy, and
b. “Sprint 2” starting on xxx and ending on yyy.
9. [5%] For each Epic, assign one issue/requirement to Sprint 1 and the other to Sprint 2.
10. [5%] Close first issue in Sprint 1
11. [5%] Close Sprint 1
12. [5%] Add the user msi-ru-cs to your GitHub repository.
13. [5%] Send an invite to your Slack group to the "Instructor email" address
14. [10%] Build a Slack bot that echoes questions asked to it, and add it to your Slack group (a minimal sketch follows this list)
a. You will have to run your bot from your machine so that we can validate its functionality
15. [5%] Commit the source code of your Slack bot to GitHub repository
16. [5%] Create file ./git_test/index.html with some content
17. [10%] Create two additional branches and modify ./git_test/index.html in both branches in such a way that a merge conflict is provoked. Merge both branches into the master branch and resolve the conflict.
18. [5%] While committing the code in step 17, close one of the issues created in step 6 using the commit message.
19. [10%] Create a pull request from a branch named aml3304-pull-1 for some modification of ./git_test/index.html; do a code review (by commenting on the commit in the pull request); then merge and close the pull request.
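For item 14, here is a minimal sketch of an echo bot. It assumes the slack_bolt library in Socket Mode, with SLACK_BOT_TOKEN and SLACK_APP_TOKEN exported as environment variables and an app_mention event subscription enabled; adapt these assumptions to your own Slack app configuration.

import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

# Assumes SLACK_BOT_TOKEN and SLACK_APP_TOKEN are set in the environment
app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def echo_question(event, say):
    # Echo the text of any message that mentions the bot
    say(f"You asked: {event['text']}")

if __name__ == "__main__":
    # Socket Mode opens an outbound websocket, so the bot can run from your own machine (item 14a)
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()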
How to do the Assignment:
The Assignment will evolve into the Project.
For the Assignment: you will build an embedding and run a small-scale AI language model on it.
For the Project: you will create a fully featured AI language model and generate it with a fully developed CI/CD pipeline running in Azure.
** For your Project: you will try to connect with a local company here in Ontario, find out what problems they have, and see whether you can design an LLM to address those problems.

In this book, Davis discusses how Leibniz tried to create a mathematical system to express human thinking.
In concrete terms (working in Python with tools such as NLTK and PyTorch), an AI language model is ultimately the PyTorch tensor file: when you make an AI model, the artifact you produce is that file of learned weights.
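To make this concrete, here is a minimal sketch of saving and reloading a model as a tensor (weights) file. The toy nn.Embedding model and the file name embedding_model.pt are illustrative assumptions, not part of the assignment.

import torch
import torch.nn as nn

# A tiny "model": an embedding table of 1,000 words x 50 dimensions
model = nn.Embedding(num_embeddings=1000, embedding_dim=50)

# Saving the weights produces the tensor file that is the model artifact
torch.save(model.state_dict(), "embedding_model.pt")

# Reloading the weights restores the same model
restored = nn.Embedding(num_embeddings=1000, embedding_dim=50)
restored.load_state_dict(torch.load("embedding_model.pt"))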
Read this and see which of your ideas you can present in your Project Presentation. [Due for the January 23 class]
This discusses software architecture for the AI model:
How Embeddings Work (making an embedding is your assignment):

Due Week 7
In this assignment you will create a word embedding which you will roll into your Project, due at the end of the term.
Class Outline: Creating Word Embeddings from Scratch
1. Introduction (10 minutes)
Lecture Notes:
1.1 Brief Overview of the Day's Objectives (3 minutes)
Learning Outcomes:
What word embeddings are
Why they're crucial in NLP (natural language processing)
Hands-on experience creating them from scratch.
1.2 Relevance of Word Embeddings in the World of NLP (3 minutes)
"Language is inherently complex.
When we communicate, we don't just exchange words; we exchange meanings, emotions, and intentions.
What are the goals of a generative AI language model?
It should be contextually nuanced: we want to convey the impression that the language model is emotionally empathetic and concerned for the well-being of the person it is conversing with.

For machines to understand language, they need a way to capture this richness. This is where word embeddings come in."
"Imagine trying to teach a computer the meaning of the word 'king'. You could provide a dictionary definition, but that's just more words. Instead, what if you could represent 'king' as a point in space, close to 'queen' but far from 'apple'? This spatial representation, where words with similar meanings are closer together, is the essence of word embeddings."
1.3 How Word Embeddings Serve as a Foundation for Generative AI Language Models (4 minutes)
"Word embeddings are more than just a neat trick. They're foundational for advanced NLP tasks. When we talk about AI models that can write essays, answer questions, or even generate poetry, we're talking about models that, at their core, understand the relationships between words. And this understanding starts with word embeddings."
"Think of word embeddings as the base layer of a skyscraper. On their own, they might seem like a flat expanse. But they provide the stability and foundation upon which we can build towering structures. In our case, these 'towers' are generative AI language models."
Python Code:
For this section, there isn't any direct Python code since it's an introductory segment. However, you can provide a visual representation of word embeddings to make the concept clearer. Here's a simple example using gensim, scikit-learn, and matplotlib to visualize pre-trained word embeddings:
# Ensure students have the necessary libraries installed
!pip install gensim scikit-learn matplotlib

import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a pre-trained word embeddings model
word_vectors = api.load("glove-wiki-gigaword-100")

# Define some words to visualize
words = ["king", "queen", "man", "woman", "apple", "banana", "computer", "laptop"]

# Get the vectors for the defined words
vectors = [word_vectors[word] for word in words]

# Reduce dimensions for visualization
pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

# Plot the words in 2D space
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()

"Here's a simple visualization of some word embeddings. Notice how 'king' and 'queen' are closer to each other than, say, 'king' and 'apple'. This spatial closeness represents semantic similarity."
This introduction sets the stage for the rest of the session, ensuring students understand the importance and relevance of word embeddings in the broader context of NLP and AI.
2. Foundational Concepts (15 minutes)
Lecture Notes:
2.1 Quick Recap of Foundational NLP Concepts (5 minutes)
"Before we dive into word embeddings, let's revisit some foundational concepts in NLP that you're already familiar with."
Tokenization:
"Tokenization is the process of breaking down a text into smaller chunks, typically words or subwords. It's our first step in converting human language into something a machine can understand."
Vocabulary:
"After tokenizing, we gather all unique tokens to form our vocabulary. This vocabulary serves as a reference, allowing us to represent any word from our dataset."
One-hot Encoding:
"Once we have our vocabulary, how do we represent our tokens numerically? One method is one-hot encoding. For each word, we create a vector of the size of our vocabulary, with a '1' at the index of our word and '0's elsewhere. While straightforward, this method can be inefficient for large vocabularies and doesn't capture word relationships."
Python Code: Tokenization and One-hot Encoding
import numpy as np

# Sample text
text = "Artificial intelligence is the future of computing."

# Tokenization
tokens = text.lower().split()
print(f"Tokens: {tokens}")

# Create Vocabulary
vocab = set(tokens)
vocab_size = len(vocab)
print(f"Vocabulary: {vocab}")

# One-hot Encoding
word_to_index = {word: i for i, word in enumerate(vocab)}
one_hot_vectors = np.zeros((len(tokens), vocab_size))

for i, token in enumerate(tokens):
    one_hot_vectors[i, word_to_index[token]] = 1

print("\nOne-hot Vectors:")
print(one_hot_vectors)

2.2 Introduction to the Concept of Embeddings (4 minutes)
Lecture Notes:
"Let's take a moment to understand the limitations of one-hot encoding. Imagine a vocabulary of 10,000 words. Each word is represented as a vector with one '1' and 9,999 '0's. This representation is sparse, meaning it's mostly filled with zeros. Moreover, every word is equidistant from every other word, so there's no notion of similarity."
"This is where embeddings come into play. Instead of representing words in a high-dimensional space where each word is isolated, we represent them in a lower-dimensional space where semantically similar words are closer together."
"Imagine a space where the word 'king' is close to 'queen' but far from 'apple'. This spatial representation is the essence of word embeddings. They provide a dense, continuous representation where the position and distance between vectors capture semantic meaning."
Python Code: Visualization of Word Embeddings
For this section, we'll use the gensim library to load pre-trained word embeddings and visualize them.
# Ensure students have the necessary libraries installed
!pip install gensim scikit-learn matplotlib

import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a pre-trained word embeddings model
word_vectors = api.load("glove-wiki-gigaword-100")

# Define some words to visualize
words = ["king", "queen", "man", "woman", "apple", "banana"]

# Get the vectors for the defined words
vectors = [word_vectors[word] for word in words]

# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

# Plot the words in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(result[:, 0], result[:, 1], s=100, c='blue', edgecolors='k')
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]), fontsize=12)
plt.title("Visualization of Word Embeddings")
plt.grid(True)
plt.show()

"As you can see in this visualization, words with similar meanings or contexts are closer together. This is the power of word embeddings. They capture semantic relationships in a dense vector space, making them invaluable for various NLP tasks."
This section provides students with a clear understanding of why word embeddings are preferred over one-hot encodings and how they capture semantic relationships between words. The combination of lecture notes and Python code offers both a theoretical foundation and a practical demonstration.
2.3 Dimensionality Reduction and Its Importance (6 minutes)
Lecture Notes:
"Dimensionality reduction is a cornerstone concept in machine learning and data science. In simple terms, it's about representing data in a reduced form without losing much information. But why is it so crucial?"
"Consider our earlier example: a vocabulary of 10,000 words. With one-hot encoding, we're dealing with vectors of size 10,000 for each word. Not only is this inefficient in terms of memory, but it also doesn't capture relationships between words. Every word is equally distant from every other word."
"Dimensionality reduction techniques, like PCA (Principal Component Analysis), allow us to compress this information. Instead of 10,000 dimensions, we might represent our words in just 100 or 300 dimensions. This compression brings words with similar meanings closer together."
"Word embeddings are essentially a result of this idea. They are dense vectors, typically of size 100, 300, or even 768 for some advanced models, representing words in a way that captures semantic relationships."
Python Code: Visualization of Dimensionality Reduction
# Ensure students have the necessary libraries installed
!pip install numpy scikit-learn matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assuming we have some dummy embeddings for visualization
dummy_embeddings = {
    'king': [1.2, 2.1, 0.8, 0.5],
    'queen': [1.1, 2.0, 0.7, 0.55],
    'man': [0.5, 0.7, 0.1, 0.2],
    'woman': [0.45, 0.65, 0.05, 0.25]
}

# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(list(dummy_embeddings.values()))

# Plot the words in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], s=100, c='blue', edgecolors='k')
for i, word in enumerate(dummy_embeddings.keys()):
    plt.annotate(word, xy=(reduced[i, 0], reduced[i, 1]), fontsize=12)
plt.title("Dimensionality Reduction using PCA")
plt.grid(True)
plt.show()

"As you can see, even with this simple example, words that we intuitively understand to be related, like 'king' and 'queen', are closer together in this reduced space. This proximity allows our models to understand and leverage the semantic relationships between words, making them more effective in tasks like text classification, sentiment analysis, and more."
This section provides students with a clear understanding of why dimensionality reduction is essential, especially in the context of NLP. The lecture notes and Python code together offer a comprehensive overview of the topic, balancing theoretical insights with practical demonstration.
3. Word Embeddings: Deep Dive (20 minutes)
What are word embeddings?
How do they capture semantic relationships?
Visualization: Showing pre-trained embeddings on a 2D plane using t-SNE.
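Since this outline mentions a t-SNE visualization, here is a hedged sketch of what that could look like, assuming gensim, scikit-learn, and matplotlib are available and reusing the same GloVe vectors loaded earlier in these notes.

import numpy as np
import gensim.downloader as api
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the same pre-trained vectors used earlier in these notes
word_vectors = api.load("glove-wiki-gigaword-100")
words = ["king", "queen", "man", "woman", "apple", "banana", "computer", "laptop"]
vectors = np.array([word_vectors[w] for w in words])

# Perplexity must be smaller than the number of points; 3 is reasonable for 8 words
tsne = TSNE(n_components=2, perplexity=3, random_state=42)
points = tsne.fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for i, w in enumerate(words):
    plt.annotate(w, xy=(points[i, 0], points[i, 1]))
plt.title("t-SNE view of pre-trained word embeddings")
plt.show()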
4. Building Word Embeddings: Techniques (20 minutes)
Count-based methods: Co-occurrence matrix, Singular Value Decomposition (SVD); a minimal sketch follows this list.
Prediction-based methods: Introduction to Word2Vec (Skip-Gram and CBOW).
Brief mention of other methods: FastText, GloVe.
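As a concrete taste of the count-based approach listed above, here is a minimal sketch that builds a small co-occurrence matrix and factorizes it with SVD. The toy corpus and the context window of 1 are assumptions for illustration only.

import numpy as np

# Toy corpus (an assumption for illustration)
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man eats an apple",
    "the woman eats a banana",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Build a symmetric co-occurrence matrix with a context window of 1
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if i != j:
                cooc[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors
U, S, Vt = np.linalg.svd(cooc)
k = 2
embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per word
print({w: embeddings[index[w]].round(2).tolist() for w in vocab})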
5. Hands-on Activity: Implementing Word2Vec from Scratch (40 minutes)
Setting up the environment: Necessary libraries and dataset.
Preprocessing: Tokenization, creating vocabulary.
Implementing the Skip-Gram model: Defining the neural network architecture; a minimal sketch follows this list.
Training the model: Using a small corpus for demonstration.
Visualizing the learned embeddings.
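The hands-on activity above can be prototyped in a few dozen lines. The following is a minimal Skip-Gram sketch in PyTorch, assuming a tiny toy corpus, a context window of 2, and full-softmax training (real Word2Vec uses negative sampling or hierarchical softmax); treat it as a starting point, not a reference implementation.

import torch
import torch.nn as nn

# Toy corpus and vocabulary (assumptions for illustration)
corpus = "the king rules the kingdom and the queen rules beside the king".split()
vocab = sorted(set(corpus))
word_to_ix = {w: i for i, w in enumerate(vocab)}

# Build (center, context) training pairs with a window of 2
window = 2
pairs = []
for i in range(len(corpus)):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word_to_ix[corpus[i]], word_to_ix[corpus[j]]))

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)  # the word vectors we keep after training
        self.out = nn.Linear(dim, vocab_size)     # predicts the context word from the center word

    def forward(self, center_ids):
        return self.out(self.emb(center_ids))

model = SkipGram(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()

# model.emb.weight now holds one learned vector per vocabulary word
print(model.emb.weight.shape)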
6. Evaluation of Word Embeddings (15 minutes)
Intrinsic evaluation: Using analogy tasks (e.g., king - man + woman = ?); a short sketch follows this list.
Extrinsic evaluation: How embeddings can improve performance on downstream tasks.
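For the intrinsic (analogy) evaluation mentioned above, here is a hedged sketch using the same pre-trained GloVe vectors loaded earlier; the exact nearest neighbour depends on the vectors used, but for these it is typically 'queen'.

import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= ?
result = word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)] for these pre-trained vectors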
7. From Word Embeddings to Generative AI Models (15 minutes)
Brief overview of generative models in NLP.
How word embeddings serve as an input layer for models like RNNs, LSTMs, and transformers; a short sketch follows this list.
Teaser for the next classes: Building on these embeddings to create generative models.
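To illustrate the "input layer" idea mentioned above, here is a minimal sketch of an embedding layer feeding an LSTM in PyTorch; the vocabulary size, dimensions, and random token IDs are arbitrary assumptions for illustration.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 100, 128

embedding = nn.Embedding(vocab_size, embed_dim)  # maps word IDs to dense vectors
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# A batch of 2 "sentences", each 5 token IDs long (random IDs for illustration)
token_ids = torch.randint(0, vocab_size, (2, 5))
vectors = embedding(token_ids)       # shape: (2, 5, 100)
outputs, (h_n, c_n) = lstm(vectors)  # shape: (2, 5, 128)
print(outputs.shape)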
8. Assignment Discussion (10 minutes)
Task: Create word embeddings for a specific dataset/domain.
Evaluation criteria: Quality of embeddings, visualization, and insights derived.
Importance of this assignment as a precursor to the end-of-term project.
9. Q&A Session (10 minutes)
Addressing any doubts or questions from students.
Encouraging students to think critically about the potential and limitations of word embeddings.
10. Conclusion and Next Steps (5 minutes)
Recap of the day's learnings.
Assigning further readings and resources.
Setting expectations for the next class.
This outline provides a balanced mix of theory, practical implementation, and forward-looking insights. It ensures that students not only understand the concept of word embeddings but also get hands-on experience, setting them up perfectly for more advanced topics in the future.