
AML3304 Assignment: Build your own Word Embedding

Integration of Git Sprints, Issues and Actions:

Submission:
- Instructions will be provided by the instructor
This assignment is group work. A portion of each student's grade will be based on interview questions with the instructor.
1. Create Slack Team (i.e., workspace) for your lab group
2. Create GitHub private repo for your group
3. [5%] Add ZenHub shell to your GitHub repo
4. [5%] Integrate ZenHub notifications into Slack and create #zenhub slack channel for the notifications
5. [5%] Integrate GitHub notifications into Slack and create #github slack channel for the notifications
6. [5%] Create two Epics with two issues/requirements in each in ZenHub
7. [5%] Add estimates to each issue (select estimate values at random)
8. [5%] Create
a. “Sprint 1” starting on xxx, and ending on yyy, and
b. “Sprint 2” starting on xxx and ending on yyy.
9. [5%] For each Epic, assign one issue/requirement to Sprint 1 and the other to Sprint 2.
10. [5%] Close first issue in Sprint 1
11. [5%] Close Sprint 1
12. [5%] Add the user msi-ru-cs to your GitHub repository.
13. [5%] Send an invite to your Slack group to the "Instructor email" address
14. [10%] Build a Slack bot that echoes questions asked to it, and add it to your Slack group (a minimal sketch follows this list)
a. You will have to run your bot from your machine so that we can validate its functionality
15. [5%] Commit the source code of your Slack bot to GitHub repository
16. [5%] Create file ./git_test/index.html with some content
17. [10%] Create two additional branches and modify ./git_test/index.html in both branches in such a way that a merge conflict is provoked. Merge both branches into the master branch and resolve the conflict.
18. [5%] While committing the code in step 17, close one of the issues created in step 6 using the commit message.
19. [10%] Create a pull request from a branch named aml3304-pull-1 for some modification of ./git_test/index.html; do a code review (by commenting on the commit in the pull request); then merge and close the pull request.
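For item 14, here is a minimal sketch of an echo bot. It assumes the slack_bolt library in Socket Mode, with SLACK_BOT_TOKEN and SLACK_APP_TOKEN exported as environment variables and an app_mention event subscription enabled; adapt these assumptions to your own Slack app configuration.

import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

# Assumes SLACK_BOT_TOKEN and SLACK_APP_TOKEN are set in the environment
app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def echo_question(event, say):
    # Echo the text of any message that mentions the bot
    say(f"You asked: {event['text']}")

if __name__ == "__main__":
    # Socket Mode opens an outbound websocket, so the bot can run from your own machine (item 14a)
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()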
How to do the Assignment:
The Assignment will evolve into the Project.
For the Assignment: you will build an embedding and run a small-scale AI language model on it.
For the Project: you will create a fully featured AI language model and generate it with a fully developed CI/CD pipeline running in Azure.
** For your Project: you will try to connect with a local company here in Ontario, find out what problems they have, and see whether you can design an LLM to address those problems.

In this book, Davis discusses how Leibniz tried to create a mathematical system to express human thinking.
In concrete terms (working in Python with tools such as NLTK and PyTorch), an AI language model is ultimately the PyTorch tensor file: when you make an AI model, the artifact you produce is that file of learned weights.
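To make this concrete, here is a minimal sketch of saving and reloading a model as a tensor (weights) file. The toy nn.Embedding model and the file name embedding_model.pt are illustrative assumptions, not part of the assignment.

import torch
import torch.nn as nn

# A tiny "model": an embedding table of 1,000 words x 50 dimensions
model = nn.Embedding(num_embeddings=1000, embedding_dim=50)

# Saving the weights produces the tensor file that is the model artifact
torch.save(model.state_dict(), "embedding_model.pt")

# Reloading the weights restores the same model
restored = nn.Embedding(num_embeddings=1000, embedding_dim=50)
restored.load_state_dict(torch.load("embedding_model.pt"))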
Read this and see which of your ideas you can present in your Project Presentation. [Due for the January 23 class]
This discusses software architecture for the AI model:
How Embeddings Work (making an embedding is your assignment):

Due Week 7
In this assignment you will create a word embedding which you will roll into your Project, due at the end of the term.
Class Outline: Creating Word Embeddings from Scratch
1. Introduction (10 minutes)
Lecture Notes:
1.1 Brief Overview of the Day's Objectives (3 minutes)
Learning Outcomes:
What word embeddings are
Why they're crucial in NLP (natural language processing)
Hands-on experience creating them from scratch.
1.2 Relevance of Word Embeddings in the World of NLP (3 minutes)
"Language is inherently complex.
When we communicate, we don't just exchange words; we exchange meanings, emotions, and intentions.
What are the goals of a generative AI language model?
It should be contextually nuanced: we want to convey the impression that the language model is emotionally empathetic and concerned for the well-being of the person it is conversing with.

For machines to understand language, they need a way to capture this richness. This is where word embeddings come in."
"Imagine trying to teach a computer the meaning of the word 'king'. You could provide a dictionary definition, but that's just more words. Instead, what if you could represent 'king' as a point in space, close to 'queen' but far from 'apple'? This spatial representation, where words with similar meanings are closer together, is the essence of word embeddings."
1.3 How Word Embeddings Serve as a Foundation for Generative AI Language Models (4 minutes)
"Word embeddings are more than just a neat trick. They're foundational for advanced NLP tasks. When we talk about AI models that can write essays, answer questions, or even generate poetry, we're talking about models that, at their core, understand the relationships between words. And this understanding starts with word embeddings."
"Think of word embeddings as the base layer of a skyscraper. On their own, they might seem like a flat expanse. But they provide the stability and foundation upon which we can build towering structures. In our case, these 'towers' are generative AI language models."
Python Code:
For this section, there isn't any direct Python code since it's an introductory segment. However, you can provide a visual representation of word embeddings to make the concept clearer. Here's a simple example using gensim, scikit-learn, and matplotlib to visualize pre-trained word embeddings:
# Ensure students have the necessary libraries installed
!pip install gensim scikit-learn matplotlib

import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a pre-trained word embeddings model
word_vectors = api.load("glove-wiki-gigaword-100")

# Define some words to visualize
words = ["king", "queen", "man", "woman", "apple", "banana", "computer", "laptop"]

# Get the vectors for the defined words
vectors = [word_vectors[word] for word in words]

# Reduce dimensions for visualization
pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

# Plot the words in 2D space
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()

"Here's a simple visualization of some word embeddings. Notice how 'king' and 'queen' are closer to each other than, say, 'king' and 'apple'. This spatial closeness represents semantic similarity."
This introduction sets the stage for the rest of the session, ensuring students understand the importance and relevance of word embeddings in the broader context of NLP and AI.
2. Foundational Concepts (15 minutes)
Lecture Notes:
2.1 Quick Recap of Foundational NLP Concepts (5 minutes)
"Before we dive into word embeddings, let's revisit some foundational concepts in NLP that you're already familiar with."
Tokenization:
"Tokenization is the process of breaking down a text into smaller chunks, typically words or subwords. It's our first step in converting human language into something a machine can understand."
Vocabulary:
"After tokenizing, we gather all unique tokens to form our vocabulary. This vocabulary serves as a reference, allowing us to represent any word from our dataset."
One-hot Encoding:
"Once we have our vocabulary, how do we represent our tokens numerically? One method is one-hot encoding. For each word, we create a vector of the size of our vocabulary, with a '1' at the index of our word and '0's elsewhere. While straightforward, this method can be inefficient for large vocabularies and doesn't capture word relationships."
Python Code: Tokenization and One-hot Encoding
import numpy as np

# Sample text
text = "Artificial intelligence is the future of computing."

# Tokenization
tokens = text.lower().split()
print(f"Tokens: {tokens}")

# Create Vocabulary
vocab = set(tokens)
vocab_size = len(vocab)
print(f"Vocabulary: {vocab}")

# One-hot Encoding
word_to_index = {word: i for i, word in enumerate(vocab)}
one_hot_vectors = np.zeros((len(tokens), vocab_size))

for i, token in enumerate(tokens):
    one_hot_vectors[i, word_to_index[token]] = 1

print("\nOne-hot Vectors:")
print(one_hot_vectors)

2.2 Introduction to the Concept of Embeddings (4 minutes)
Lecture Notes:
"Let's take a moment to understand the limitations of one-hot encoding. Imagine a vocabulary of 10,000 words. Each word is represented as a vector with one '1' and 9,999 '0's. This representation is sparse, meaning it's mostly filled with zeros. Moreover, every word is equidistant from every other word, so there's no notion of similarity."
"This is where embeddings come into play. Instead of representing words in a high-dimensional space where each word is isolated, we represent them in a lower-dimensional space where semantically similar words are closer together."
"Imagine a space where the word 'king' is close to 'queen' but far from 'apple'. This spatial representation is the essence of word embeddings. They provide a dense, continuous representation where the position and distance between vectors capture semantic meaning."
Python Code: Visualization of Word Embeddings
For this section, we'll use the gensim library to load pre-trained word embeddings and visualize them.
# Ensure students have the necessary libraries installed
!pip install gensim scikit-learn matplotlib

import gensim.downloader as api
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load a pre-trained word embeddings model
word_vectors = api.load("glove-wiki-gigaword-100")

# Define some words to visualize
words = ["king", "queen", "man", "woman", "apple", "banana"]

# Get the vectors for the defined words
vectors = [word_vectors[word] for word in words]

# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
result = pca.fit_transform(vectors)

# Plot the words in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(result[:, 0], result[:, 1], s=100, c='blue', edgecolors='k')
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]), fontsize=12)
plt.title("Visualization of Word Embeddings")
plt.grid(True)
plt.show()

"As you can see in this visualization, words with similar meanings or contexts are closer together. This is the power of word embeddings. They capture semantic relationships in a dense vector space, making them invaluable for various NLP tasks."
This section provides students with a clear understanding of why word embeddings are preferred over one-hot encodings and how they capture semantic relationships between words. The combination of lecture notes and Python code offers both a theoretical foundation and a practical demonstration.
2.3 Dimensionality Reduction and Its Importance (6 minutes)
Lecture Notes:
"Dimensionality reduction is a cornerstone concept in machine learning and data science. In simple terms, it's about representing data in a reduced form without losing much information. But why is it so crucial?"
"Consider our earlier example: a vocabulary of 10,000 words. With one-hot encoding, we're dealing with vectors of size 10,000 for each word. Not only is this inefficient in terms of memory, but it also doesn't capture relationships between words. Every word is equally distant from every other word."
"Dimensionality reduction techniques, like PCA (Principal Component Analysis), allow us to compress this information. Instead of 10,000 dimensions, we might represent our words in just 100 or 300 dimensions. This compression brings words with similar meanings closer together."
"Word embeddings are essentially a result of this idea. They are dense vectors, typically of size 100, 300, or even 768 for some advanced models, representing words in a way that captures semantic relationships."
Python Code: Visualization of Dimensionality Reduction
# Ensure students have the necessary libraries installed
!pip install numpy scikit-learn matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assuming we have some dummy embeddings for visualization
dummy_embeddings = {
    'king': [1.2, 2.1, 0.8, 0.5],
    'queen': [1.1, 2.0, 0.7, 0.55],
    'man': [0.5, 0.7, 0.1, 0.2],
    'woman': [0.45, 0.65, 0.05, 0.25]
}

# Reduce dimensions for visualization using PCA
pca = PCA(n_components=2)
reduced = pca.fit_transform(list(dummy_embeddings.values()))

# Plot the words in 2D space
plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], s=100, c='blue', edgecolors='k')
for i, word in enumerate(dummy_embeddings.keys()):
    plt.annotate(word, xy=(reduced[i, 0], reduced[i, 1]), fontsize=12)
plt.title("Dimensionality Reduction using PCA")
plt.grid(True)
plt.show()

"As you can see, even with this simple example, words that we intuitively understand to be related, like 'king' and 'queen', are closer together in this reduced space. This proximity allows our models to understand and leverage the semantic relationships between words, making them more effective in tasks like text classification, sentiment analysis, and more."
This section provides students with a clear understanding of why dimensionality reduction is essential, especially in the context of NLP. The lecture notes and Python code together offer a comprehensive overview of the topic, balancing theoretical insights with practical demonstration.
3. Word Embeddings: Deep Dive (20 minutes)
What are word embeddings?
How do they capture semantic relationships?
Visualization: Showing pre-trained embeddings on a 2D plane using t-SNE.
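Since this outline mentions a t-SNE visualization, here is a hedged sketch of what that could look like, assuming gensim, scikit-learn, and matplotlib are available and reusing the same GloVe vectors loaded earlier in these notes.

import numpy as np
import gensim.downloader as api
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the same pre-trained vectors used earlier in these notes
word_vectors = api.load("glove-wiki-gigaword-100")
words = ["king", "queen", "man", "woman", "apple", "banana", "computer", "laptop"]
vectors = np.array([word_vectors[w] for w in words])

# Perplexity must be smaller than the number of points; 3 is reasonable for 8 words
tsne = TSNE(n_components=2, perplexity=3, random_state=42)
points = tsne.fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for i, w in enumerate(words):
    plt.annotate(w, xy=(points[i, 0], points[i, 1]))
plt.title("t-SNE view of pre-trained word embeddings")
plt.show()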
4. Building Word Embeddings: Techniques (20 minutes)
Count-based methods: Co-occurrence matrix, Singular Value Decomposition (SVD); a minimal sketch follows this list.
Prediction-based methods: Introduction to Word2Vec (Skip-Gram and CBOW).
Brief mention of other methods: FastText, GloVe.
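As a concrete taste of the count-based approach listed above, here is a minimal sketch that builds a small co-occurrence matrix and factorizes it with SVD. The toy corpus and the context window of 1 are assumptions for illustration only.

import numpy as np

# Toy corpus (an assumption for illustration)
corpus = [
    "the king rules the kingdom",
    "the queen rules the kingdom",
    "the man eats an apple",
    "the woman eats a banana",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Build a symmetric co-occurrence matrix with a context window of 1
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if i != j:
                cooc[index[w], index[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors
U, S, Vt = np.linalg.svd(cooc)
k = 2
embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per word
print({w: embeddings[index[w]].round(2).tolist() for w in vocab})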
5. Hands-on Activity: Implementing Word2Vec from Scratch (40 minutes)
Setting up the environment: Necessary libraries and dataset.
Preprocessing: Tokenization, creating vocabulary.
Implementing the Skip-Gram model: Defining the neural network architecture; a minimal sketch follows this list.
Training the model: Using a small corpus for demonstration.
Visualizing the learned embeddings.
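The hands-on activity above can be prototyped in a few dozen lines. The following is a minimal Skip-Gram sketch in PyTorch, assuming a tiny toy corpus, a context window of 2, and full-softmax training (real Word2Vec uses negative sampling or hierarchical softmax); treat it as a starting point, not a reference implementation.

import torch
import torch.nn as nn

# Toy corpus and vocabulary (assumptions for illustration)
corpus = "the king rules the kingdom and the queen rules beside the king".split()
vocab = sorted(set(corpus))
word_to_ix = {w: i for i, w in enumerate(vocab)}

# Build (center, context) training pairs with a window of 2
window = 2
pairs = []
for i in range(len(corpus)):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word_to_ix[corpus[i]], word_to_ix[corpus[j]]))

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)  # the word vectors we keep after training
        self.out = nn.Linear(dim, vocab_size)     # predicts the context word from the center word

    def forward(self, center_ids):
        return self.out(self.emb(center_ids))

model = SkipGram(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()

# model.emb.weight now holds one learned vector per vocabulary word
print(model.emb.weight.shape)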
6. Evaluation of Word Embeddings (15 minutes)
Intrinsic evaluation: Using analogy tasks (e.g., king - man + woman = ?); a short sketch follows this list.
Extrinsic evaluation: How embeddings can improve performance on downstream tasks.
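For the intrinsic (analogy) evaluation mentioned above, here is a hedged sketch using the same pre-trained GloVe vectors loaded earlier; the exact nearest neighbour depends on the vectors used, but for these it is typically 'queen'.

import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= ?
result = word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)] for these pre-trained vectors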
7. From Word Embeddings to Generative AI Models (15 minutes)
Brief overview of generative models in NLP.
How word embeddings serve as an input layer for models like RNNs, LSTMs, and transformers; a short sketch follows this list.
Teaser for the next classes: Building on these embeddings to create generative models.
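To illustrate the "input layer" idea mentioned above, here is a minimal sketch of an embedding layer feeding an LSTM in PyTorch; the vocabulary size, dimensions, and random token IDs are arbitrary assumptions for illustration.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 100, 128

embedding = nn.Embedding(vocab_size, embed_dim)  # maps word IDs to dense vectors
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# A batch of 2 "sentences", each 5 token IDs long (random IDs for illustration)
token_ids = torch.randint(0, vocab_size, (2, 5))
vectors = embedding(token_ids)       # shape: (2, 5, 100)
outputs, (h_n, c_n) = lstm(vectors)  # shape: (2, 5, 128)
print(outputs.shape)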
8. Assignment Discussion (10 minutes)
Task: Create word embeddings for a specific dataset/domain.
Evaluation criteria: Quality of embeddings, visualization, and insights derived.
Importance of this assignment as a precursor to the end-of-term project.
9. Q&A Session (10 minutes)
Addressing any doubts or questions from students.
Encouraging students to think critically about the potential and limitations of word embeddings.
10. Conclusion and Next Steps (5 minutes)
Recap of the day's learnings.
Assigning further readings and resources.
Setting expectations for the next class.
This outline provides a balanced mix of theory, practical implementation, and forward-looking insights. It ensures that students not only understand the concept of word embeddings but also get hands-on experience, setting them up perfectly for more advanced topics in the future.