
F23 AML3304 Assignment 1 Embeddings

Clarification Questions:
For DOCUMENTATION:
TRELLO (look at my Trello board as an example)
TEXT cells of your Google Colab Notebook.

The central task of Assignment 1 is to create an Embedding (which is a Python Data Structure).
You will create this Embedding using the procedures in this Lab Workbook:
You will choose a TEACHER model (and explain in your documentation WHY you chose that model).
Provision some input training data (for example, by walking over a text file, globbing all the text contents into a variable, and passing that variable into your training method; see the sketch below).
Once you have created the embedding, that is good for now.

Later on: You will use this embedding in your AI model to do Next Token Generation for your project.
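A minimal sketch of that procedure, assuming gensim is installed and a file named corpus.txt exists (both the file name and the choice of gensim's Word2Vec as the "teacher" are illustrative assumptions, not requirements):

from gensim.models import Word2Vec

# Walk over a text file and glob all the text contents into one variable
with open('corpus.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

# Very rough tokenization: one list of lowercase words per non-empty line
sentences = [line.lower().split() for line in raw_text.splitlines() if line.strip()]

# Pass that variable into the training method; the trained embedding is a Python data structure
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
first_word = model.wv.index_to_key[0]
print(first_word, model.wv[first_word][:10])   # a dense NumPy vector for each word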


The preferred way to ask questions is the Slack Channel:
Resources:

Dropbox Submit Location for all Assets:
HuggingFace Transformers Course:
3 Ways to use compute to do your assignment and project:
{Tell me in your TRELLO board what kind of compute you are using}
Local premises: Your own laptop: collaborate on code via GITHUB repo pushes and pulls. If you do this, provide your GITHUB REPO URL in your TRELLO Board.
Google Colab Workbook: Provide your URL in your TRELLO Board and make me a collaborator on your Google Colab Workbook.
Hugging Face Spaces: make me a member of your TEAM ROOM and post the URL to your TRELLO Board.
Colab, Hugging Face, and TRELLO Board links: after the course is over, post these on LINKEDIN to showcase your work to employers.

What to do for Assignment 1:

Make an Embedding!

Questions to answer: Answer these in your TRELLO Board.
The purpose is “meta-cognition”: (Thinking about what you are thinking about will soak the insights and intuitions about the subject into your Understanding Space faster).
Describe the MODEL that you are using in your code as the Model for your Embedding.
Research and Discuss WHY you chose that model.
How is it of particular value to your Project Business Domain?

*** Additional Requirement: Make a TEAM Trello Board for your Assignment and Project.

IMPORTANT: Add me as AN EDITOR of your TRELLO BOARD.

The EDIT access URL for your TRELLO Board: put it into a TEXT file.
Name the text file teamname.txt and upload it to the Dropbox submit location.


Course Outline: High-performance compute is what we need to make real-world AI language models work, at work, with realistically large input training data sets and with many users making many requests every day.


The preferred way to communicate about course materials, assignments and project is Slack:




Due: Friday, November 24
Format of how to hand this in:
You can work in Teams of up to 4 members.

Hand in format Options: {Your choice, any way you can do it is fine}

Can we simply submit a Google Colab notebook with markdown notes?
Your Google Colab notebook must contain:
- all team members' names/IDs
- answers to the assignment questions.

Local premises: VSC: put your team’s GITHUB URL into a TEXT file and submit it to the Dropbox location.

Team’s Hugging Face Spaces: test, prototype, and experiment with code.

Google Colab Workbook: hand in by making me an editing member of your Colab workbook.
Advantages of Colab: use cells to present code and documentation in Markdown format.
Easy for team members to collaborate.


Deliverables for your Assignment: Discuss these Questions in your TRELLO Board
Research and select a MODEL for your Embedding (and therefore later your Project), and support and defend your reasoning and decision making as to why you chose that MODEL for your Use Cases and Business Domain:
If you were doing this at work: what licensing and pricing considerations for using the APIs would factor into your decision?


Prerequisite understanding for doing this:
The concept of the PYTHON NEURON, and the concepts of the ANN and the GAN: In this Lecture Notebook:
Using a Cloud Graphical Processor to run your programs:
Optionally you can experiment with running code in

Resources:




For Assignment 1, you will create a Word Embedding.

How you will deliver this:
You can work in Teams of up to 4 people.
Your deliverables will include:
Documentation, which you will present as documentation cells in your Google Colab Notebook.
Code presented in your Google Colab notebook.
In addition to your code, I would like to see documentation explaining your thinking and design process in putting together your Embedding.

Learning Outcomes: Understand what an Embedding is.


Lecture and Lab Outline: Building an Embedding Starting with a Neuron

Lecture Outline:

Introduction
Recap of Neural Networks and ANNs.
Introduction to Embeddings: Definition and importance in Machine Learning.
Deep Dive into the Neuron
Revisit the concept of a neuron: weighted inputs, activation function, and output.
Importance of the neuron as the basic computational unit.
Lecture: From Neuron to Layer

Introduction

Understanding neural networks begins with grasping the concept of the fundamental unit that forms them: the neuron. Just as neurons in the brain work in harmony to enable complex thought and action, artificial neurons, or perceptrons, come together to form layers, enabling computational prowess in artificial neural networks (ANNs).

1. The Basic Neuron: A Quick Recap

a. Definition of a Neuron (Perceptron)
A neuron takes a set of inputs, applies weights, sums them up, and then passes the sum through an activation function to produce an output.
b. Components
Inputs: Data fed into the neuron (e.g., feature values).
Weights: Parameters that determine the importance or influence of a given input.
Bias: An additional parameter that allows for flexibility in fitting the model.
Activation Function: A function (e.g., sigmoid, ReLU) that introduces non-linearity and determines the neuron's output.
c. Mathematical Representation
y = f(Σᵢ wᵢxᵢ + b)

Where:
y is the output,
w are the weights,
x are the inputs,
b is the bias, and
f is the activation function.
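A small NumPy sketch of this formula (the input, weight, and bias values are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, 0.3, 0.2])   # inputs
w = np.array([0.4, -0.6, 0.9])  # weights
b = 0.1                         # bias

y = sigmoid(np.dot(w, x) + b)   # y = f(Σ wᵢxᵢ + b)
print(y)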

2. Transition from a Single Neuron to Multiple Neurons Forming a Layer

a. The Need for Multiple Neurons
Real-world data is multi-dimensional. A single neuron can't capture the complexity of such data.
Multiple neurons can learn different features or patterns from the input data.
b. Structure of a Layer
A layer consists of multiple neurons arranged in parallel.
Each neuron in a layer receives the same input but has its own weights, biases, and potentially different activation functions (see the sketch after this section).
c. Concept of Dense or Fully Connected Layers
In dense layers, each neuron in the previous layer is connected to every neuron in the next layer.
This ensures that features learned by one neuron are available for subsequent neurons.
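In code, moving from a single neuron to a dense layer is essentially swapping a weight vector for a weight matrix. A quick NumPy sketch (the layer size and random weights are illustrative):

import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([0.5, 0.3, 0.2])   # the same 3 inputs feed every neuron
W = np.random.randn(4, 3)       # 4 neurons, each with its own row of 3 weights
b = np.random.randn(4)          # one bias per neuron

layer_output = relu(W @ x + b)  # 4 outputs, one per neuron
print(layer_output)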

3. How Layers Work in Tandem in ANNs

a. Layered Architecture
Input Layer: Receives raw data. The number of neurons typically corresponds to the number of features.
Hidden Layer(s): Layers between the input and output. They process and transform data, extracting and refining features. Most of the "learning" occurs here.
Output Layer: Produces the final prediction or classification. The number of neurons corresponds to the number of output classes or regression outputs (a minimal Keras sketch of this architecture appears after this section).
b. Data Transformation Across Layers
As data progresses through layers, it undergoes transformations. Initial layers might learn basic patterns (e.g., edges in images or specific sounds in audio). Subsequent layers combine these to detect more complex features (shapes, textures, or semantic meaning).
c. Activation Functions in Layers
While deeper networks can model more complex relationships, they can also introduce challenges like vanishing and exploding gradients.
Careful selection of activation functions (e.g., ReLU for hidden layers) can help mitigate these issues.
d. Importance of Layer Design
The architecture of layers (number of layers, number of neurons per layer) directly impacts the network's performance and its ability to generalize.
Design decisions should be informed by the nature of the data and the problem being addressed.
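A minimal Keras sketch of the input / hidden / output structure described above (the layer sizes and the binary-classification setup are illustrative assumptions, not part of the assignment):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),             # input layer: 10 features
    tf.keras.layers.Dense(16, activation='relu'),   # hidden layer: learns basic patterns
    tf.keras.layers.Dense(8, activation='relu'),    # second hidden layer: refines them
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer: one prediction
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()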

Conclusion

Moving from the concept of a singular neuron to the intricate web of layers in an ANN helps us appreciate the layered structure's power and sophistication. Each layer, built upon a foundation of individual neurons, processes data, learns patterns, and collaboratively contributes to the network's final decision or prediction. As we further delve into neural network architectures, understanding this transition from neuron to layer remains fundamental.

Lab Activity: Building and Visualizing a Layer in Python

Objective: Understand the transition from a single neuron to a layer by constructing and visualizing a basic neural network layer using Python and TensorFlow/Keras.
Steps:
Setup: Install and import necessary libraries (TensorFlow, Keras, and others).
Single Neuron Implementation: Build a neuron that takes a set of inputs, applies weights and bias, and produces an output using an activation function.
Scaling to a Layer: Expand the single neuron to a layer with multiple neurons. Observe how each neuron produces different outputs for the same input.
Visualization: Use libraries like Matplotlib or Seaborn to visualize the outputs of each neuron in the layer for different inputs.
Discussion: Analyze how different neurons in a layer can capture different patterns or features from the same input.
By the end of this lecture and lab activity, students will have a concrete understanding of how individual neurons come together to form layers in a neural network, setting the foundation for more advanced topics in deep learning.
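A condensed sketch of these lab steps (TensorFlow/Keras and Matplotlib assumed; the layer size and random inputs are illustrative):

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

layer = tf.keras.layers.Dense(4, activation='relu')   # a layer of 4 neurons
x = np.random.rand(20, 3).astype('float32')           # 20 sample inputs, 3 features each
outputs = layer(x).numpy()                             # shape (20, 4): one column per neuron

# Visualization: each neuron responds differently to the same inputs
for i in range(outputs.shape[1]):
    plt.plot(outputs[:, i], label=f'neuron {i}')
plt.xlabel('sample')
plt.ylabel('activation')
plt.legend()
plt.show()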
Introduction to Embeddings
Definition and significance of embeddings in representing categorical data.
Differences between one-hot encoding and embeddings (a short code sketch follows this outline).
Building an Embedding Layer
Architecture of an embedding layer.
How embeddings reduce dimensionality and capture semantic relationships.
Embeddings in Practice
Practical applications: Word embeddings in NLP (e.g., Word2Vec, GloVe).
Importance in collaborative filtering for recommendation systems.
Conclusion
Recap of the journey from a single neuron to creating embeddings.
Future scope and advanced topics related to embeddings.
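A short sketch of the one-hot vs. embedding contrast noted in this outline (TensorFlow assumed; the vocabulary size and dimensions are made up):

import tensorflow as tf

vocab_size = 1000
token_ids = tf.constant([[3, 17, 256]])            # three tokens from a 1000-word vocabulary

one_hot = tf.one_hot(token_ids, depth=vocab_size)  # sparse: shape (1, 3, 1000), mostly zeros
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)
dense = embedding(token_ids)                        # dense: shape (1, 3, 8), learned during training

print(one_hot.shape, dense.shape)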

Lab Outline:

Setting Up the Environment
Installing necessary libraries and tools.
Overview of the dataset: Using a sample dataset with categorical features.
Building a Basic Neuron
Constructing a single neuron using Python.
Applying a sample input and observing the output after an activation function.
Constructing a Neural Network Layer
Expanding the single neuron to a basic neural network layer.
Forward pass with sample data.
One-hot Encoding vs. Embeddings
Implementing one-hot encoding on a sample categorical feature.
Observing the sparsity and dimensionality issues.
Building an Embedding Layer
Using TensorFlow/Keras to build an embedding layer.
Initializing and inspecting embedding weights.
Training a Neural Network with an Embedding Layer
Constructing a simple neural network model with an embedding layer for categorical data representation.
Training the model and adjusting embedding weights.
Visualizing Embeddings
Using tools like TensorBoard to visualize the embeddings (a simpler PCA-based alternative is sketched after this outline).
Observing how similar categories cluster together in the embedding space.
Conclusion and Cleanup
Summarizing the lab activities.
Resources for further exploration and learning.
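The lab outline suggests TensorBoard for visualization; a simpler alternative sketch uses PCA and Matplotlib instead (assumes scikit-learn is installed, plus a trained model and tokenizer like the ones built in the Keras example later in this document):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

weights = model.layers[0].get_weights()[0]           # (vocab_size, embedding_dim)
coords = PCA(n_components=2).fit_transform(weights)  # project embeddings to 2-D

for word, i in tokenizer.word_index.items():
    plt.scatter(*coords[i])
    plt.annotate(word, coords[i])
plt.title('Embedding space (PCA projection)')
plt.show()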

Summation:

Embeddings play a critical role in representing and processing categorical data in neural networks, especially in areas like natural language processing. Starting with the foundational concept of a neuron and building up to embeddings allows students to grasp the evolution and intricacies of data representation in deep learning. This lecture and lab sequence provides a structured approach to understand and implement embeddings, bridging the gap between theory and practical application.
Note that for the Project I will be demonstrating this using the standard teaching corpus, the Gutenberg Corpus:
This GitHub repo also has a lot of good sample Python code you can study and use for your Project.

Let's create an embedding for textual data using TensorFlow's Keras API. This example will show you how to convert text into dense vectors (embeddings) that capture the semantic meaning of the words.

Embedding Creation Using a Keras Embedding Layer

1. Preparing Data

Let's use a small corpus for simplicity:
corpus = [
    'I love machine learning',
    'I love deep learning',
    'Deep learning is a subfield of machine learning',
    'AI is fascinating',
    'Machine learning is fascinating'
]

2. Text Preprocessing

We'll tokenize the sentences and convert them to integers:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Convert text to sequences of integers (n-gram prefixes of each sentence)
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad sequences for equal length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

3. Creating the Embedding Model

We'll use an Embedding layer to convert integer sequences into dense vectors.
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_sequence_len-1)) # Embedding layer
model.add(LSTM(50))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

4. Prepare the Training Data and Train the Model

from tensorflow.keras.utils import to_categorical

# Splitting data into predictors and label
X = input_sequences[:,:-1]
y = input_sequences[:,-1]

# One-hot encoding the labels
y = to_categorical(y, num_classes=total_words)

# Training the model
model.fit(X, y, epochs=200, verbose=1)

5. Extracting the Embeddings

After training, the embedding for each word can be extracted from the weights of the Embedding layer.
embedding_layer = model.layers[0]
weights = embedding_layer.get_weights()[0]

# Create a dictionary to store the embeddings
word_embeddings = {}
for word, i in tokenizer.word_index.items():
    word_embeddings[word] = weights[i]

Now, word_embeddings contains the dense vector (embedding) for each word in the corpus. You can retrieve them using:
print(word_embeddings['machine'])

Remember that in practice, you would likely use a larger corpus and potentially a more complex model to capture more intricate semantic relationships between words. This is a simple example to illustrate the concept of embeddings. If you want pre-trained embeddings, you can use models like Word2Vec, GloVe, or FastText from libraries such as Gensim or TensorFlow's hub module.
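If you want to try pre-trained embeddings as mentioned above, Gensim's downloader makes this easy ('glove-wiki-gigaword-50' is one of the standard bundled models; the download happens on first use):

import gensim.downloader as api

glove = api.load('glove-wiki-gigaword-50')
print(glove['machine'][:10])                  # first 10 dimensions of the word vector
print(glove.most_similar('machine', topn=3))  # nearest neighbours in the embedding space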

Lecture: Role of the Python Neuron in Creating Word Embeddings

Introduction
How the concept of a Python neuron ties in with our recent lab on creating word embeddings.
We'll delve into the foundational role of the neuron in deep learning, its implementation in TensorFlow, and how it contributes to the process of creating word embeddings.
1. Revisiting the Concept of a Neuron
A neuron, or more formally, a perceptron, forms the basic building block of deep learning models. It receives one or more inputs, processes them, applies an activation function, and produces an output.
Key Points:
Weighted Sum: Inputs are multiplied by weights, which are adjusted during training.
Activation Function: Transforms the weighted sum into an output. Common examples include the sigmoid, tanh, and ReLU functions.
2. Neurons in Deep Learning Frameworks
In TensorFlow/Keras, the perceptron is abstracted away. Instead, you work with layers of neurons. However, each neuron within a layer fundamentally operates on the same principles we've discussed.
Key Points:
Dense Layer: In Keras, when you use a Dense layer, you're essentially deploying numerous neurons. Each neuron in a dense layer connects to every neuron in the previous layer.
Activation Function: When defining layers in Keras, you can specify the activation function, e.g., Dense(128, activation='relu').
3. Role of Neurons in the Embedding Lab
In our embedding lab, the core component was the Embedding layer. However, the deep learning model also included an LSTM layer and a Dense layer, both of which consist of multiple neurons.
Key Points:
Embedding Layer: This layer doesn't consist of traditional neurons but can be thought of as a lookup table where each word in our vocabulary is associated with a dense vector. It's the transformation of categorical data (words) into continuous vectors.
LSTM Layer: Long Short-Term Memory (LSTM) units are a type of neuron designed to remember patterns over time. Each LSTM unit is more complex than a simple neuron but operates on similar principles.
Dense Layer: The final Dense layer contains neurons that produce the output probabilities for each word in our vocabulary. It's these neurons that do the heavy lifting, determining the context around each word.
4. Training Process and the Role of Neurons
When we train our model, we're adjusting the weights in each neuron to minimize the difference between our predicted output and the actual output.
Key Points:
Backpropagation: As our model processes input data, it makes predictions. The difference between these predictions and the actual data is the "error." Using an algorithm called backpropagation, this error is passed backward through the network, adjusting the neuron weights along the way.
Optimization: The goal is to adjust the neuron weights to minimize this error. This optimization is done using algorithms like Adam, SGD, etc.
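A toy NumPy illustration of this weight-adjustment idea, using gradient descent on a single sigmoid neuron (the data, target, and learning rate are made up):

import numpy as np

x = np.array([0.5, 0.2, 0.1])
target = 1.0
w = np.zeros(3)
b = 0.0
lr = 0.5

for _ in range(200):
    y = 1 / (1 + np.exp(-(np.dot(w, x) + b)))  # forward pass
    grad = (y - target) * y * (1 - y)          # error backpropagated through the sigmoid
    w -= lr * grad * x                         # adjust weights
    b -= lr * grad                             # adjust bias

print(round(float(y), 3))                      # the prediction moves toward the target of 1.0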
5. The Final Picture: Neurons and Word Embeddings
In our lab, the neurons' ultimate goal was to adjust their weights to predict the next word in a sentence based on the context provided by the preceding words.
As the model gets better at this task, the weights within the Embedding layer adjust to capture the semantic relationships between words. By the end of training, words with similar meanings or that often appear in similar contexts have similar vectors, giving us our word embeddings.
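A quick way to check that claim is to compare vectors from the word_embeddings dictionary built in the lab code above; cosine similarity is one common measure:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(word_embeddings['machine'], word_embeddings['deep']))
print(cosine_similarity(word_embeddings['machine'], word_embeddings['fascinating']))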
Conclusion
At the heart of our word embedding creation process are the principles governing the operation of individual neurons. These neurons, working in harmony across layers, process and reprocess our data, adjusting their weights to capture the essence of our corpus's linguistic structure.
By understanding the role of these fundamental building blocks, we gain a deeper appreciation for the power and flexibility of deep learning models, from simple tasks to complex ones like creating word embeddings.
Next steps:
We will dive deeper into optimization techniques and their role in training deep learning models.


Grading Rubric for Embedding Creation Assignment

Ping me on Slack with questions or for more ideas!

Total Points: 100

1. Understanding & Theoretical Explanation (20 points)

Using TEXT cells in your Google Colab Notebook, provide:
Clear explanation of the concept of embeddings and their role in AI (5 points)
Explanation of algorithm or method chosen for embedding creation (5 points)
Discussion on why the selected method is suitable for the task (5 points)
Understanding of the relationship between embeddings and downstream tasks (5 points)

2. Data Preprocessing (15 points)

Proper handling of text data, including cleaning and tokenization (5 points)
Justification of preprocessing decisions (5 points)
Effective use of vocabularies, handling of out-of-vocabulary words, and special tokens (5 points)

3. Model Design & Implementation (25 points)

Appropriateness and complexity of the embedding model architecture (10 points)
Correct implementation of the embedding model (code correctness, use of APIs) (10 points)
Use of regularization, normalization, or other techniques to improve model (5 points)

4. Training Process (15 points)

Correctness in setting up the training loop (5 points)
Appropriate choice of loss functions and optimizers (5 points)
Implementation of evaluation metrics and validation checks (5 points)


6. Results & Evaluation (10 points)

Accuracy and quality of the resulting embeddings (5 points)
Clear presentation of results with appropriate metrics (5 points)

7. Discussion & Critical Analysis (5 points)

Discussion of the strengths and weaknesses of the created embeddings (2.5 points)

Bonus Points (up to 5 extra credit points):

Creativity and originality in the embedding approach (2 points)
Going beyond requirements by incorporating advanced techniques (like subword embeddings, contextual embeddings, etc.) (3 points)

Notes:

Late submissions: Deduct 10% of total score for each day late.
Code Quality: Clear, readable, and well-commented code will be crucial throughout all sections. Points may be deducted for poor code readability.

Additional Information:

Students are required to provide a report with their submission, documenting their approach, results, and analyses.
Code submitted must be executable and free of errors. Non-executable code will receive a significant deduction.
Use of external libraries and tools is allowed, but must be properly cited.
This rubric provides detailed criteria for each part of the embedding creation assignment, helping students understand the expectations and allowing them to focus their efforts accordingly. It also aids instructors in providing objective and consistent assessments.

Grading Rubric for First-Level Student Embedding Assignment

Total Points: 100

1. Presentation in Google Colab Notebook (25 points)

Organization: Logical flow of content with clear headings and subheadings (5 points)
Clarity: Code cells and outputs are presented in a way that is easy to follow and understand (5 points)
Comments & Explanations: Each code cell is accompanied by comments or markdown cells explaining the purpose and output of the code (10 points)
Visualization: Use of charts or tables to visually represent data, where applicable (5 points)

2. Use of Hugging Face APIs (25 points)

Correctness: Accurate use of the Hugging Face tokenizer and model API calls to generate embeddings (10 points)
Execution: Code is executable without errors in the Google Colab environment (10 points)
Experiments: Shows evidence of experimenting with different Hugging Face models or parameters (5 points)
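A hedged sketch of the kind of Hugging Face tokenizer and model calls this rubric refers to (the checkpoint name 'distilbert-base-uncased' is an illustrative choice; any embedding-capable model would do):

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')

inputs = tokenizer('Machine learning is fascinating', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

embedding = outputs.last_hidden_state.mean(dim=1)  # one vector per sentence
print(embedding.shape)                              # torch.Size([1, 768])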

3. Code Simplicity & Cleanliness (20 points)

Readability: Code is well-formatted, properly indented, and easy to read (10 points)
Documentation: In-line comments are present and useful for understanding the code (5 points)