
Lecture: Introduction to Building AI Model Embeddings with Artificial Neural Networks

Good morning class, and welcome to the exciting world of Artificial Intelligence! Today, we're going to take our first steps into building AI models, focusing on embeddings and how they work within the structure of artificial neural networks.
We'll be using Google Colab, an accessible platform that allows us to write and execute Python code through the browser.

What is an AI Model Embedding?

In the context of an AI language model, an embedding is a representation of data in a lower-dimensional space. (Note for Nov 13: we will do some R programming to introduce relational algebra and calculus concepts so you are grounded in these ideas.) Imagine you have a large number of data points, each with many features: words (tokens) and their weightings.
Embeddings: think of a BOX into which we can put these tokens and weightings.
What is this BOX, in terms of the concepts we have already introduced?

The PyTorch TENSOR FILE: an actual file that lives on the file system, which you deploy to your SERVER so users can converse with the model.

Embeddings allow us to convert these high-dimensional data points (a numeric matrix) into fewer dimensions, so that similar data points are placed closer together in this new space.

What is a DIMENSION? A descriptor of one of the categories of data in the training data set.

Embeddings are crucial when dealing with types of data like text and images, where traditional numerical representations can be very sparse and inefficient.
Video and BPEL data are handled in the same way, but with different kinds of adapters.
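To make the sparse-versus-dense point concrete, here is a minimal sketch; the vocabulary size, word position, and vector values are made up purely for illustration:

import numpy as np

# A hypothetical 10,000-word vocabulary: a one-hot vector is huge and sparse
vocab_size = 10000
one_hot_cat = np.zeros(vocab_size)
one_hot_cat[4242] = 1            # a single 1 among 10,000 positions

# An embedding maps the same word to a small dense vector (illustrative values)
cat_embedding = np.array([0.21, -0.73, 0.05, 1.10])

print(one_hot_cat.shape)         # (10000,): high-dimensional and sparse
print(cat_embedding.shape)       # (4,): low-dimensional and dense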

Why Use Artificial Neural Networks?

Artificial Neural Networks (ANNs) are inspired by the biological neural networks that constitute animal brains.
They are a series of algorithms that endeavor to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates.
ANNs are capable of learning (adapting to existing patterns) and of modeling complex patterns and decision boundaries.
How do they do this?
How do ANNs perform the pattern recognition that enables next-token generation? Bayesian training is available as a class with methods in NLTK (see the small NaiveBayesClassifier sketch below). We use this in conjunction with a language model such as Anthropic's Claude, or a model we can easily (and freely for simple cases) obtain through the Hugging Face Spaces API.
They are fundamental in many AI tasks like classification, regression, and even generative models.
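As a small taste of the Bayesian training mentioned above, here is a minimal sketch using NLTK's NaiveBayesClassifier; the feature sets and labels are invented for illustration only:

import nltk

# Toy labeled feature sets: each example is (features_dict, label)
train_data = [
    ({'contains_ai': True,  'contains_fox': False}, 'tech'),
    ({'contains_ai': False, 'contains_fox': True},  'animals'),
    ({'contains_ai': True,  'contains_fox': False}, 'tech'),
    ({'contains_ai': False, 'contains_fox': True},  'animals'),
]

# Train a Naive Bayes classifier and classify a new example
classifier = nltk.NaiveBayesClassifier.train(train_data)
print(classifier.classify({'contains_ai': True, 'contains_fox': False}))  # expected: 'tech'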

Building a Simple Neural Network in Google Colab

Google Colab is a free cloud service that supports Python. It's an ideal platform for machine learning and AI education because it provides free access to hardware acceleration (GPUs and TPUs), which are essential for training neural network models efficiently.
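If you want to confirm which accelerator your Colab runtime is using (after selecting one under Runtime > Change runtime type), a quick check looks like this; a minimal sketch, assuming TensorFlow is available, as it is by default in Colab:

import tensorflow as tf

# List any GPUs visible to TensorFlow; an empty list means a CPU-only runtime
print(tf.config.list_physical_devices('GPU'))
print('TensorFlow version:', tf.__version__)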

Class Lab Activity: Let's create a neural network that learns to represent words as embeddings and uses those embeddings to predict a word from the context in which it appears.
Step 1: Setting Up the Environment
First, we need to set up our environment in Google Colab. Here's how you start a new notebook and import the necessary libraries.
Open a browser and go to https://colab.research.google.com.
Click on "New Notebook" to create a new notebook.
# Import the required libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

Step 2: Preparing the Data
We'll need some training text data to work with. For simplicity, we will create a small corpus of sentences.
# Define a corpus of sentences
corpus = [
    'the quick brown fox jumps over the lazy dog',
    'I am learning AI in college',
    'building AI models is exciting',
    'deep learning is a branch of machine learning'
]

# Tokenize the corpus
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(corpus)
vocab_size = len(tokenizer.word_index) + 1 # +1 for padding
sequences = tokenizer.texts_to_sequences(corpus)

# Let's look at our vocabulary
print("Vocabulary:", tokenizer.word_index)

The tokenizer.word_index in Python is a dictionary data structure. When using the Tokenizer class from the Keras preprocessing text module, word_index provides a mapping of words (as strings) to their integer indices.
Each word is a key in the dictionary, and the corresponding value is the unique integer that has been assigned to that word.
Here's an example of what the word_index might look like if it were printed out:
{
'the': 1,
'learning': 2,
'ai': 3,
'in': 4,
'quick': 5,
'brown': 6,
'fox': 7,
'jumps': 8,
'over': 9,
'lazy': 10,
'dog': 11,
'i': 12,
'am': 13,
'college': 14,
'building': 15,
'models': 16,
'is': 17,
'exciting': 18,
'deep': 19,
'a': 20,
'branch': 21,
'of': 22,
'machine': 23
}

In this dictionary, each word from the corpus has been given a unique integer index. These indices are used internally by the neural network when processing text data, as the network itself cannot process raw text but instead processes numerical representations of the text.
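For example, once the tokenizer has been fit (assuming the cells above have been run), you can see this conversion directly; the exact indices depend on your corpus and may differ from the sample dictionary above:

# Convert a sentence into its integer sequence using the fitted tokenizer
print(tokenizer.texts_to_sequences(['the quick brown fox']))
# e.g. [[1, 5, 6, 7]]: one integer per word, looked up in word_index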


Step 3: Creating Training Data
We'll use the sequences to create our training data: for each word, the surrounding context words will be used to predict that word (a CBOW-style setup).
# Generate training pairs (context windows)
window_size = 2

def generate_context_word_pairs(corpus_seq, window_size, vocab_size):
    context_length = window_size * 2
    for words in corpus_seq:
        sentence_length = len(words)
        for index, word in enumerate(words):
            context_words = []
            label_word = []
            start = index - window_size
            end = index + window_size + 1

            # Collect the words around the current word (excluding the word itself)
            context_words.append([words[i]
                                  for i in range(start, end)
                                  if 0 <= i < sentence_length
                                  and i != index])
            label_word.append(word)

            x = tf.keras.preprocessing.sequence.pad_sequences(context_words, maxlen=context_length)
            y = tf.keras.utils.to_categorical(label_word, vocab_size)
            yield (x, y)

# Prepare the data for training
pairs_gen = generate_context_word_pairs(sequences, window_size, vocab_size)
pairs = list(pairs_gen)
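Before training, it is worth sanity-checking what the generator produces. Here is a quick look at the first pair, assuming the cells above have been run; the exact numbers depend on the tokenizer's word index:

# Inspect the first (context, target) training pair
x_sample, y_sample = pairs[0]
print('Context word indices:', x_sample)        # shape (1, 4): padded window of word ids
print('One-hot target shape:', y_sample.shape)  # (1, vocab_size)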

Step 4: Building the Neural Network
Now, we create a simple neural network with an embedding layer.
# Define the embedding dimension
embed_size = 128

# Build the model
model = Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=window_size*2),
    layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    layers.Dense(vocab_size, activation='softmax')
])

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Summary of the model
model.summary()

Step 5: Training the Model
# Train the model
for epoch in range(1000):
    loss, acc = 0, 0
    for x, y in pairs:
        loss, acc = model.train_on_batch(x, y)
    # Report progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f'Epoch {epoch + 1}: loss = {loss:.4f}, accuracy = {acc:.4f}')

What, and where exactly in this code, is the Embedding?
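One way to see it directly (a sketch, assuming the model above has been built and trained): the Embedding layer's weight matrix is the embedding itself, with one row per vocabulary word.

# The Embedding layer's weights ARE the learned embedding:
# a matrix with one row (of length embed_size) per vocabulary word
embedding_weights = model.layers[0].get_weights()[0]
print(embedding_weights.shape)    # (vocab_size, embed_size)

# Look up the learned vector for a single word
fox_id = tokenizer.word_index['fox']
print(embedding_weights[fox_id])  # the 128-dimensional embedding for 'fox'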

The Python program provided in the lab is designed to create an embedding for a dataset of short stories. Here's a detailed explanation of each line of the program:
Import the necessary libraries:
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

This block imports the necessary libraries for the program. pandas is a data manipulation library; Tokenizer from keras.preprocessing.text converts text into numerical data; pad_sequences from keras.preprocessing.sequence pads those sequences to a fixed length; Sequential from keras.models defines the architecture of the neural network; and Embedding, Flatten, and Dense from keras.layers are the layer types used to build that architecture.
Load the dataset:
data = pd.read_csv('short_stories.csv')

This line loads the dataset from a CSV file into a pandas DataFrame. The DataFrame data contains the text of each short story along with its title and a label column, which is used later as the training target.
Preprocess the text data:
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(data['story'])
sequences = tokenizer.texts_to_sequences(data['story'])
padded_sequences = pad_sequences(sequences, maxlen=100)  # pad/truncate every story to length 100

These lines preprocess the text data. The Tokenizer is initialized with a vocabulary cap of 10,000 words. It is then fit on the text of the stories, which builds a word index for every word in the text. The text is converted to sequences of integers, where each integer represents a specific word, and the sequences are then padded (or truncated) to a fixed length of 100 so that they match the input_length expected by the Embedding layer.
Define the architecture of the neural network:
model = Sequential()
model.add(Embedding(10000, 8, input_length=100))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

These lines define the architecture of the neural network. The Sequential model is used, which means the layers are added in sequence. The first layer is an Embedding layer with an input dimension of 10,000 (the maximum number of words), an output dimension of 8, and an input length of 100. The Embedding layer transforms the sequences of integers into dense vectors of fixed size. The Flatten layer then flattens the embedded sequence into a single vector, and the Dense layer is the output layer with a single neuron and a sigmoid activation function.
Compile the model:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

This line compiles the model using the RMSprop optimizer, the binary cross-entropy loss function, and accuracy as the evaluation metric.
Train the model:
model.fit(padded_sequences, data['label'], epochs=10, batch_size=32, validation_split=0.2)

This line trains the model on the padded sequences and the labels from the dataset. The model is trained for 10 epochs with a batch size of 32 and a validation split of 0.2, which means 20% of the data is held out for validation.
The output of the Embedding layer represents the embedding of the input text data. This embedding can be used for various tasks, such as natural language processing and text classification.
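If you want to inspect that embedding output yourself, one approach is a sub-model that stops at the Embedding layer; this is a minimal sketch assuming the model above has been built and trained on padded_sequences:

from keras.models import Model

# Build a sub-model whose output is the Embedding layer's output
embedding_extractor = Model(inputs=model.input, outputs=model.layers[0].output)

# Embeddings for the first story: shape (1, 100, 8), i.e. 100 tokens with 8 dimensions each
story_embeddings = embedding_extractor.predict(padded_sequences[:1])
print(story_embeddings.shape)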
I will test your project by talking to it: the next-token answers are generated by the model's OUTPUT layer, working from the representations produced by the Embedding layer.