
Gen AI Foundational Models for NLP & Language Training

Language Understanding with Neural Networks


Converting Words to Features

One-hot encoding converts categorical data into feature vectors.
The bag of words representation portrays a document as the aggregate or average of one-hot encoded vectors.
When you feed a bag-of-words vector to a neural network’s hidden layer, the output is the sum of the embeddings.
The Embedding and EmbeddingBag classes implement embedding and embedding bags in PyTorch.
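A minimal sketch of these ideas (the vocabulary size, embedding dimension, and token indices below are illustrative, not from the lab): summing per-token embeddings gives the same result as an EmbeddingBag with mode="sum".
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 10, 4

embedding = nn.Embedding(vocab_size, embed_dim)
embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
embedding_bag.weight = embedding.weight  # share the same lookup table

# a toy "document" of token indices
tokens = torch.tensor([2, 5, 5, 7])

# per-token embeddings, shape (4, 4)
per_token = embedding(tokens)

# EmbeddingBag aggregates the whole bag in one call;
# offsets marks where each document starts in the flat token tensor
bag = embedding_bag(tokens, offsets=torch.tensor([0]))

print(torch.allclose(per_token.sum(dim=0), bag.squeeze(0)))  # True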

Document Categorization Prediction with Torchtext

Document Classifier seamlessly categorizes articles by analyzing the text content.
Neural Network is a mathematical function consisting of a sequence of matrix multiplications with a variety of other functions.
Argmax function identifies the index of the highest logit value, corresponding to the most likely class.
Hyperparameters are externally set configurations of a neural network.
Prediction function:
Works on real text: it starts by taking in tokenized text.
Processes the text through the pipeline, and the model predicts the category.
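A minimal sketch of such a prediction function, assuming an EmbeddingBag-based classifier that takes token indices plus offsets; the names text_pipeline, model, and the AG NEWS label map are assumptions modeled on the torchtext tutorial, not the lab's exact code.
import torch

ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}  # assumed label map

def predict(text, text_pipeline, model):
    with torch.no_grad():
        # tokenize and map tokens to vocabulary indices
        token_ids = torch.tensor(text_pipeline(text))
        # EmbeddingBag-style models take a flat tensor plus offsets
        logits = model(token_ids, torch.tensor([0]))
        # argmax picks the index of the highest logit, i.e. the most likely class
        return ag_news_label[logits.argmax(dim=1).item() + 1]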

Document Categorization Training with Torchtext

A neural network functions via matrix and vector operations whose entries are learnable parameters.
In neural network training:
Learnable parameters are fine-tuned to enhance model performance
The process is steered by the loss function, which measures how far the model's predictions are from the true labels.
Cross-entropy is used to find the best parameters.
When the true distribution is unknown, you can estimate the expected loss by averaging the loss function over a set of samples. This is known as Monte Carlo sampling.
Optimization is used to minimize the loss (see the sketch after this list).
Three subsets of the partitioned data set are:
Training data
Validation data
Test data
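A minimal, self-contained sketch of cross-entropy loss and one optimization step; the toy model, sizes, and batch below are illustrative stand-ins for the lab's classifier.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim, num_classes = 50, 16, 4

embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
classifier = nn.Linear(embed_dim, num_classes)

criterion = nn.CrossEntropyLoss()  # cross-entropy over the class logits
optimizer = torch.optim.SGD(list(embedding_bag.parameters()) + list(classifier.parameters()), lr=0.1)

# a toy batch: two "documents" flattened into one tensor, with offsets marking starts
token_ids = torch.tensor([3, 7, 7, 12, 9, 2])
offsets = torch.tensor([0, 4])   # doc 1 = tokens[0:4], doc 2 = tokens[4:]
labels = torch.tensor([1, 3])    # true classes

logits = classifier(embedding_bag(token_ids, offsets))  # shape (2, num_classes)
loss = criterion(logits, labels)  # Monte Carlo estimate: average loss over the sampled batch

optimizer.zero_grad()
loss.backward()    # gradients of the loss w.r.t. the learnable parameters
optimizer.step()   # one gradient-descent step lowers the loss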

Training the Model in PyTorch

Training data is split into training and validation, and then data loaders are set up for training, validation, and testing
Batch size specifies the sample count for gradient approximation
Data shuffling promotes better optimization
When defining the model, init_weights helps with optimization
In the training loop (see the sketch after this list):
Iterate over each epoch
Set model to training mode and calculate the total loss
Divide data set into batches
Perform gradient descent
Update loss after each batch is processed.
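A minimal, self-contained sketch of the data split, data loaders, and training loop; the toy dataset, model, and hyperparameters below are illustrative stand-ins, not the lab's AG_NEWS code.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split

torch.manual_seed(0)

# toy data: 100 "documents" of 10 token indices each, 4 classes
features = torch.randint(0, 50, (100, 10))
labels = torch.randint(0, 4, (100,))
dataset = TensorDataset(features, labels)

# split the training data into training and validation subsets
num_train = int(len(dataset) * 0.8)
train_split, valid_split = random_split(dataset, [num_train, len(dataset) - num_train])

BATCH_SIZE = 16  # sample count used for each gradient approximation
train_loader = DataLoader(train_split, batch_size=BATCH_SIZE, shuffle=True)  # shuffling promotes better optimization
valid_loader = DataLoader(valid_split, batch_size=BATCH_SIZE, shuffle=False)

model = nn.Sequential(nn.Embedding(50, 16), nn.Flatten(), nn.Linear(10 * 16, 4))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

EPOCHS = 3
for epoch in range(1, EPOCHS + 1):      # iterate over each epoch
    model.train()                        # set the model to training mode
    total_loss = 0
    for batch_features, batch_labels in train_loader:  # the data set is divided into batches
        optimizer.zero_grad()
        loss = criterion(model(batch_features), batch_labels)
        loss.backward()
        optimizer.step()                 # gradient-descent step
        total_loss += loss.item()        # update the running loss after each batch
    print(f"epoch {epoch}: total loss {total_loss:.3f}")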

Classifying Documents (LAB)

Installing required libraries
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
!pip install -q pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# - Update a specific package
!pip install pmdarima -U
# - Update a package to specific version
!pip install --upgrade pmdarima==2.0.2
# Note: If your environment doesn't support "!pip install", use "!mamba install"
Importing the required libraries
from tqdm import tqdm
import numpy as np
import pandas as pd
from itertools import accumulate
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split

from torchtext.data.utils import get_tokenizer
from torchtext.datasets import AG_NEWS
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset

from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
import plotly.graph_objs as go
from IPython.display import Markdown as md

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
Defining helper functions
def plot(COST, ACC):
    fig, ax1 = plt.subplots()
    color = 'tab:red'
    ax1.plot(COST, color=color)
    ax1.set_xlabel('epoch', color=color)
    ax1.set_ylabel('total loss', color=color)
    ax1.tick_params(axis='y', color=color)

    ax2 = ax1.twinx()
    color = 'tab:blue'
    ax2.set_ylabel('accuracy', color=color)  # the x-label is already handled by ax1
    ax2.plot(ACC, color=color)
    ax2.tick_params(axis='y', color=color)

    fig.tight_layout()  # otherwise the right y-label is slightly clipped
    plt.show()
Summary Notes
One hot encoding converts categorical data (data representing groups or categories) into vectors
The Bag of words representation portrays a document as the aggregate or average of one-hot encoded vectors.
When you feed a bag of words vector to a neural network’s hidden layer, the output is the sum of the embeddings.
The Embedding and EmbeddingBag classes are used to implement embedding and embedding bags in PyTorch.
A document classifier seamlessly categorizes articles by analyzing the text content.
A neural network is a mathematical function consisting of a sequence of matrix multiplications with a variety of other functions.
The Argmax function identifies the index of the highest logit value, corresponding to the most likely class.
Hyperparameters are externally set configurations of a neural network.
The prediction function works on real text: it starts by taking in tokenized text, processes it through the pipeline, and the model predicts the category.
A neural network functions via matrix and vector operations whose entries are learnable parameters.
In neural network training, learnable parameters are fine-tuned to enhance model performance. This process is steered by the loss function, which measures how far the model's predictions are from the true labels.
Cross-entropy is used to find the best parameters.
When the true distribution is unknown, you can estimate the expected loss by averaging the loss function over a set of samples. This technique is known as Monte Carlo sampling.
Optimization is used to minimize the loss.
Generally, the data set should be partitioned into three subsets: training data for learning, validation data for hyperparameter tuning, and test data to evaluate real world performance.
The training data is split into training and validation, and then data loaders are set up for training, validation and testing.
Batch size specifies the sample count for gradient approximation, and shuffling the data promotes better optimization.
When you define your model, init_weights helps with optimization.
In the training loop:
Iterate over each epoch
Set the model to training mode
Calculate the total loss
Divide the data set into batches
Perform gradient descent
Update the loss after each batch is processed.

N-Gram Model

Predicting the next word on the basis of the previous words.
The model estimates P(w_t | w_(t-1), ..., w_(t-T)), with T being the context size.
Bigram Models
considers the immediate previous word to determine the probability.
context size = 1
Its limited context size can lead to incorrect predictions
Trigram Model
considers the previous two words (the previous word and the one before it) to predict the next word: P(w_t | w_(t-1), w_(t-2))
context size = 2
N-gram model
allows for any context size
Context Vector
its dimension is the product of the context size and the size of the vocabulary when built from one-hot vectors
Not computed directly, but constructed by concatenating the embedding vectors of the context words (see the sketch below)
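A minimal sketch of building the context vector by concatenating embeddings; vocab_size, embed_dim, and context_size below are illustrative toy values.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_size = 100, 8, 2
embeddings = nn.Embedding(vocab_size, embed_dim)

context = torch.tensor([[4, 27]])  # indices of the 2 previous words, batch of 1
context_vector = embeddings(context).view(1, -1)  # concatenated: shape (1, context_size * embed_dim)

# the equivalent one-hot construction would have dimension context_size * vocab_size,
# which is why it is never materialized directly
print(context_vector.shape)  # torch.Size([1, 16])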

Neural Network Architecture in N-Gram Models

[Figure: neural network architecture of an n-gram language model]

N-Grams as Neural Networks with PyTorch
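A minimal sketch of an n-gram language model as a neural network in PyTorch; the class name, layer sizes, and example tokens are illustrative assumptions, not the lab's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, context_size, hidden_dim=128):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        # the concatenated context vector feeds a hidden layer, then logits over the vocabulary
        self.fc1 = nn.Linear(context_size * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):
        # context: (batch, context_size) indices of the previous words
        x = self.embeddings(context).reshape(context.shape[0], -1)  # concatenate embeddings
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # logits for the next word

model = NGramLanguageModel(vocab_size=100, embed_dim=8, context_size=2)
logits = model(torch.tensor([[4, 27]]))  # predict the word following tokens 4 and 27
next_word = logits.argmax(dim=1)         # index of the most likely next word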

