Gen AI & LLMs: Architecture & Data Preparation

Significance of Generative AI

Generative AI refers to deep-learning models that can generate various types of content such as text, images, audio, 3D objects and music
Text
Contextually aware models, e.g., GPT
Image
From Text input
From Seed Image or random input
Example - GAN (Generative Adversarial Network), Diffusion Model
Audio
Generate Natural Sounding speech
Text to speech synthesis
Example - WaveNet
Applications of Generative AI
Content Creation
Condensing Documents
Language Translation
Chatbots and Virtual Assistants
Data Analysis

Generative AI Architectures and Models

Generative AI architectures and models include RNNs, transformers, GANs, VAEs, and diffusion models.
RNN - Recurrent Neural Networks
Use sequential or time-series data and a loop-based design for training
Transformers
They utilize the self-attention mechanism to focus on the most important parts of the information
GAN - Generative Adversarial Networks
Consists of a generator and discriminator, which work in a competitive mode
VAEs - Variational Auto Encoder
Operate on an encoder-decoder framework and create samples based on similar characteristics
Diffusion models
Generate creative images by learning to remove noise and reconstruct distorted examples, relying on statistical properties

Generative AI for NLP (Natural Language Processing)

Evolution of AI for NLP
Rule-Based System - Follows predefined linguistic rules
Machine Learning based approach - Employs statistical methods
Deep Learning architecture - Uses artificial neural networks trained on extensive data sets
Transformers - Designed specifically to handle sequential data, has greater ability to understand context
Large Language Models - LLMs
Uses AI and deep learning with vast data sets
Involves training data sets of huge sizes, even reaching petabytes (1PB =1 Million GB)
Contains billions of parameters, which are finetuned during training
Examples
GPT - Generative Pretrained transformers
BERT - Bidirectional Encoder Representation from Transformers
BART - Bidirectional and Auto-Regressive Transformer
T5 - Text To Text Transfer Transformer
Hallucinations in LLMs
Generating outputs presented as accurate but judged unrealistic, inaccurate, or nonsensical by humans
Can result in the generation of inaccurate information, the creation of biased views, and wrong input provided to sensitive applications.
Prevent/Avoid hallucinations through
Extensive training with high-quality data
Avoiding manipulation
Ongoing evaluation and improvement of the models
Fine-tuning on domain specific data
Being vigilant
Ensuring human oversight and
Providing additional context in the prompt
Libraries and Tools in NLP
PyTorch - an open source deep learning framework, python based and well known for its ease of use, flexibility, and dynamic computation graphs.
TensorFlow - open source framework for machine learning and deep learning, provides tools and libraries to facilitate the development and deployment of machine learning models
Keras - A tight integration of TensorFlow with Keras provides a user-friendly high-level neural networks API, facilitating rapid prototyping and building and training deep learning models.
Hugging Face - platform that offers an open source library with pre-trained models and tools to streamline the process of training and fine-tuning generative AI models. It offers libraries such as Transformers, Datasets, and Tokenizers.
LangChain - an open-source framework that helps streamline AI application development using LLMs. It provides tools for designing effective prompts.
Pydantic - Python library that helps you streamline data handling. It ensures the accuracy of data types and formats before an application processes them.
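As a sketch of how Pydantic enforces data types before processing, consider the toy model below (the `Prompt` class and its fields are illustrative, not part of any library):

```python
from pydantic import BaseModel, ValidationError

# Hypothetical model for an LLM prompt payload
class Prompt(BaseModel):
    text: str
    max_tokens: int = 100

# Valid input: the string "50" is coerced to the declared int type
p = Prompt(text="Hello", max_tokens="50")
print(p.max_tokens)  # 50

# Invalid input is rejected before the application processes it
try:
    Prompt(text="Hello", max_tokens="many")
except ValidationError as e:
    print("rejected:", len(e.errors()), "error(s)")
```

Because the types are checked at construction time, bad data fails fast instead of propagating into the application.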

Text Generation before Transformers

N-Gram Models
They predict what words come next in a sentence based on the words that came before.
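The idea can be sketched with a toy bigram (2-gram) model in plain Python; the corpus below is made up for illustration:

```python
from collections import Counter, defaultdict

corpus = "i love nlp i love transformers i study nlp".split()

# Count each (previous word, next word) pair
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent word following `word` in the corpus."""
    counts = bigrams.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("i"))    # 'love' (seen twice, vs. 'study' once)
print(predict_next("nlp"))  # 'i'
```

Real n-gram models use longer contexts and smoothing, but the principle is the same: prediction from counts of preceding words.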
Recurrent Neural Networks (RNN)
They are specially designed to handle sequential data, making them powerful for applications like language modeling and time series forecasting.
The essence of their design lies in maintaining a ‘memory’ or ‘hidden state’ throughout the sequence by employing loops.
This enables RNN to recognize and capture the temporal (time related) dependencies inherent in the sequential data.
Hidden state
often referred to as the network’s ‘memory’, the hidden state is a dynamic storage of information about previous sequence inputs. With each new input, this hidden state is updated, factoring in both the new input and its previous value.
Temporal dependency
Loops in RNNs enable information transfer across sequence steps.
Illustration of RNNs operation
“I love RNNs” - RNN goes on to interpret this sentence word by word,
First it ingests the word “I”, generates an output and updates its hidden state
Then it moves to “love”; the RNN processes it together with the hidden state, which ideally holds insights about the word “I”, and the hidden state is updated again.
This pattern of processing and updating continues till the last word is reached.
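This walk-through can be sketched in PyTorch. The word vectors below are random placeholders, not trained embeddings; the point is only that the hidden state carries context from step to step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)

# Three "words", each a 4-dimensional vector: shape [batch=1, seq_len=3, 4]
sentence = torch.randn(1, 3, 4)

hidden = torch.zeros(1, 1, 3)           # initial hidden state
for t in range(sentence.size(1)):       # process one word at a time
    word = sentence[:, t:t+1, :]
    output, hidden = rnn(word, hidden)  # hidden carries earlier context forward
    print(f"step {t}: hidden = {hidden.squeeze().tolist()}")
```

Feeding the whole sequence to `rnn(sentence)` at once produces the same final hidden state; the loop just makes the step-by-step update explicit.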
Long short-term memory (LSTM) and Gated Recurrent Units (GRUs)
Variants of RNNs
Designed to address limitations of traditional RNNs and enhance their ability to model the sequential data effectively.
They were effective for a variety of tasks but struggled with long sequences and long-term dependencies.
Seq2seq models with attention
Sequence-to-sequence models - built with RNNs or LSTMs, designed to handle tasks like translation where an input sentence is transformed into an output sentence.
Attention was introduced to allow the model to “focus” on relevant parts of the input sequence when generating the output, significantly improving performance on tasks like machine translation.
While these methods provided significant advancements in text generation tasks, the introduction of transformers led to a paradigm shift. Transformers, with their self-attention mechanism, proved to be highly efficient at capturing contextual information across long sequences, setting new benchmarks across various NLP tasks.

Transformers

Replaced sequential processing with parallel processing.
The key component behind their success is the attention mechanism, more precisely self-attention.
Key steps
Tokenization - breaking down the sentence into tokens
Embedding - Each token represented as a vector, capturing its meaning
Self-Attention - The model computes scores determining the importance of every other word for a particular word in the sequence. These scores are used to weight the input tokens and produce a new representation of the sequence.
Feed-forward neural networks - After attention, each position is passed through a feed-forward network separately.
Output Sequence: The model produces an output sequence, which can be used for various tasks, like classification, translation or text generation.
Layering: Importantly, transformers are deep models with multiple layers of attention and feed-forward networks, allowing them to learn complex patterns.
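The self-attention step above can be sketched in a few lines of NumPy. This is a toy illustration, not the full transformer: in a real model, queries, keys, and values come from learned projections of the embeddings, while here the embeddings are used directly:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 3 tokens embedded as 4-dimensional vectors (random placeholders)
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))

scores = X @ X.T / np.sqrt(X.shape[1])  # importance of every token for every other
weights = softmax(scores)               # each row sums to 1
new_repr = weights @ X                  # weighted mix of all token vectors

print(weights.round(2))  # 3x3 attention matrix
print(new_repr.shape)    # (3, 4): a new representation per token
```

Each token's new vector is a weighted combination of every token in the sequence, which is what lets the model incorporate context from anywhere in the input.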

Implementation - Building a simple chatbot with transformers

Building a simple chatbot using the transformers library from Hugging Face, an open-source NLP toolkit.
Step 1: Installing the libraries
!pip install -qq tensorflow
!pip install -qq transformers
!pip install sentencepiece
!pip install torch==2.2.2
!pip install torchtext==0.17.2
#!pip install --upgrade numpy transformers torch
Step 2: Importing required tools from the transformers library
In the code below, we initiate two important classes from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Selecting the model. You will be using "facebook/blenderbot-400M-distill" in this example.
model_name = "facebook/blenderbot-400M-distill"

# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model is an instance of the class AutoModelForSeq2SeqLM.
This class lets you interact with your chosen language model
tokenizer is an instance of the class AutoTokenizer.
This class streamlines your input and presents it to the language model in the most efficient manner.
It achieves this by converting your text inputs into “tokens”, which is the model’s preferred way of interpreting text.
We have chosen “facebook/blenderbot-400M-distill” for this example model because it is freely available under an open-source license and operates at a relatively brisk pace.
For other models and their capabilities, explore:
Following the initialization, let’s set up the chat function to enable real-time interaction with the chatbot.
# Define the chat function
# Define the chat function
def chat_with_bot():
    while True:
        # Get user input
        input_text = input("You: ")

        # Exit conditions
        if input_text.lower() in ["quit", "exit", "bye"]:
            print("Chatbot: Goodbye!")
            break

        # Tokenize input and generate response
        inputs = tokenizer.encode(input_text, return_tensors="pt")
        outputs = model.generate(inputs, max_new_tokens=150)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

        # Display bot's response
        print("Chatbot:", response)

# Start chatting
chat_with_bot()

Now the chatbot takes the input prompt and uses the transformers library and the underlying model to generate a response.
Step 3: Trying another language model and comparing the output
In the code below, we use a different language model, the “google/flan-t5-base” model from Google, to create a similar chatbot.
import sentencepiece
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Let's chat with another bot
def chat_with_another_bot():
    while True:
        # Get user input
        input_text = input("You: ")

        # Exit conditions
        if input_text.lower() in ["quit", "exit", "bye"]:
            print("Chatbot: Goodbye!")
            break

        # Tokenize input and generate response
        inputs = tokenizer.encode(input_text, return_tensors="pt")
        outputs = model.generate(inputs, max_new_tokens=150)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

        # Display bot's response
        print("Chatbot:", response)

# Start chatting
chat_with_another_bot()

Tokenization

Tokenization breaks the sentence into smaller pieces or tokens. Tokenizers, such as NLTK and spaCy, generate tokens.
Word Based - preserves semantic meaning.
Character Based - smaller vocabularies but may not convey the same information as entire words.
Sub-word Based - allows frequently used words to stay unsplit while breaking down infrequent words.
can be implemented using the WordPiece, Unigram, and SentencePiece algorithms.
You can add special tokens such as <bos> at the beginning and <eos> at the end of a tokenized sentence.
Tokenization and Indexing in PyTorch
Use torchtext library for tokenization
Use the build_vocab_from_iterator function
creates a vocabulary from the tokens
assigns each token a unique index

Implementing Tokenization

Libraries needed
nltk - Natural Language Toolkit
Employed for data management tasks
Offers comprehensive tools and resources for natural language text, making it a valuable choice for tasks such as text preprocessing and analysis
spaCy
Open-source software library for advanced natural language processing in Python.
known for its speed and accuracy in processing large volumes of text data
BertTokenizer
part of the Hugging Face Transformers library
specifically designed for tokenizing text according to the BERT model’s specifications
XLNetTokenizer
part of the Hugging Face Transformers library
Tailored for tokenizing text in alignment with the XLNet model’s requirement
torchtext
It is part of the PyTorch ecosystem
Simplifies the process of working with text data and provides functionalities for data preprocessing, tokenization, vocabulary management and batching
Installing required libraries
!pip install nltk
!pip install transformers
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install numpy scikit-learn
!pip install torch==2.2.2
!pip install torchtext==0.17.2
Importing required libraries
import nltk
nltk.download("punkt")
nltk.download('punkt_tab')
import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from transformers import BertTokenizer
from transformers import XLNetTokenizer

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

Word Based Tokenizer - nltk

Splitting of text based on words
There are different rules for word based tokenizers, such as splitting on spaces or splitting on punctuation. Each option assigns a specific ID to the split word.
In the following example we use nltk’s word_tokenize
text = "This is a sample sentence for word tokenization"
tokens = word_tokenize(text)
print(tokens)