Significance of Generative AI
Generative AI refers to deep-learning models that can generate various types of content, such as text, images, audio, 3D objects, and music.
Text generation - contextually aware models, for example GPT
Image generation - from a seed image or random input, for example GANs (Generative Adversarial Networks) and diffusion models
Audio generation - models that generate natural-sounding speech
Applications of Generative AI - chatbots and virtual assistants

Generative AI Architectures and Models
Generative AI architectures and models include RNNs, transformers, GANs, VAEs, and diffusion models.
RNNs - Recurrent Neural Networks: use sequential or time-series data and a loop-based design for training
Transformers - utilize the self-attention mechanism to focus on the most important parts of the information
GANs - Generative Adversarial Networks: consist of a generator and a discriminator, which work in a competitive mode
VAEs - Variational Autoencoders: operate on an encoder-decoder framework and create samples based on similar characteristics
Diffusion Models - generate creative images by learning to remove noise and reconstruct distorted examples, relying on statistical properties
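To make the generator-discriminator competition concrete, here is a minimal NumPy sketch: a single forward pass with arbitrarily chosen layer sizes and random weights, showing only the structure of the adversarial objectives, not a working training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, W):
    # Map random noise z to a fake sample (one linear layer, for illustration)
    return np.tanh(z @ W)

def discriminator(x, w):
    # Score a sample: probability that x is real (logistic regression, for illustration)
    return 1.0 / (1.0 + np.exp(-(x @ w)))

noise_dim, data_dim = 4, 8
W = rng.normal(size=(noise_dim, data_dim))   # generator weights
w = rng.normal(size=data_dim)                # discriminator weights

z = rng.normal(size=noise_dim)               # random input ("seed")
fake = generator(z, W)                       # generated sample
p_fake = discriminator(fake, w)              # discriminator's belief that it is real

# Adversarial objectives: the discriminator wants p_fake -> 0,
# while the generator wants p_fake -> 1.
d_loss = -np.log(1.0 - p_fake)               # discriminator's loss on the fake sample
g_loss = -np.log(p_fake)                     # generator's loss
print(p_fake, d_loss, g_loss)
```

In an actual GAN, both sets of weights would be updated by gradient descent on these opposing losses, which is the "competitive mode" described above.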
Generative AI for NLP (Natural Language Processing)
Rule-based systems - follow predefined linguistic rules
Machine-learning-based approaches - employ statistical methods
Deep-learning architectures - use artificial neural networks trained on extensive data sets
Transformers - designed specifically to handle sequential data, with a greater ability to understand context

Large Language Models (LLMs)
Use AI and deep learning with vast data sets
Training involves data sets of huge sizes, even reaching petabytes (1 PB = 1 million GB)
Contain billions of parameters, which are fine-tuned during training
GPT - Generative Pre-trained Transformer
BERT - Bidirectional Encoder Representations from Transformers
BART - Bidirectional and Auto-Regressive Transformer
T5 - Text-To-Text Transfer Transformer

Hallucinations
Outputs presented as accurate but seen as unrealistic, inaccurate, or nonsensical by humans. Hallucinations can result in the generation of inaccurate information, the creation of biased views, and wrong input provided to sensitive applications.
Prevent/avoid hallucinations through:
Extensive training with high-quality data
Ongoing evaluation and improvement of the models
Fine-tuning on domain-specific data
Ensuring human oversight
Providing additional context in the prompt

Libraries and Tools in NLP
PyTorch - an open-source deep learning framework, Python based and well known for its ease of use, flexibility, and dynamic computation graphs
TensorFlow - an open-source framework for machine learning and deep learning; provides tools and libraries to facilitate the development and deployment of machine learning models
Keras - tightly integrated with TensorFlow, Keras provides a user-friendly, high-level neural networks API, facilitating rapid prototyping and the building and training of deep learning models
Hugging Face - a platform that offers an open-source library with pre-trained models and tools to streamline the process of training and fine-tuning generative AI models; it offers libraries such as Transformers, Datasets, and Tokenizers
LangChain - an open-source framework that helps streamline AI application development using LLMs; it provides tools for designing effective prompts
Pydantic - a Python library that helps you streamline data handling; it ensures the accuracy of data types and formats before an application processes them

Text Generation before Transformers
Early language models predict what word comes next in a sentence based on the words that came before.
Recurrent Neural Networks (RNNs)
RNNs are specially designed to handle sequential data, making them powerful for applications like language modeling and time-series forecasting.
The essence of their design lies in maintaining a ‘memory’ or ‘hidden state’ throughout the sequence by employing loops.
This enables RNNs to recognize and capture the temporal (time-related) dependencies inherent in sequential data.
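A single loop iteration of this design can be sketched in NumPy. The tanh activation, dimensions, and random weights below are illustrative assumptions, not any specific library's API:

```python
import numpy as np

rng = np.random.default_rng(1)

input_dim, hidden_dim = 3, 5
W_x = rng.normal(size=(input_dim, hidden_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (the "loop")

def rnn_step(x_t, h_prev):
    # The new hidden state combines the current input with the previous state
    return np.tanh(x_t @ W_x + h_prev @ W_h)

h = np.zeros(hidden_dim)                    # initial "memory"
sequence = rng.normal(size=(4, input_dim))  # four time steps of input
for x_t in sequence:
    h = rnn_step(x_t, h)                    # memory is updated at every step

print(h.shape)  # (5,)
```

Each call to rnn_step folds the new input into the hidden state, which is how the network carries information from earlier steps forward.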
Hidden state - often referred to as the network's 'memory', the hidden state is a dynamic storage of information about previous sequence inputs. With each new input, this hidden state is updated, factoring in both the new input and its previous value.
Loops - loops in RNNs enable information transfer across sequence steps.

Illustration of RNN operation
Consider the sentence "I love RNNs". The RNN interprets this sentence word by word:
First, it ingests the word "I", generates an output, and updates its hidden state.
Then it moves to "love"; the RNN processes the new word alongside the hidden state, which ideally holds insights about the word "I", and the hidden state is updated again.
This pattern of processing and updating continues until the last word is reached.

Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs)
Designed to address the limitations of traditional RNNs and enhance their ability to model sequential data effectively. Traditional RNNs were effective for a variety of tasks, but they struggled with long sequences and long-term dependencies.

Seq2seq models with attention
Sequence-to-sequence models - built with RNNs or LSTMs, designed to handle tasks like translation, where an input sentence is transformed into an output sentence.
The attention mechanism was introduced to allow the model to "focus" on relevant parts of the input sequence when generating the output, significantly improving performance on tasks like machine translation.
While these methods provided significant advancements in text generation tasks, the introduction of transformers led to a paradigm shift. Transformers, with their self-attention mechanism, proved to be highly efficient at capturing contextual information across long sequences, setting new benchmarks across various NLP tasks.
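The self-attention mechanism can be sketched in NumPy under simple assumptions (random embeddings and projection matrices, a single attention head): each token's new representation is a weighted average of all tokens, with weights derived from pairwise similarity scores.

```python
import numpy as np

rng = np.random.default_rng(2)

seq_len, d = 3, 4                      # e.g. three tokens with 4-dimensional embeddings
X = rng.normal(size=(seq_len, d))      # token embeddings

# Queries, keys, and values come from learned projections (random here)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Similarity scores between every pair of tokens, scaled by sqrt(d)
scores = Q @ K.T / np.sqrt(d)

# Softmax turns each row of scores into attention weights that sum to 1
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Each output position is a weighted mixture of all value vectors
output = weights @ V
print(output.shape)  # (3, 4): same shape as the input sequence
```

Because every position attends to every other position in one matrix product, the whole sequence is processed in parallel rather than step by step.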
Transformers
Transformers replaced sequential processing with parallel processing. The key component behind their success is the attention mechanism, more precisely self-attention.
Tokenization - breaking the sentence down into tokens
Embedding - each token is represented as a vector, capturing its meaning
Self-attention - the model computes scores determining the importance of every other word for a particular word in the sequence. These scores are used to weight the input tokens and produce a new representation of the sequence.
Feed-forward neural networks - after attention, each position is passed through a feed-forward network separately
Output sequence - the model produces an output sequence, which can be used for various tasks, like classification, translation, or text generation
Layering - importantly, transformers are deep models with multiple layers of attention and feed-forward networks, allowing them to learn complex patterns

Implementation - Building a simple chatbot with transformers
We build a simple chatbot using the transformers library from Hugging Face, an open-source NLP toolkit.
Step 1: Installing the libraries
!pip install -qq tensorflow
!pip install -qq transformers
!pip install sentencepiece
!pip install torch==2.2.2
!pip install torchtext==0.17.2
#!pip install --upgrade numpy transformers torch
Step 2: Importing required tools from the transformers library
In the code below, we import two important classes from the transformers library.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Selecting the model. You will be using "facebook/blenderbot-400M-distill" in this example.
model_name = "facebook/blenderbot-400M-distill"
# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model is an instance of the class AutoModelForSeq2SeqLM. This class lets you interact with your chosen language model.
tokenizer is an instance of the class AutoTokenizer. This class streamlines your input and presents it to the language model in the most efficient manner. It achieves this by converting your text inputs into "tokens", which is the model's preferred way of interpreting text.
We have chosen "facebook/blenderbot-400M-distill" for this example because it is freely available under an open-source license and operates at a relatively brisk pace. For other models and their capabilities, explore:
Following the initialization, let's set up the chat function to enable real-time interaction with the chatbot.
# Define the chat function
def chat_with_bot():
    while True:
        # Get user input
        input_text = input("You: ")

        # Exit conditions
        if input_text.lower() in ["quit", "exit", "bye"]:
            print("Chatbot: Goodbye!")
            break

        # Tokenize input and generate response
        inputs = tokenizer.encode(input_text, return_tensors="pt")
        outputs = model.generate(inputs, max_new_tokens=150)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

        # Display bot's response
        print("Chatbot:", response)

# Start chatting
chat_with_bot()
Now the chatbot takes the input prompt and uses the transformers library and the underlying model to generate a response.
Step 3: Trying another language model and comparing the output
In the code below, we use a different language model, the "google/flan-t5-base" model from Google, to create a similar chatbot.
import sentencepiece
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
### Let's chat with another bot
def chat_with_another_bot():
    while True:
        # Get user input
        input_text = input("You: ")

        # Exit conditions
        if input_text.lower() in ["quit", "exit", "bye"]:
            print("Chatbot: Goodbye!")
            break

        # Tokenize input and generate response
        inputs = tokenizer.encode(input_text, return_tensors="pt")
        outputs = model.generate(inputs, max_new_tokens=150)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

        # Display bot's response
        print("Chatbot:", response)

# Start chatting
chat_with_another_bot()
Tokenization
Tokenization breaks a sentence into smaller pieces, or tokens. Libraries such as NLTK and spaCy provide tokenizers that generate these tokens.
Word-based - preserves semantic meaning
Character-based - smaller vocabularies, but tokens may not convey the same information as entire words
Sub-word-based - allows frequently used words to stay unsplit while breaking down infrequent words; can be implemented using the WordPiece, Unigram, and SentencePiece algorithms
You can add special tokens such as <bos> at the beginning and <eos> at the end of a tokenized sentence.

Tokenization and Indexing in PyTorch
Use the torchtext library for tokenization.
The build_vocab_from_iterator function creates a vocabulary from the tokens and assigns each token a unique index.

Implementing Tokenization
Libraries needed
nltk - the Natural Language Toolkit
Employed for data management tasks
Offers comprehensive tools and resources for natural language text, making it a valuable choice for tasks such as text preprocessing and analysis

spaCy - an open-source software library for advanced natural language processing in Python
Known for its speed and accuracy in processing large volumes of text data

BertTokenizer - part of the Hugging Face Transformers library
Specifically designed for tokenizing text according to the BERT model's specifications

XLNetTokenizer - part of the Hugging Face Transformers library
Tailored for tokenizing text in alignment with the XLNet model's requirements

torchtext - part of the PyTorch ecosystem
Simplifies the process of working with text data and provides functionalities for data preprocessing, tokenization, vocabulary management, and batching

Installing required libraries
!pip install nltk
!pip install transformers
!pip install sentencepiece
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
!pip install numpy scikit-learn
!pip install torch==2.2.2
!pip install torchtext==0.17.2
Importing required libraries
import nltk
nltk.download("punkt")
nltk.download('punkt_tab')
import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from transformers import BertTokenizer
from transformers import XLNetTokenizer
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
# Suppress library warnings for cleaner output
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
Word-Based Tokenizer - nltk
Word-based tokenization splits text on words. There are different rules for word-based tokenizers, such as splitting on spaces or splitting on punctuation. Each option assigns a specific ID to the split word. In the following example, we use nltk's word_tokenize:
text = "This is a sample sentence for word tokenization"
tokens = word_tokenize(text)
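Continuing the example, the resulting tokens can be mapped to unique indices. The snippet below is a plain-Python sketch of the idea behind torchtext's build_vocab_from_iterator; the special tokens and their ordering are illustrative assumptions.

```python
# Tokens as produced by word_tokenize on the sample sentence above
tokens = ['This', 'is', 'a', 'sample', 'sentence', 'for', 'word', 'tokenization']

# Reserve special tokens first, then give every distinct token a unique index
specials = ['<unk>', '<bos>', '<eos>']
vocab = {tok: idx for idx, tok in enumerate(specials)}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

# Map the sentence to indices, falling back to <unk> for unseen tokens
indices = [vocab.get(tok, vocab['<unk>']) for tok in tokens]
print(indices)  # [3, 4, 5, 6, 7, 8, 9, 10]
```

This index mapping is what lets downstream PyTorch models consume text as tensors of integers rather than strings.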