Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant. It encompasses a wide range of tasks, from text classification and sentiment analysis to more complex ones such as machine translation and question answering.
The NLP pipeline typically involves several stages:
Text Preprocessing
Tokenization
Linguistic Analysis
Feature Extraction
Modeling and Prediction
Evaluation and Optimization
Regular Expressions
Regular expressions (Regex) are an essential tool in natural language processing, used to identify specific patterns in text. Regex provides a concise and flexible way to search, match, and manipulate strings. For example, in preprocessing tasks, Regex can be used to extract dates, phone numbers, or specific word patterns such as email addresses from large volumes of text. A regular expression is built from a combination of symbols that together form a search pattern, which makes it useful for tokenizing text, cleaning unwanted characters from data, and splitting text into parts as needed.
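As a minimal sketch, the following Python snippet uses the built-in re module to pull email addresses, phone numbers, and dates out of a string. The text and patterns are illustrative only; the expressions are simplified compared to what robust, production-grade patterns would need.

```python
import re

text = "Contact us at support@example.com or call 555-123-4567 by 2024-01-15."

# Simplified illustrative patterns for common preprocessing extractions
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
dates  = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(emails)  # ['support@example.com']
print(phones)  # ['555-123-4567']
print(dates)   # ['2024-01-15']
```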
Parsing with CFGs
A context-free grammar (CFG) is a list of rules in which the left-hand side (LHS) is a syntactic category and the right-hand side (RHS) describes what it can be substituted with. CFGs can also be recognized by pushdown automata (PDAs): finite automata augmented with a "last in, first out" stack, which gives them more flexibility with recursive rules than deterministic finite automata (DFAs) and non-deterministic finite automata (NFAs).
For example: S → XS, X → a
Typically, a CFG consists of variables like ‘X’ in the example above, which can be substituted to form a string with terminals like ‘a’. ‘S’ is the starting variable for a CFG.
To parse the syntax (the structure) of a natural language such as English or French, we need to use constituents. A constituent is a group of words that behaves as a single unit in syntactic analysis, for example a noun phrase such as "a dog" or "the cat".
The following is a list of syntactic categories, consisting of constituents used in a CFG for parsing natural languages:
np - noun phrase
vp - verb phrase
s - sentence
det - determiner
n - noun
tv - transitive verb
iv - intransitive verb
prep - preposition
pp - prepositional phrase
adj - adjective
The reason CFGs are appropriate for formally analyzing the structure of a natural language is that they support recursive rules, in which a category can expand into a phrase containing that same category (for example, np → det adj n combined with adj → adj adj), which can generate phrases such as "the good dog", "the really good dog", and "the really good, cute dog". The other reason CFGs are appropriate is that regular languages are too weak to represent natural languages, while Turing-equivalent formalisms are too powerful.
CFG:
S → np vp
np → det n
vp → tv np | iv
det → the | a | an
n → giraffe | apple
iv → dreams
tv → eats | dreams
Interactive simulation for parsing of "the giraffe dreams": step through the derivation from S in the original interactive demo.
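As a sketch of how such a grammar can be parsed programmatically, the snippet below uses the NLTK library (an assumption: it is not prescribed by this section, and must be installed separately). The grammar mirrors the CFG above, with capitalized category names as NLTK's notation expects.

```python
import nltk
from nltk import CFG

# The same toy grammar as above, written in NLTK's notation
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> TV NP | IV
Det -> 'the' | 'a' | 'an'
N -> 'giraffe' | 'apple'
IV -> 'dreams'
TV -> 'eats' | 'dreams'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the giraffe dreams".split()):
    # Only the intransitive reading of "dreams" yields a complete parse here
    tree.pretty_print()
```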
Goals of linguistic grammar
Permit ambiguity - a valid string generated by the CFG may have multiple parse trees (derivations)
Limit ungrammaticality - the syntax must be grammatically correct for the natural language
Ensure meaningfulness - recognize synonymy, e.g., that words like "huge" and "large" have similar meanings
NLP vs PLP
CFGs are used with strictly defined syntax in programming language processing (PLP) to compile languages like C, C++, and Java. However, they are also used for natural language processing (NLP); the following comparison shows the main differences in parsing the two:
1. Domain of discourse - NLP: broad (what can be expressed); PLP: narrow (what can be computed)
2. Lexicon - NLP: large and complex; PLP: small and simple
3. Grammatical constructs - NLP: many and varied (declarative, interrogative, fragments, etc.); PLP: few (declarative, imperative)
4. Meanings of an expression - NLP: many; PLP: one
5. Tools and techniques - NLP: morphological analysis, syntactic analysis, semantic analysis, and integration of world knowledge
Parsing transforms natural language into a more structured representation, helping machines understand human language. One prominent type of parsing in NLP is syntactic parsing:
Shallow Parsing (Chunking): It breaks down text into its constituent parts of speech, usually not deeper than the phrase level, grouping together nouns with their descriptors or verbs with their associated phrases.
Constituency Parsing: It builds a parse tree for a sentence that represents the syntactic structure according to a formal grammar. Each node in the tree represents a constituent (e.g., noun phrase, verb phrase).
Dependency Parsing: It focuses on the relationships between words in a sentence. It builds a tree structure based on the dependencies between words, where each node is connected to a 'head'. Dependency parsers are important in understanding the grammatical structure of a sentence and the relationships between its elements, making tasks like information extraction and syntactic reordering for machine translation achievable.
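As a brief illustrative sketch of dependency parsing, the snippet below uses spaCy and its small English model en_core_web_sm (both are assumptions for illustration and must be installed separately):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The giraffe eats an apple")

# Each token is linked to a head token by a labeled dependency relation
for token in doc:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")
```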
Tokenization
Tokenization is the process of splitting text into meaningful elements called tokens. In NLP, tokens typically represent words, but they can also include punctuation and other characters depending on the granularity required. Tokenization is essential as it is the first step in converting unstructured text into a form that can be analyzed. It's important for tasks like sentiment analysis, where identifying and analyzing individual words is crucial. Techniques vary from simple white-space tokenization to more complex methods capable of recognizing multi-word expressions and handling abbreviations or contractions.
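For illustration, here is a minimal sketch comparing naive whitespace tokenization with NLTK's word_tokenize (an assumption: NLTK and its punkt tokenizer data must be installed for this to run):

```python
from nltk.tokenize import word_tokenize

text = "Dr. Smith doesn't like the giraffe's apples."

# Naive whitespace splitting leaves punctuation and contractions attached
print(text.split())
# e.g. ['Dr.', 'Smith', "doesn't", 'like', 'the', "giraffe's", 'apples.']

# A tokenizer aware of abbreviations and contractions splits more carefully
print(word_tokenize(text))
# e.g. ['Dr.', 'Smith', 'does', "n't", 'like', 'the', 'giraffe', "'s", 'apples', '.']
```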
POS Tagging (Part-of-Speech Tagging)
After tokenization, POS tagging assigns grammatical parts of speech to each token, such as noun, verb, adjective, etc. This tagging is important for understanding the syntactic structure of sentences and plays an essential role in disambiguating homonyms (words that have the same spelling but different meanings) based on their usage in a sentence. For example, "run" can be a verb ("I will run") or a noun ("a run in my stockings").
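A small sketch using NLTK's pos_tag (an assumption: the tagger's data package must be downloaded) shows how the same word can receive different tags depending on context:

```python
import nltk

# "run" used as a verb
print(nltk.pos_tag(nltk.word_tokenize("I will run to the store")))
# "run" is expected to be tagged as a verb (VB) here

# "run" used as a noun
print(nltk.pos_tag(nltk.word_tokenize("I found a run in my stockings")))
# "run" is expected to be tagged as a noun (NN) here
```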
Lemmatization
Lemmatization is about reducing words to their base or dictionary form (lemma). Unlike stemming, which just removes word endings, lemmatization considers the context and transforms words based on their actual morphological analysis, making sure that the root word (lemma) belongs to the language. For instance, "better" is lemmatized to "good," and "ran" to "run." Lemmatization is particularly useful in building language models and search engines, where different forms of a word need to be treated as the same term to improve search relevance and performance.
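A minimal sketch with NLTK's WordNet lemmatizer (an assumption: the wordnet data package must be available); note that supplying the part of speech matters:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("ran", pos="v"))     # run
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("apples"))           # apple (default part of speech is noun)
```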
Models for Text Representation
Computers need to represent text as a collection of numbers (typically a vector) in order to effectively work with it. We discuss two ways of representing text for NLP.
Bag of words
In short, a bag of words is a data structure that keeps count of the number of occurrences of each word in some text. We can use it to determine the most frequently occurring words in the text, which can support semantic analysis. To convert some text into a vector using this method, we use the frequency counts of the vocabulary words as the entries. As an example, if our language consists of the words {car, cloud}, the sentences "car car car car cloud" and "cloud car cloud cloud car car" could be encoded as [4, 1] and [3, 3] respectively.
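As a minimal sketch in pure Python, using the toy two-word vocabulary from the example above:

```python
from collections import Counter

vocabulary = ["car", "cloud"]

def bag_of_words(text, vocab):
    """Encode text as a vector of word counts over the vocabulary."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

print(bag_of_words("car car car car cloud", vocabulary))          # [4, 1]
print(bag_of_words("cloud car cloud cloud car car", vocabulary))  # [3, 3]
```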
Embeddings
An embedding is a representation of some text as a vector. The term is often used to refer to methods that use neural networks to create those representations.
When converting text to a vector, we want similar text to be “close” together and dissimilar text to be far apart. There are many different ways of measuring distance between vectors. A simple method is the cosine similarity, which is related to the angle between the vectors. A more acute angle means the vectors are more similar to each other. We will use a SentenceTransformer model to explore this topic.
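As a quick sketch of cosine similarity itself, using NumPy and reusing the bag-of-words vectors from the earlier example:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(np.array([4, 1]), np.array([3, 3])))  # ~0.86
```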
In the images below, three coordinates of embeddings of different pairs of texts are visualized; you can interact with the demo in the Basic Text Embedding section. We can see that similar texts point in the same direction, and less similar texts point in different directions.
To create an embedding model, the neural network’s parameters are initialized to random values. The network is then trained on texts that should be similar, tuning those parameters to achieve that objective. Many models trained to embed text in this way already exist.
Once an embedding model has been trained, we can use it to embed text and get information about the semantic similarity between pieces of text.
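A brief sketch of this workflow with the sentence-transformers library is shown below; the model name all-MiniLM-L6-v2 is just one commonly used choice and is an assumption, not something prescribed by this section:

```python
from sentence_transformers import SentenceTransformer, util

# Load a pretrained embedding model (assumed example model name)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The giraffe eats an apple",
    "A giraffe is eating fruit",
    "The compiler reported a syntax error",
]
embeddings = model.encode(sentences)

# Higher cosine similarity means the texts point in more similar directions
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # expected: relatively high
print(util.cos_sim(embeddings[0], embeddings[2]).item())  # expected: relatively low
```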
One way we can encode texts that arise from a context-free grammar is to look at the derivations of the texts. If the derivation rules are similar, then the texts likely have some similarity in structure that might relate to meaning. This is even more true for context-sensitive languages. We can think of the parameters of the neural network as learning those key connections between strings derived in a language.