Natural Language Processing

Fady Hanna, Joshua Mark & Ninad Patil

Introduction

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant. It encompasses a wide range of tasks, from basic tasks like text classification and sentiment analysis to more complex tasks such as machine translation and question answering.

The NLP pipeline typically involves several stages:

Text Preprocessing

Tokenization

Linguistic Analysis

Feature Extraction

Modeling and Prediction

Evaluation and Optimization

Regular Expressions

Regular expressions (regex) are an important tool in natural language processing, used to identify specific patterns in text. A regex provides a concise and flexible way to search, match, and manipulate strings. For example, in preprocessing tasks, regexes can be used to extract dates, phone numbers, or specific patterns such as email addresses from large volumes of text. A regex combines literal characters and special symbols into a search pattern, which is useful for tokenizing text, removing unwanted characters, and splitting text into parts as needed.
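As a minimal sketch of the preprocessing uses described above, the following uses Python's standard `re` module. The sample text and the three patterns are illustrative assumptions; they are deliberately simple and would miss many real-world email, date, and phone formats.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = (
    "Contact alice@example.com or bob.smith@mail.example.org "
    "before 2024-03-15. Call 555-123-4567 or 555-987-6543."
)

# Simplified patterns: each one covers only a single common format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")    # YYYY-MM-DD only
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")   # NNN-NNN-NNNN only

emails = EMAIL.findall(text)
dates = DATE.findall(text)
phones = PHONE.findall(text)


def clean(raw):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", raw)
    return re.sub(r"\s+", " ", no_punct).strip()


print(emails)
print(dates)
print(phones)
print(clean("Hello,   world!!  NLP?"))
```

Compiling the patterns once and reusing them keeps the extraction code short; in practice each pattern would need to be widened or replaced to match the formats that actually occur in the data.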


Parsing with CFGs

A context-free grammar (CFG) is a set of production rules in which the left-hand side (LHS) is a single syntactic category (a variable) and the right-hand side (RHS) is the sequence of variables and terminals that may be substituted for it. The languages generated by CFGs are exactly those recognized by pushdown automata (PDAs). A PDA is a finite automaton extended with a stack (“last in, first out”), which lets it handle recursive rules that deterministic finite automata (DFAs) and non-deterministic finite automata (NFAs) cannot.

For example: S → XS | X, X → a (this grammar generates the strings a, aa, aaa, and so on)

Typically, a CFG consists of variables like ‘X’ in the example above, which can be substituted to form a string with terminals like ‘a’. ‘S’ is the starting variable for a CFG.

To parse the syntax (the structure) of a natural language such as English or French, we need to use constituents. A constituent is a group of words that behaves as a single unit in syntactic analysis, for example a noun phrase such as “a dog” or “the cat”.

The following is a list of syntactic categories, consisting of constituents used in a CFG for parsing natural languages:

np - noun phrase

vp - verb phrase

s - sentence

det - determiner

n - noun

tv - transitive verb

iv - intransitive verb

prep - preposition

pp - prepositional phrase

adj - adjective

CFGs are appropriate for formally analyzing the structure of a natural language for two reasons. First, they support recursive rules, in which a category appears on both sides of a rule. For example, a recursive rule such as adj → adj adj lets an adjective position expand to any number of modifiers, so a rule like np → det adj n can give us phrases such as “the good dog”, “the really good dog”, and “the really good, cute dog”. Second, regular languages are too weak to represent natural languages, while Turing-equivalent formalisms are more powerful than needed.

CFG:

S → np vp

np → det n

vp → tv np | iv

det → the | a | an

n → giraffe | apple

iv → dreams

tv → eats | dreams
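As a rough sketch, the grammar above can be written as a Python dictionary and parsed top-down by trying each rule in turn. The function names (`parse`, `expand`, `parse_sentence`) and the tuple-based tree format are assumptions made for this illustration. The approach works here because the grammar has no left-recursive rules; a left-recursive grammar would make this simple parser loop forever.

```python
# The CFG above: each category maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["np", "vp"]],
    "np": [["det", "n"]],
    "vp": [["tv", "np"], ["iv"]],
    "det": [["the"], ["a"], ["an"]],
    "n": [["giraffe"], ["apple"]],
    "iv": [["dreams"]],
    "tv": [["eats"], ["dreams"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` derives tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:  # terminal: must match the current token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # variable: try every rule for it
        yield from expand(rhs, tokens, pos, (symbol,))


def expand(rhs, tokens, pos, partial):
    """Match the symbols of `rhs` left to right, extending the partial tree."""
    if not rhs:
        yield partial, pos
        return
    for subtree, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(rhs[1:], tokens, next_pos, partial + (subtree,))


def parse_sentence(sentence):
    """Return every parse tree that covers the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


for tree in parse_sentence("the giraffe dreams"):
    print(tree)
```

For “the giraffe dreams”, the parser first tries vp → tv np, which fails because no noun phrase follows “dreams”, and then succeeds with vp → iv. A word order the grammar does not allow, such as “giraffe the dreams”, yields no trees.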

[Interactive simulation: step-by-step parse of “the giraffe dreams” (Steps 1 to 5)]


Goals of linguistic grammar

Permit ambiguity - a valid string generated by the CFG may have multiple parse trees

Limit ungrammaticality - the syntax must be grammatically correct for the natural language

Ensure meaningfulness - recognize synonymy, with words like “huge” and “large” having similar meanings

NLP vs PLP

CFGs are used in programming language processing (PLP), where the syntax is strictly defined, to compile languages such as C, C++, and Java. They are also used in natural language processing (NLP). The following table shows the main differences between parsing the two:

Characteristic | NLP | PLP
domain of discourse | broad: what can be expressed | narrow: what can be computed
lexicon | large/complex | small/simple
grammatical constructs | many and varied: declarative, interrogative, fragments, etc. | few: declarative, imperative
meanings of an expression | many | one
tools and techniques | morphological analysis, syntactic analysis, semantic analysis, integration of world knowledge | lexical analysis, context-free parsing, code generation/compiling, interpreting

Parsing in NLP

Parsing transforms natural language into a more structured representation, helping machines understand human language. A widely used type of parsing in NLP is syntactic parsing, which comes in several forms:

Shallow Parsing (Chunking): It groups the words of a sentence into flat, non-overlapping phrases (chunks), usually going no deeper than the phrase level, for example grouping nouns with their descriptors or verbs with their associated words.


Constituency Parsing: It builds a parse tree for a sentence that represents the syntactic structure according to a formal grammar. Each node in the tree represents a constituent (e.g., noun phrase, verb phrase).


Dependency Parsing: It focuses on the relationships between words in a sentence. It builds a tree in which every word except the root is attached to exactly one other word, its 'head'. Dependency parsers are important for understanding the grammatical structure of a sentence and the relationships between its elements, which supports tasks like information extraction and syntactic reordering for machine translation.
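To make the difference between these outputs more concrete, here is a small sketch of how a chunker's flat result and a dependency parse might be represented for one sentence. Everything in it is an assumption for illustration: the part-of-speech tags and head indices are written by hand rather than produced by a parser, the chunking rule (optional determiner, adjectives, then nouns) is a toy, and the relation labels are only in the style of common dependency annotation schemes.

```python
# Hand-tagged sentence (tags are assumed, not predicted by a model).
tagged = [("the", "DET"), ("giraffe", "NOUN"), ("eats", "VERB"),
          ("an", "DET"), ("apple", "NOUN")]


def chunk_noun_phrases(tagged_tokens):
    """Toy chunker: group (DET)? (ADJ)* (NOUN)+ runs into flat NP chunks."""
    chunks, i = [], 0
    while i < len(tagged_tokens):
        j = i
        if tagged_tokens[j][1] == "DET":
            j += 1
        while j < len(tagged_tokens) and tagged_tokens[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tagged_tokens) and tagged_tokens[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun was found: emit an NP chunk
            chunks.append(("NP", [word for word, _ in tagged_tokens[i:k]]))
            i = k
        else:      # no noun phrase starts here: keep the single token
            chunks.append((tagged_tokens[i][1], [tagged_tokens[i][0]]))
            i += 1
    return chunks


# Dependency view: heads[i] is the index of the head of word i; -1 is the root.
words = [word for word, _ in tagged]
heads = [1, 2, -1, 4, 2]
labels = ["det", "nsubj", "root", "det", "obj"]  # illustrative labels


def dependents(heads, i):
    """Indices of the words whose head is word i."""
    return [j for j, head in enumerate(heads) if head == i]


def is_tree(heads):
    """Check for exactly one root and no cycles when following heads."""
    if heads.count(-1) != 1:
        return False
    for start in range(len(heads)):
        seen, node = set(), start
        while node != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


print(chunk_noun_phrases(tagged))
print([words[j] for j in dependents(heads, 2)])
```

The chunker returns a flat list of phrases, while the head list encodes a tree rooted at “eats” whose dependents are “giraffe” and “apple”.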


Tokenization

Tokenization is the process of splitting text into meaningful elements called tokens. In NLP, tokens usually correspond to words, but they can also include punctuation and other characters, depending on the granularity required. Tokenization is essential because it is the first step in converting unstructured text into a form that can be analyzed. It matters for tasks like sentiment analysis, where identifying and analyzing individual words is crucial. Techniques range from simple whitespace tokenization to more complex methods that can recognize multi-word expressions and handle abbreviations or contractions.
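The gap between whitespace tokenization and a slightly more careful approach can be sketched with Python's standard library. The sample sentence and the `TOKEN` pattern are assumptions chosen for illustration; the pattern keeps contractions and decimal numbers together and splits off punctuation, but it does not handle abbreviations such as “Dr.” specially.

```python
import re

# Hypothetical sample sentence for this illustration.
text = "Dr. Smith didn't pay $9.99, did he?"

# Simplest approach: split on whitespace. Punctuation stays glued to words.
whitespace_tokens = text.split()

# Regex approach: a number with optional decimals, OR a word with optional
# internal apostrophes, OR any single non-space punctuation character.
TOKEN = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)*|[^\w\s]")
regex_tokens = TOKEN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting produces tokens such as `$9.99,` and `he?`, while the regex version separates the punctuation and keeps `didn't` and `9.99` intact.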


POS Tagging (Part-of-Speech Tagging)

After tokenization, POS tagging assigns a grammatical part of speech, such as noun, verb, or adjective, to each token. This tagging is important for understanding the syntactic structure of sentences and plays an essential role in disambiguating homonyms (words that have the same spelling but different meanings) based on their usage in a sentence. For example, "run" can be a verb ("I will run") or a noun ("a run in my stockings").
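The "run" example can be sketched with a toy tagger that looks each word up in a small lexicon and uses the previous tag to resolve ambiguity. The lexicon, the tag names, and the single context rule are all assumptions for illustration; practical taggers are typically statistical or neural models trained on annotated text.

```python
# Tiny hand-written lexicon: each word maps to the tags it can take.
LEXICON = {
    "i": {"PRON"}, "will": {"AUX"}, "a": {"DET"}, "my": {"DET"},
    "in": {"ADP"}, "stockings": {"NOUN"}, "run": {"NOUN", "VERB"},
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to break ties."""
    tags = []
    for token in tokens:
        options = LEXICON.get(token.lower(), {"NOUN"})  # unknown words: NOUN
        previous = tags[-1] if tags else None
        if len(options) == 1:
            choice = next(iter(options))
        elif previous == "DET" and "NOUN" in options:
            choice = "NOUN"   # after a determiner, prefer the noun reading
        elif "VERB" in options:
            choice = "VERB"   # otherwise prefer the verb reading
        else:
            choice = sorted(options)[0]
        tags.append(choice)
    return list(zip(tokens, tags))


print(tag("I will run".split()))
print(tag("a run in my stockings".split()))
```

The same word receives different tags in the two sentences because the rule looks at the tag assigned to the preceding token.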


Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma). Unlike stemming, which simply strips word endings, lemmatization takes context (such as the word's part of speech) and morphological analysis into account, making sure that the resulting lemma is an actual word of the language. For instance, "better" is lemmatized to "good," and "ran" to "run." Lemmatization is particularly useful in building language models and search engines, where different forms of a word need to be treated as the same term to improve search relevance and performance.
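The contrast with stemming can be sketched as follows. Both functions are toy assumptions: `naive_stem` strips a few common suffixes, and the `LEMMAS` table stands in for the morphological dictionary a real lemmatizer would consult.

```python
def naive_stem(word):
    """Strip a common suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("mice", "NOUN"): "mouse",
}


def lemmatize(word, pos):
    """Return the dictionary form if known, else the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


for word, pos in [("running", "VERB"), ("better", "ADJ"), ("ran", "VERB")]:
    print(word, naive_stem(word), lemmatize(word, pos))
```

The stemmer turns "running" into the non-word "runn" and leaves "better" and "ran" unchanged, while the lemmatizer maps all three to dictionary forms.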


Models for Text Representation

Computers need to represent text as a collection of numbers (typically a vector) in order to effectively work with it. We discuss two ways of representing text for NLP.

Bag of words

In short, a bag of words is a data structure that keeps count of the number of occurrences of each word in some text, ignoring word order. We can use it to find the most frequent words in the text, which can serve as a simple signal for analyzing what the text is about. To convert some text into a vector using this method, we fix an ordering of the vocabulary and use the count of each vocabulary word as the corresponding entry. As an example, if our vocabulary consists of the words {car, cloud}, the sentences “car car car car cloud” and “cloud car cloud cloud car car” would be encoded as [4, 1] and [3, 3] respectively.
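A minimal sketch of this encoding, using `collections.Counter` from the Python standard library. The function name `bag_of_words` and the whitespace-based splitting are assumptions made for this illustration.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count each vocabulary word in `text`, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["car", "cloud"]
first = bag_of_words("car car car car cloud", vocabulary)
second = bag_of_words("cloud car cloud cloud car car", vocabulary)

print(first)   # [4, 1]
print(second)  # [3, 3]
```

Because only counts are kept, any reordering of the words in a sentence produces the same vector.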

Embeddings

An embedding is a representation of some text as a vector. The term is often used to refer to methods that use neural networks to create those representations.

When converting text to a vector, we want similar text to be “close” together and dissimilar text to be far apart. There are many different ways of measuring distance between vectors. A simple measure is the cosine similarity, which is the cosine of the angle between the vectors. A smaller angle means the vectors are more similar to each other. We will use a SentenceTransformer model to explore this topic.
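Cosine similarity itself needs no model: for vectors u and v it is their dot product divided by the product of their lengths. The sketch below computes it in plain Python on small hand-written vectors, including the count vectors [4, 1] and [3, 3]; it does not call a SentenceTransformer model, and the function name is an assumption for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Return cos(angle between u and v) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(round(cosine_similarity([4, 1], [3, 3]), 4))  # 0.8575
print(cosine_similarity([1, 0], [0, 1]))            # 0.0 (perpendicular)
```

Vectors that point in the same direction score close to 1 regardless of their lengths, which is why the measure is often preferred over raw distance when comparing embeddings.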

In the images below, three coordinates of the embeddings of different pairs of texts are visualized. You can interact with the demo in the Colab notebook (from the Basic Text Embedding section). We can see that similar texts point in the same direction, and less similar texts point in different directions.

[Images: embedding vectors for different pairs of texts, three coordinates shown]

To create an embedding model, the neural network’s parameters are initialized to random values. The network is then trained on examples of text that should be similar, tuning those parameters so that the embeddings of similar texts end up close together. Many models currently exist that have been trained to embed text like this.

Once an embedding model has been trained, we can use it to embed text and get information about the semantic similarity between pieces of text.

One way we can encode texts that arise from a context-free grammar is to look at the derivations of the texts. If the derivation rules are similar, then the texts likely have some similarity in structure that might relate to meaning. This is even more true for context-sensitive languages. We can think of the parameters of the neural network as learning those key connections between strings derived in a language.