Topic Modeling Algorithms

Topic modeling algorithms assume that every document is composed of one or more topics, and that every topic is composed of some combination of words. The model then assesses the underlying data to discover the groups of words that best describe each document under those constraints. Topic modeling comprises a set of techniques for discovering and summarizing large quantities of text quickly, in a way that leads to comprehension and insight.
Our objectives can be met by working through the following steps:
Preprocess, clean, and vectorize transcripts
Dimensionality reduction for text vectors
Soft-clustering algorithms to assign multiple topics to one group of transcripts
Visualization and metrics to evaluate topic clustering performance

Text Data Preparation


Topic Modeling Preprocessing

Tokenization: a tokenizer breaks a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens.
Remove punctuation using regular expressions
Lower casing using regular expressions
Explore and remove redundant words and stop words
Remove meaningless rare words or low-frequency words
Stemming and lemmatization
Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
Words with 3 or fewer characters are removed.
All stopwords are removed.
Words are lemmatized: words in third person are changed to first person and verbs in past and future tenses are changed into the present tense.
Words are stemmed: words are reduced to their root form.
# Perform lemmatization and stemming preprocessing steps on the data set.
import gensim
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer

nltk.download('wordnet')
stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Drop stop words and very short tokens before lemmatizing and stemming.
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

# Select a document to preview after preprocessing.
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

Convert Text to Numerical Representations

Both Bag of Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency) are methods for converting text data into numeric representations, making the text suitable for various types of modeling in machine learning and natural language processing.

Bag of Words

Create a dictionary from 'processed_docs' containing the number of times each word appears in the training set.
For each document, we then create a bag-of-words representation reporting which words appear and how many times they appear. Save this to 'bow_corpus', then check the document we selected earlier.
dictionary = gensim.corpora.Dictionary(processed_docs)
# Preview the first few (token id, token) pairs in the dictionary.
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break
# Gensim doc2bow
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
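To check the document we selected earlier, we can look up its entry in bow_corpus and map each token id back through the dictionary. This is a small sketch; it assumes position 4310 in bow_corpus lines up with the dataframe index used in the preprocessing preview.
# Preview the bag-of-words representation of the previously selected document.
bow_doc_4310 = bow_corpus[4310]
for token_id, token_count in bow_doc_4310:
    print('Word {} ("{}") appears {} time(s).'.format(token_id, dictionary[token_id], token_count))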

TF-IDF

Create a TF-IDF model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf', then apply the transformation to the entire corpus and call it 'corpus_tfidf'. Finally, preview the TF-IDF scores for the first document.
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

Word & Document Embedding

Word2Vec and GloVe are techniques that learn to represent words in a high-dimensional vector space where similar words have similar representations. This is achieved by considering the context in which words appear.

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
# Sample sentences
sentences = ["this is a sample sentence", "word embeddings are useful", "Gensim provides Word2Vec"]
# Tokenization of sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Training the Word2Vec model
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=2)
# Get vector for a word
vector = model.wv['sample']
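
GloVe is mentioned above but not shown. Pretrained GloVe vectors can be loaded through gensim's downloader API; this sketch assumes the gensim-data package and an internet connection to fetch the 'glove-wiki-gigaword-100' vectors.
import gensim.downloader as api
# Load 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove_vectors = api.load('glove-wiki-gigaword-100')
# Look up a word vector and its nearest neighbours in the embedding space.
vector = glove_vectors['sample']
print(glove_vectors.most_similar('sample', topn=5))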

Doc2Vec extends the idea of Word2Vec to documents. Instead of learning feature representations for words, it learns them for sentences or documents. Python example with Doc2Vec in Gensim:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
# Sample sentences
documents = ["this is a sample sentence", "document embeddings are useful", "Gensim provides Doc2Vec"]
# Tagging documents
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(documents)]
# Training the Doc2Vec model
model = Doc2Vec(tagged_data, vector_size=100, window=5, min_count=1, workers=2)
# Get vector for a document
vector = model.infer_vector(word_tokenize("sample sentence"))

Latent Semantic Analysis (LSA)

LSA transforms text into numeric vectors through a process involving the creation of a document-term matrix (which is a numeric representation) and then applying Singular Value Decomposition (SVD) to this matrix.
The result is a set of vectors for each document and term in a lower-dimensional space, effectively capturing the key themes or concepts in the data.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["this is a sample document", "LSA uses singular value decomposition", "text analysis with LSA"]

# Create a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(documents)

# Perform LSA
lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)
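
To read the LSA components as rough topics, we can map each component's largest weights back to vocabulary terms. A minimal sketch, assuming scikit-learn 1.0+ for get_feature_names_out:
import numpy as np
# Show the strongest-weighted terms in each LSA component.
terms = tfidf_vectorizer.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top_idx = np.argsort(np.abs(component))[::-1][:5]
    print('Component {}: {}'.format(i, [terms[j] for j in top_idx]))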


Transformer Models

BERT and other transformer-based models convert text into numeric vectors using a process called embedding. Each word or token is mapped to a vector in a high-dimensional space. The model then processes these vectors using its neural network layers to capture contextual relationships, resulting in a new set of vectors that encapsulate the semantic meaning of words, sentences, or entire documents.
from transformers import BertModel, BertTokenizer
import torch
# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Sample text
text = "Here is some text to encode"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
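
The model output is a sequence of token vectors. To get a single vector per document for clustering or topic modeling, one common option (a sketch, not the only choice; the [CLS] token vector is another) is to mean-pool the last hidden state:
# output.last_hidden_state has shape (batch_size, sequence_length, hidden_size).
# Mean-pool over the token dimension to get one fixed-size vector per input text.
sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased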

Dimensionality Reduction Techniques

t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are techniques for reducing the dimensionality of data to make it easier to visualize.
While t-SNE and UMAP themselves do not directly convert text into numeric vectors, they are used to reduce the dimensionality of existing high-dimensional numeric vectors. They are typically applied to embeddings or other forms of high-dimensional text representations to produce a 2D or 3D representation for visualization or further analysis.
from sklearn.manifold import TSNE
# Assuming X is your high-dimensional data matrix.
# Note: t-SNE's perplexity (default 30) must be smaller than the number of samples.
X_embedded = TSNE(n_components=2).fit_transform(X)
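
UMAP follows the same fit/transform pattern via the umap-learn package. A sketch under the assumption that umap-learn is installed and X has enough rows; for the tiny toy corpora above, both reducers would need their neighbourhood/perplexity parameters lowered.
import umap
# Reduce the same high-dimensional matrix X to 2 dimensions with UMAP.
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)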

Topic Modeling


A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts
A Survey of Topic Modeling in Text Mining

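As a baseline among these algorithms, LDA can be run directly on the bow_corpus and dictionary built earlier with gensim. A minimal sketch; num_topics=10 is an arbitrary choice and, unlike BERTopic below, has to be fixed in advance.
from gensim.models import LdaMulticore
# Train LDA on the bag-of-words corpus; the number of topics is chosen up front.
lda_model = LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
for idx, topic in lda_model.print_topics(num_words=5):
    print('Topic {}: {}'.format(idx, topic))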

BERTopic

Applies sentence embeddings to capture the semantic meaning of sentences as well as syntactic overlap
Unlike LDA approaches, the number of topics can be generated automatically
Can experiment with different transformer models / sentence transformers to embed documents
Sentence transformers are designed to handle noise, stop words, etc., so stop words should not be removed
Experiment with inserting soft/fuzzy clustering methods into BERTopic

For each document, we can also visualize the probability of that document belonging to each possible topic (see the sketch below).
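
A minimal BERTopic sketch covering these points. calculate_probabilities=True yields the per-document topic distribution mentioned above; raw_documents is a placeholder for the list of untouched transcript strings (no stop word removal), and the embedding model name is just one sentence-transformer option.
from bertopic import BERTopic
# Fit BERTopic on raw documents; the number of topics is discovered automatically.
topic_model = BERTopic(embedding_model='all-MiniLM-L6-v2', calculate_probabilities=True)
topics, probs = topic_model.fit_transform(raw_documents)  # raw_documents: list of strings (placeholder)
# Inspect the discovered topics and the topic distribution of one document.
print(topic_model.get_topic_info().head())
topic_model.visualize_distribution(probs[0])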