Topic Modeling Algorithms

Topic modeling algorithms assume that every document is composed of a set of topics (or, in the simplest case, a single topic), and that every topic is composed of some combination of words. The model then examines the underlying data to discover the groups of words that best describe each document under the given constraints. Topic modeling comprises a set of techniques for discovering and summarizing large quantities of text quickly, in a way that leads to comprehension and insight.
Our objectives can be met by addressing the following steps:
Preprocess, clean, and vectorize transcripts
Dimensionality reduction for text vectors
Soft-clustering algorithms to assign multiple topics to one group of transcripts
Visualization and metrics to evaluate topic clustering performance

Text Data Preparation


Topic Modeling Preprocessing

Tokenization: a tokenizer breaks a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens.
Remove punctuation using regular expressions
Lower casing using regular expressions
Explore and remove redundant words and stop words
Remove meaningless rare words or low-frequency words
Stemming and lemmatization
Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
Words with 3 characters or fewer are removed.
All stopwords are removed.
Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed into present.
Words are stemmed — words are reduced to their root form.
import gensim
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')  # required by the WordNet lemmatizer

stemmer = SnowballStemmer('english')

# Perform lemmatization and stemming on each token in the data set.
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

# Select a document to preview after preprocessing.
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)

print('\n\ntokenized and lemmatized document: ')
print(preprocess(doc_sample))

Convert Text to Numerical Representations

Both Bag of Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency) are methods for converting text data into numeric representations, making them suitable for various types of modeling in machine learning and natural language processing.

Bag of Words

Create a dictionary from ‘processed_docs’ that maps each word to an ID and records the number of times each word appears in the training set.
For each document, we then create a bag-of-words representation reporting which words appear and how many times they appear. Save this to ‘bow_corpus’, then check our previously selected document, as shown in the sketch after the code below.
dictionary = gensim.corpora.Dictionary(processed_docs)

# Preview the first few (id, word) pairs in the dictionary.
count = 0
for k, v in dictionary.items():  # iteritems() was removed in gensim 4.x
    print(k, v)
    count += 1
    if count > 10:
        break

# Gensim doc2bow: convert each document into a list of (word_id, count) pairs.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
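To check the document we selected earlier, a minimal sketch that prints its BoW representation; it assumes the sample document sits at position 4310 in bow_corpus (i.e., that corpus order matches the index used above).
# Preview the BoW representation of the sample document chosen earlier.
bow_doc_sample = bow_corpus[4310]
for word_id, count in bow_doc_sample:
    print('Word {} ("{}") appears {} time(s).'.format(word_id, dictionary[word_id], count))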

TF-IDF

Create a TF-IDF model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call it ‘corpus_tfidf’. Finally, preview the TF-IDF scores for our first document.
from gensim import models
from pprint import pprint

tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

# Preview TF-IDF scores for the first document.
for doc in corpus_tfidf:
    pprint(doc)
    break

Word & Document Embedding

Word2Vec and GloVe are techniques that learn to represent words in a high-dimensional vector space where similar words have similar representations. This is achieved by considering the context in which words appear.

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
# Sample sentences
sentences = ["this is a sample sentence", "word embeddings are useful", "Gensim provides Word2Vec"]
# Tokenization of sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Training the Word2Vec model
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=2)
# Get vector for a word
vector = model.wv['sample']
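As a quick check that similar words end up near each other in the vector space, the trained model can be queried for nearest neighbours; on this tiny toy corpus the results are not meaningful, the sketch only illustrates the call.
# Find the words most similar to 'sample' in the learned vector space.
print(model.wv.most_similar('sample', topn=3))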

Doc2Vec extends the idea of Word2Vec to documents. Instead of learning feature representations for words, it learns them for sentences or documents. Python example with Doc2Vec in Gensim:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
# Sample sentences
documents = ["this is a sample sentence", "document embeddings are useful", "Gensim provides Doc2Vec"]
# Tagging documents
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(documents)]
# Training the Doc2Vec model
model = Doc2Vec(tagged_data, vector_size=100, window=5, min_count=1, workers=2)
# Get vector for a document
vector = model.infer_vector(word_tokenize("sample sentence"))
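A possible follow-up, assuming gensim 4.x (where document vectors live in model.dv; older versions expose model.docvecs instead), is to rank the tagged training documents by similarity to the inferred vector.
# Find the tagged training documents closest to the inferred vector.
print(model.dv.most_similar([vector], topn=2))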

Latent Semantic Analysis (LSA)

LSA transforms text into numeric vectors through a process involving the creation of a document-term matrix (which is a numeric representation) and then applying Singular Value Decomposition (SVD) to this matrix.
The result is a set of vectors for each document and term in a lower-dimensional space, effectively capturing the key themes or concepts in the data.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["this is a sample document", "LSA uses singular value decomposition", "text analysis with LSA"]

# Create a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X = tfidf_vectorizer.fit_transform(documents)

# Perform LSA
lsa = TruncatedSVD(n_components=2)
X_lsa = lsa.fit_transform(X)
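To see the key themes each LSA component captures, one can inspect the highest-weighted terms per component. A minimal sketch, assuming scikit-learn >= 1.0 for get_feature_names_out():
import numpy as np

# Top terms per LSA component (each component is a weighted mix of vocabulary terms).
terms = tfidf_vectorizer.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top_terms = [terms[j] for j in np.argsort(component)[::-1][:3]]
    print(f"Component {i}: {top_terms}")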


Transformer Models

BERT and other transformer-based models convert text into numeric vectors using a process called embedding. Each word or token is mapped to a vector in a high-dimensional space. The model then processes these vectors using its neural network layers to capture contextual relationships, resulting in a new set of vectors that encapsulate the semantic meaning of words, sentences, or entire documents.
from transformers import BertModel, BertTokenizer
import torch
# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Sample text
text = "Here is some text to encode"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
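To turn the token-level output into a single vector for the whole text, one common approach (an assumed choice here, not the only one) is mean pooling over the last hidden states.
# output.last_hidden_state has shape (1, num_tokens, 768) for bert-base-uncased.
# Mean-pool across tokens to get one 768-dimensional vector for the whole text.
sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])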

Dimensionality Reduction Techniques

t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are techniques for reducing the dimensionality of data to make it easier to visualize.
While t-SNE and UMAP themselves do not directly convert text into numeric vectors, they are used to reduce the dimensionality of existing high-dimensional numeric vectors. They are typically applied to embeddings or other forms of high-dimensional text representations to produce a 2D or 3D representation for visualization or further analysis.
from sklearn.manifold import TSNE
# Assuming X is your high-dimensional data matrix
X_embedded = TSNE(n_components=2).fit_transform(X)
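UMAP is used in much the same way; a minimal sketch, assuming the third-party umap-learn package is installed.
import umap  # pip install umap-learn

# Reduce the same high-dimensional matrix X to 2 dimensions with UMAP.
X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X)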

Topic Modeling


A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts
A Survey of Topic Modeling in Text Mining


BERTopic

Applies sentence embeddings to capture semantic meaning and syntactic overlap in sentences
Unlike LDA approaches, the number of topics can be generated automatically
Can experiment with different transformer models/sentence transformers to embed documents
Sentence transformers are designed to handle noise, stop words, etc., so stop words should not be removed before embedding
Experiment with inserting soft/fuzzy clustering methods into BERTopic (a minimal usage sketch follows this list)
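A minimal end-to-end sketch of fitting BERTopic on raw documents; docs is assumed to be a list of strings, and the sentence-transformer name is only an example.
from bertopic import BERTopic

# Embed, reduce, cluster, and extract topics in one call; stop words stay in the raw text.
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())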

For each document, we can also visualize the probability of that document belonging to each possible topic.
Merging topics or topic reduction: reduce the number of topics by merging pairs of topics that are most similar to each other, as indicated by the cosine similarity between their c-TF-IDF vectors.
We can also set the number of topics so that BERTopic reduces them automatically (nr_topics=20 or nr_topics="auto"), or reduce the number of topics of an already trained model using its topics and probabilities (see the sketch below).
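A brief sketch of the two nr_topics options mentioned above; only the constructor form is shown, since the exact signature of the post-hoc reduce-topics method varies between BERTopic versions.
from bertopic import BERTopic

# Merge similar topics down to a fixed target, or let BERTopic decide automatically.
model_fixed = BERTopic(nr_topics=20)
model_auto = BERTopic(nr_topics="auto")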
BERTopic is a topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
c-TF-IDF: class-based TF-IDF
# Update topic representation by increasing n-gram range and removing english stopwords
model.update_topics(docs, topics, n_gram_range=(1, 3), stop_words="english")
# use a custom CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 3), stop_words="english")
model.update_topics(docs, topics, vectorizer=cv)
Custom embeddings for documents: SentenceTransformer, Doc2Vec, etc.
It is also possible to use TF-IDF vectors of the documents as input for BERTopic. In that case, fit_transform is slow because reducing the dimensionality of a large sparse matrix takes some time.
from bertopic import BERTopic
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TF-IDF sparse matrix
vectorizer = TfidfVectorizer(min_df=5)
embeddings = vectorizer.fit_transform(docs)
# Model
model = BERTopic(stop_words="english")
topics, probabilities = model.fit_transform(docs, embeddings)
LDA represents each document as a bag of words, where the order of the words is not considered. BERTopic represents each document using a dense vector that captures the meaning of the text.
Handling of word embeddings: LDA works directly on word counts and does not use word embeddings, while BERTopic uses a pre-trained BERT/sentence-transformer model to encode the documents into dense vectors.
LDA requires the user to specify the number of topics to be extracted, whereas BERTopic determines the number of topics automatically (via HDBSCAN clustering, which can also be used as a soft-clustering method).
BERTopic can better capture the nuances and complexities of language. BERT has been trained on a large text corpus and can encode the documents into dense vectors that capture their semantic meaning. BERTopic can also handle long documents, whereas LDA struggles with longer texts.
One disadvantage of BERTopic is that it can be computationally expensive and require significant resources to fine-tune the BERT model and cluster the documents. LDA, on the other hand, is relatively simple and computationally efficient.
Compare BERTopic, Top2Vec, and CTM by topic coherence, topic diversity, and computational time.

LDA

LDA approaches consider the words themselves without considering their actual semantic meaning in sentences
Need to specify the number of topics
LDA is a probabilistic model that represents documents as mixtures of topics. It starts with a document-term frequency matrix (numeric) and uses this to infer the distribution of topics in documents and the distribution of words in topics.
The output is a set of numeric vectors representing the probability distribution of topics in each document (a minimal gensim sketch follows the list below).
There are 2 parts in LDA:
The words that belong to a document, that we already know.
The words that belong to a topic or the probability of words belonging into a topic, that we need to calculate. (LDA algorithm aims to find words that belong to one topic)
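Building on the bow_corpus and dictionary created earlier, a minimal sketch of training an LDA model with gensim; num_topics=10 and the other hyperparameters are example values.
import gensim

# Train an LDA model on the BoW corpus built earlier.
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary,
                                       passes=2, workers=2)
# Show the highest-weighted words for each discovered topic.
for idx, topic in lda_model.print_topics(num_words=5):
    print('Topic {}: {}'.format(idx, topic))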

Guided LDA

Documentation
Change LDA to Semi-supervised GuidedLDA
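A rough sketch of seeding topics with the third-party guidedlda package; the package, seed words, and hyperparameters are illustrative assumptions, with X a document-term count matrix and word2id a mapping from words to column indices.
import guidedlda  # pip install guidedlda

# Seed a few topics with anchor words to steer the model (semi-supervised).
seed_topic_list = [['price', 'refund', 'charge'], ['delivery', 'shipping', 'package']]
seed_topics = {}
for topic_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        seed_topics[word2id[word]] = topic_id

model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)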

Labeled LDA
