Topic modeling algorithms assume that every document is composed of a mixture of topics (or, in some cases, a single topic), and that every topic is composed of some combination of words. The model then assesses the underlying data to discover the “groups” of words that best describe each document under the given constraints. Topic modeling comprises a set of techniques for discovering and summarizing large quantities of text quickly, in a way that leads to comprehension and insight.
Our objectives can be met by addressing the following steps:
Preprocess, clean and vectorize transcripts
Dimensionality reduction for the text vectors
Soft-clustering algorithms to assign multiple topics to each transcript
Visualization and metrics to evaluate topic clustering performance
Text Data Preparation
Topic Modeling Preprocessing
Tokenization: a tokenizer breaks a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens.
Remove punctuation using regular expressions
Lower casing using regular expressions
Explore and remove redundant words and stop words
Remove meaningless rare words or low-frequency words
Stemming and lemmatization
Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
Words that have fewer than 3 characters are removed.
All stopwords are removed.
Words are lemmatized — inflected forms are reduced to their dictionary form (for example, verbs in past or future tense become present tense).
Words are stemmed — words are reduced to their root form.
Both Bag of Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency) are methods for converting text data into numeric representations, making them suitable for various types of modeling in machine learning and natural language processing.
Bag of Words
Create a dictionary from ‘processed_docs’ containing the number of times a word appears in the training set.
For each document, we create a dictionary reporting how many words it contains and how many times those words appear. Save this to ‘bow_corpus’, then check our selected document from earlier.
from gensim import corpora
dictionary = corpora.Dictionary(processed_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
TF-IDF
Create a tf-idf model object using models.TfidfModel on ‘bow_corpus’ and save it to ‘tfidf’, then apply the transformation to the entire corpus and call it ‘corpus_tfidf’. Finally, we preview the TF-IDF scores for our first document.
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break
Word & Document Embedding
Word2Vec and GloVe are techniques that learn to represent words in a high-dimensional vector space where similar words have similar representations. This is achieved by considering the context in which words appear.
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
# Sample sentences
sentences = ["this is a sample sentence", "word embeddings are useful", "Gensim provides Word2Vec"]
# Tokenization of sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Training the Word2Vec model
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=2)
# Get vector for a word
vector = model.wv['sample']
Doc2Vec extends the idea of Word2Vec to documents. Instead of learning feature representations for words, it learns them for sentences or documents. Python example with Doc2Vec in Gensim:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
# Sample sentences
documents = ["this is a sample sentence", "document embeddings are useful", "Gensim provides Doc2Vec"]
# Tagging documents
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) for i, doc in enumerate(documents)]
# Training the Doc2Vec model
model = Doc2Vec(tagged_data, vector_size=100, window=5, min_count=1, workers=2)
LSA transforms text into numeric vectors through a process involving the creation of a document-term matrix (which is a numeric representation) and then applying Singular Value Decomposition (SVD) to this matrix.
The result is a set of vectors for each document and term in a lower-dimensional space, effectively capturing the key themes or concepts in the data.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = ["this is a sample document", "LSA uses singular value decomposition", "text analysis with LSA"]
# Build the document-term matrix, then apply SVD to project into a 2-dimensional concept space
X = TfidfVectorizer().fit_transform(documents)
doc_vectors = TruncatedSVD(n_components=2).fit_transform(X)
BERT and other transformer-based models convert text into numeric vectors using a process called embedding. Each word or token is mapped to a vector in a high-dimensional space. The model then processes these vectors using its neural network layers to capture contextual relationships, resulting in a new set of vectors that encapsulate the semantic meaning of words, sentences, or entire documents.
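As a sketch, contextual vectors can be extracted with the Hugging Face transformers library. This assumes transformers and torch are installed and the public bert-base-uncased checkpoint is available; mean pooling over token vectors is one common choice for a sentence vector, not the only one.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Topic modeling with BERT embeddings", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; mean-pool them into a sentence-level vector
token_vectors = outputs.last_hidden_state    # shape: (1, num_tokens, 768)
sentence_vector = token_vectors.mean(dim=1)  # shape: (1, 768)
```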
t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are techniques for reducing the dimensionality of data to make it easier to visualize.
While t-SNE and UMAP themselves do not directly convert text into numeric vectors, they are used to reduce the dimensionality of existing high-dimensional numeric vectors. They are typically applied to embeddings or other forms of high-dimensional text representations to produce a 2D or 3D representation for visualization or further analysis.
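For example, scikit-learn's t-SNE can project TF-IDF vectors down to two dimensions for plotting. The documents and the perplexity value below are illustrative; UMAP (via the umap-learn package) offers an analogous fit_transform API.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

docs = ["topic models find themes", "clustering groups documents",
        "themes emerge from text", "documents share topics",
        "text analysis with vectors", "vectors represent documents"]

# High-dimensional TF-IDF vectors, reduced to 2-D points for plotting
X = TfidfVectorizer().fit_transform(docs).toarray()
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
print(coords.shape)
```

Note that perplexity must be smaller than the number of samples, which is why it is set to 2 for this six-document corpus.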