Topic Cluster Evaluation and Topic Validation (WIP)

There is no universally superior method for evaluating topic modeling or text clustering. To assess the performance of topic-based clustering, this section explores various techniques including visualization, intrinsic metrics, and extrinsic metrics.
The evaluation process also involves validating the previously refined topic labels for model training. It's recognized that these refined topics might not be entirely comprehensive. For instance, labels refined by the business team could be overly broad, encompassing a wide range of distinct raw topics, which may need further separation. Alternatively, some labels might be too narrow, suggesting a potential to merge closely related topics into a new, unified category.
The validation of raw versus refined topics acknowledges that the refinement process, especially post-June 2023, might have introduced a few overly generic topics. Originally there were over 160 raw topics, but the operational team has since consolidated these into 36 refined topics. The evaluation method should assess whether this refined set is adequate. We are interested in whether any significant topics are missing from the refined list, whether some topic labels are too generic, and whether some labels inaccurately represent a diverse range of transcripts.
Every topic label contains a collection of transcripts. We can vectorize these transcripts and then use visualizations to analyze their sparsity and proximity, identify outliers, and check whether a label contains more than one dense mass.
We can determine whether there is more than one cluster within a single topic label.
We can evaluate whether certain observations in one topic label lie closer to the centroids of other topic label groups, as in the sketch below.
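A minimal sketch of these checks, assuming the transcripts and their refined labels are available as transcripts and labels (both hypothetical names). TF-IDF vectors plus a 2-D projection support the visual inspection; centroid distances flag observations that sit closer to another label; a forced 2-way split probes for hidden sub-clusters within one label:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# transcripts: list of raw transcript strings; labels: refined topic label per
# transcript (hypothetical names -- substitute the actual data source)
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(transcripts)
X2d = PCA(n_components=2).fit_transform(X.toarray())  # scatter-plot this per label

labels = np.asarray(labels)

# Centroid of each refined topic label in the full TF-IDF space
centroids = {lab: X[labels == lab].mean(axis=0).A1 for lab in np.unique(labels)}

# Flag transcripts that sit closer to another label's centroid than their own
for i, lab in enumerate(labels):
    x = X[i].toarray().ravel()
    nearest = min(centroids, key=lambda l: np.linalg.norm(x - centroids[l]))
    if nearest != lab:
        print(f"transcript {i}: labeled '{lab}', nearest centroid '{nearest}'")

# Probe whether one label hides more than one cluster: a clearly positive
# silhouette for a forced 2-way split suggests the label may need separating
mask = labels == "billing"  # hypothetical label name
if mask.sum() > 10:
    split = KMeans(n_clusters=2, n_init=10).fit_predict(X[mask])
    print("2-way split silhouette:", silhouette_score(X[mask], split))
```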
Examine the relationship between newly generated topic labels and previous topic labels. This examination should be supported by both quantitative metrics and visualization methods to reinforce subjective assessments. The selection of metrics should be based on their alignment with the business or practical objectives of the model.
Comparison-based score: similarity/cohesion between two networks or sets of clusters (new topics vs old topics); see the sketch after this list
Review super-node / compressed-network metrics for selection
Evaluation metrics: customize metrics to the business objective
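One quick way to quantify the raw-vs-refined relationship is to score the two label assignments against each other. A sketch using scikit-learn, assuming old_labels and new_labels (hypothetical names) hold the raw and refined topic per transcript:

```python
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# old_labels / new_labels: one raw and one refined topic label per transcript
# (hypothetical names). Both scores equal 1.0 for identical partitions.
nmi = normalized_mutual_info_score(old_labels, new_labels)
ari = adjusted_rand_score(old_labels, new_labels)
print(f"NMI={nmi:.3f}  ARI={ari:.3f}")

# A contingency table shows which raw topics each refined label absorbed
print(pd.crosstab(pd.Series(old_labels, name="raw"),
                  pd.Series(new_labels, name="refined")))
```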

Topic Evaluation Metrics

Explore intrinsic and extrinsic metrics and understand the theories behind applying them to two or more sets of topics
Intrinsic metrics (understand groups within the data itself): silhouette score, modularity, conductance, perplexity, cohesion, separation, fuzzy partition coefficient, etc. (see the sketch after this list)
Network structural properties: cluster size distribution, inter-cluster connectivity, role of particular nodes
Extrinsic (validate one set of topics against known/reference): normalized mutual information (NMI), adjusted rand index (ARI), jaccard index, purity, omega index, modularity, etc.
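For the intrinsic side, a minimal sketch reusing the X and labels from the earlier snippet, treating each refined topic label as a cluster:

```python
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Treat each refined label as a cluster over the vectorized transcripts
print("silhouette:", silhouette_score(X, labels))                    # in [-1, 1]; higher is better
print("davies-bouldin:", davies_bouldin_score(X.toarray(), labels))  # >= 0; lower is better
```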

Topic Coherence

Topic models rest on the assumption that the latent space they discover is generally meaningful and useful; evaluating this assumption is challenging because the training process is unsupervised.

Perplexity Measure

Perplexity is another intrinsic evaluation metric and is widely used for language model evaluation. It is measured as the normalized log-likelihood of a held-out test set: it quantifies how well the model reproduces the statistics of the held-out data, i.e., how well the topic model predicts new or unseen data. It reflects the generalization ability of the model.
However, recent studies have shown that predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated.
Optimizing for perplexity may not yield human-interpretable topics
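A sketch of the measurement with gensim, assuming tokenized transcripts in train_texts and heldout_texts (hypothetical names); the topic count of 36 simply mirrors the refined label count and is an assumption to tune:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# train_texts / heldout_texts: lists of token lists (hypothetical names)
dictionary = Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(t) for t in train_texts]
heldout_corpus = [dictionary.doc2bow(t) for t in heldout_texts]

lda = LdaModel(train_corpus, id2word=dictionary, num_topics=36, passes=5)

# log_perplexity returns the per-word likelihood bound on held-out data;
# gensim reports perplexity as 2 ** (-bound), so lower = better generalization
bound = lda.log_perplexity(heldout_corpus)
print("held-out perplexity:", 2 ** (-bound))
```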

Topic Coherence

Topic coherence is a measure of how well the words in a topic relate to each other. It reflects the human intuition of what makes a good topic. A high coherence score means that the topic is consistent, clear, and relevant.
This limitation of the perplexity measure motivated further work on modeling human judgment, and thus topic coherence. The concept of topic coherence combines a number of measures into a framework to evaluate the coherence between topics inferred by a model.
A set of statements or facts is said to be coherent if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is “the game is a team sport”, “the game is played with a ball”, “the game demands great physical efforts”.
C_v measure is based on a sliding window, one-set segmentation of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity
C_p is based on a sliding window, one-preceding segmentation of the top words and the confirmation measure of Fitelson’s coherence
C_uci measure is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given top words
C_umass is based on document cooccurrence counts, a one-preceding segmentation and a logarithmic conditional probability as confirmation measure
C_npmi is an enhanced version of the C_uci coherence using the normalized pointwise mutual information (NPMI)
C_a is based on a context window, a pairwise comparison of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity (see the sketch after this list)
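Several of these measures are exposed through gensim's CoherenceModel (C_p and C_a are not implemented there). A sketch assuming tokenized transcripts in texts and per-topic top-word lists in topics (hypothetical names; the topics can come from LDA, BERTopic, etc.):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# texts: list of token lists; topics: list of top-N word lists per topic
# (hypothetical names)
dictionary = Dictionary(texts)
for measure in ("c_v", "c_npmi", "c_uci", "u_mass"):
    cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                        coherence=measure, topn=10)
    print(measure, cm.get_coherence())
```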
Eyeballing Models
Top N words
Topics / Documents

Topic Representation

Cluster-based TF-IDF

c-TF-IDF is a class-based variant of TF-IDF that allows us to extract what makes each set of documents unique compared to the others.
When you apply TF-IDF as usual to a set of documents, you are essentially comparing the importance of words between documents. In c-TF-IDF, all documents in a class (topic) are concatenated and treated as a single document, so the resulting score instead highlights the words that are important to a topic relative to the other topics.
Pick the top N words per topic based on their c-TF-IDF scores. The higher the score, the more representative the word should be of its topic, as the score is a proxy for information density.
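A minimal sketch of the computation, following the c-TF-IDF formulation popularized by BERTopic and assuming docs_per_topic (hypothetical name) maps each topic label to the concatenation of its transcripts:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# docs_per_topic: {topic_label: one string joining all transcripts in that
# topic} (hypothetical -- concatenate each topic's transcripts beforehand)
labels_order = list(docs_per_topic)
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs_per_topic[l] for l in labels_order).toarray()

tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency within each class
A = counts.sum() / len(labels_order)             # average word count per class
idf = np.log(1 + A / counts.sum(axis=0))         # rarity of each term across classes
ctfidf = tf * idf

# Top 10 most representative words per topic
words = vectorizer.get_feature_names_out()
for i, lab in enumerate(labels_order):
    top = words[np.argsort(ctfidf[i])[::-1][:10]]
    print(lab, "->", ", ".join(top))
```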

MIND-Lab OCTIS
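OCTIS (Optimizing and Comparing Topic models Is Simple) from MIND-Lab packages datasets, models, and both coherence and diversity metrics behind one interface. A sketch based on the library's documented usage; the dataset name and topic count here are placeholders:

```python
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# Fetch a bundled benchmark dataset (a custom corpus can be loaded instead
# via dataset.load_custom_dataset_from_folder)
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

model_output = LDA(num_topics=25).train_model(dataset)

npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure="c_npmi")
diversity = TopicDiversity(topk=10)
print("NPMI coherence:", npmi.score(model_output))
print("topic diversity:", diversity.score(model_output))
```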


