Zinnia L1-L2 Topic Modeling

Call Categorization Plan

Call Categorization Phase 2

Objective 1: Build a Level-2 call categorization solution for Zinnia.
Objective 2: Build one MVP operational process to generate Level 1 and 2 categories for new clients.

Zinnia L2 Clustering Roadmap

Definitions: L1 topics, L2 topics, and outliers

Level-1 topics: L1 topics are broader, general topics that ideally should be mutually exclusive and independent. In practice, we observed that some existing topics share similarities and may have hierarchical or sequential relationships. This is often because some L1 topics share keywords and phrases, and some topics appear in the same conversations. Consequently, there may not be a clear distinction between two L1 topics, which can lead agents to mix them up when assigning L1 labels.
Level-2 topics: L2 topics are more specific topics generated by clustering algorithms, each within the context of one designated Level-1 topic. L2 topics are not labeled by agents and can be named automatically using the top-n most frequent words and phrases. The number of L2 topics within one L1 topic has not been determined, but it should not be too large.
Outliers: some transcripts contain excessive irrelevant information, or relate to an emerging L1 or L2 topic for which not enough records have been accumulated. Outliers can be identified and accumulated for later L1 prediction or L2 clustering.


Flowchart-L1-L2-topic-modeling_area-[1712669803085].png

Experiment: Chunk-based L2 Clustering


Chunk-based Embedding

During development, we tested mid-size language embedding models that support longer context windows. However, embedding a single transcript could take 1 to 2 minutes, which is too slow for our use cases. We therefore shifted to efficient, fast sentence transformers for experiments, which typically have a maximum length limit of 256 to 512 tokens.
image.png
Most of our transcripts range from 200 to 2000 tokens. Using efficient sentence transformers, we can embed 1000 to 2000 transcripts in 1 to 2 minutes. In this section, we explore how to build chunks from a single transcript and cluster these chunk-based texts to generate topics for each transcript.
Experiment: chunk-based embedding with max_length=256 and overlap=64, chunks split by token count
1000 sample records: 59 s
2000 sample records: 2 min 2 s
Screenshot 2024-07-04 at 10.06.16 AM.png
Experiment: chunk-based embedding with max_length=256 and overlap=64, chunks split by word count
1000 sample records: 32 s
2000 sample records: 1 min 4 s
Screenshot 2024-07-04 at 10.05.35 AM.png
Even though splitting chunks by token count matches the model limits more closely, computing and embedding with token counts is more complicated and slower than using word counts directly. Since one of our objectives is to reduce latency, we use word counts to build chunks.
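A minimal sketch of word-count chunking with overlap, assuming the chunk size 256 and overlap 64 from the experiments above (the function name is ours):

# Split a transcript into overlapping chunks based on word counts.
def build_word_chunks(text: str, chunk_words: int = 256, overlap_words: int = 64) -> list[str]:
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, max(len(words) - overlap_words, 1), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
    return chunks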

Aggregate Chunk Embeddings

With chunk embeddings generated using the chunk word count and overlap word count, we can now aggregate embeddings for each transcript. There are three classic aggregation methods, as shown below; other approaches could be explored in the future.
Screenshot 2024-07-04 at 10.13.15 AM.png
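As a reference, a minimal sketch of common aggregation strategies; we assume mean, max, and a length-weighted mean here (defer to the screenshot above if the three methods differ):

import numpy as np

# Aggregate chunk embeddings into a single transcript embedding.
# chunk_embeddings: (n_chunks, 384) array from the sentence transformer;
# chunk_lengths: word counts per chunk (used only by the weighted variant).
def aggregate_embeddings(chunk_embeddings, chunk_lengths=None, method="mean"):
    chunk_embeddings = np.asarray(chunk_embeddings)
    if method == "mean":
        return chunk_embeddings.mean(axis=0)
    if method == "max":
        return chunk_embeddings.max(axis=0)
    if method == "weighted_mean":
        w = np.asarray(chunk_lengths, dtype=float)
        return (chunk_embeddings * w[:, None]).sum(axis=0) / w.sum()
    raise ValueError(f"Unknown method: {method}")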

Exp-Aggregate Embedding + UMAP + HDBSCAN for Withdrawal Merged

We select 2000 example records from the merged Withdrawal topic and examine whether clustering the aggregated embeddings can produce useful L2 topics.
Aggregate chunk-based embedding for each transcript using “mean”
Screenshot 2024-07-04 at 10.22.00 AM.png
df.update_topic_name.unique()
array(['Withdrawal Merged'], dtype=object)
Apply UMAP to reduce 384 dimensions to 5 dimensions
Screenshot 2024-07-04 at 10.27.31 AM.png

Use the 5-dimensional UMAP vectors for HDBSCAN clustering
Test min_topic_size values of 50, 100, and 200
Screenshot 2024-07-04 at 10.30.32 AM.png
Screenshot 2024-07-04 at 10.30.48 AM.png
Screenshot 2024-07-04 at 10.30.53 AM.png
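A minimal sketch of the reduction and clustering steps above; the random_state value is our assumption:

from umap import UMAP
from hdbscan import HDBSCAN

# 384-d aggregated embeddings -> 5-d UMAP vectors -> HDBSCAN clusters.
# min_cluster_size mirrors the min_topic_size values tested (50, 100, 200).
umap_model = UMAP(n_components=5, random_state=22)
reduced = umap_model.fit_transform(embeddings)  # embeddings: (2000, 384)

clusterer = HDBSCAN(min_cluster_size=100)
labels = clusterer.fit_predict(reduced)  # -1 marks outliers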
It turns out that the algorithm identifies conjunctions, pronouns, definite articles, and prepositions as topic words. These words are fundamental to English sentence structure, but they are not meaningful to the business.
This method struggles to extract meaningful topics (keywords and key phrases) from entire text chunks because the relevant words we need may constitute only 2-5% of the total, with the majority being conjunctions, pronouns, definite articles, and prepositions. We therefore shift our approach to extract relevant words first and then apply clustering.

Exp-Test Chunk Keywords Clustering

Another approach is to use fine-tuned keyword transformers to extract relevant keywords and key phrases from transcripts. Since the pretrained backend models of the KeyBERT library also have maximum length limits, we apply chunk-based extraction for keywords and phrases.
Screenshot 2024-07-04 at 10.49.22 AM.png
We continue to use all-MiniLM-L6-v2 as our backend model because it is small, much faster, and achieves performance relatively close to larger models.
Quick test for chunk keywords L2 clustering for Withdrawal Merged
Screenshot 2024-07-04 at 11.40.58 AM.png
Example functions for building word chunks and extracting keywords using KeyBERT. Hyperparameter values can be tuned in the future.
Screenshot 2024-07-04 at 11.49.58 AM.png
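A minimal sketch of chunk-based extraction with KeyBERT, assuming the MiniLM backend and the top-5 n-grams per chunk used in the experiments (the function name is ours):

from keybert import KeyBERT

kw_model = KeyBERT(model="sentence-transformers/all-MiniLM-L6-v2")

# Extract the top-n key n-grams from each word chunk of a transcript.
def extract_chunk_keywords(chunks, top_n=5):
    keywords = []
    for chunk in chunks:
        pairs = kw_model.extract_keywords(
            chunk,
            keyphrase_ngram_range=(1, 3),
            stop_words="english",
            top_n=top_n,
        )
        keywords.extend(phrase for phrase, score in pairs)
    return keywords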
Processing transcripts takes longer here because each chunk must be embedded and analyzed to extract keywords and phrases.
Using a chunk word count of 256, an overlap word count of 6, top-5 key n-grams per chunk, and default values for the other parameters, chunk-based keyword extraction takes 6 minutes and 30 seconds for 2000 transcripts.
Screenshot 2024-07-04 at 11.58.55 AM.png
We identified tags misclassified by the deployed anonymization model for date_time and SSN. We can further refine the output by removing these inappropriately identified words and phrases from the concatenated keyword chunks.
Screenshot 2024-07-04 at 12.02.41 PM.png
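A minimal sketch of this cleaning step; the exact tag strings to strip are assumptions based on the misclassified date_time and SSN tags noted above:

import re

# Tags the anonymization model misapplied (assumed spellings).
BAD_TAGS = re.compile(r"\b(date_time|ssn)\b", flags=re.IGNORECASE)

# Remove misclassified tags from concatenated keyword chunks.
def clean_keyword_text(text: str) -> str:
    cleaned = BAD_TAGS.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()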
We then apply embedding and UMAP dimensionality reduction to the cleaned keywords from the 2000 transcripts.
These processing and clustering steps generate more reliable clusters for the Withdrawal Merged topic.
Screenshot 2024-07-04 at 12.04.19 PM.png

Exp-Add KeyBERT Hyperparameters


Trade-off: a few less meaningful clusters (fewer outliers) vs. more granular clusters (more outliers)
Screenshot 2024-07-11 at 1.17.31 PM.png
Screenshot 2024-07-11 at 1.17.54 PM.png

Screenshot 2024-07-11 at 1.18.09 PM.png

Train the model and assign topic IDs to records in the L1 topic Withdrawal Merged.
Screenshot 2024-07-11 at 12.58.42 PM.png
Time-ordered topic ID predictions and their probabilities.
Screenshot 2024-07-11 at 12.58.52 PM.png


Zinnia L1-L2 Notes and Planning


Special Exploration for Queue Callback
Window size of small embedding model
Speed of large embedding model
Length of transcripts
Meaningless clustering results
Proportion of useful information

Model size and randomness
Serialization with "safetensors" or "pytorch" keeps the saved model size around 20+ MB.
Saving with safetensors or pytorch means that embedding and dimensionality reduction models are not fully saved.
We found that after reinitializing models saved with safetensors, UMAP always introduces small random changes, even when the random state is controlled. This leads to changes in predictions for text chunks with lower probabilities.
A full model can be saved with pickle, but the resulting file is rather large (often > 500 MB) since all sub-models need to be saved.
Explicit and specific version control is needed, as pickled models typically only load when the environment is exactly the same.
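Assuming a BERTopic-style topic model underlies the pipeline (the UMAP, HDBSCAN, and c-TF-IDF components suggest this), the serialization trade-off looks roughly like this sketch:

from bertopic import BERTopic

# topic_model: a trained BERTopic-style model (assumption; our internal
# wrapper may expose a similar interface).

# Lightweight save (~20+ MB): embedding and UMAP sub-models are not fully
# included, so reloaded predictions can drift for low-probability chunks.
topic_model.save(
    "l2_model_dir",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Full save via pickle: predictions are reproducible, but the file is often
# > 500 MB and loads only under near-identical library versions.
topic_model.save("l2_model_full.pkl", serialization="pickle")

loaded = BERTopic.load("l2_model_dir")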


L2 Topic Model Training


Screenshot 2024-07-15 at 5.13.07 PM.png

Clustering Model for Withdrawal

# Initialize cluster object with default and optimization params
cluster_phrases = ClusteringKeyPhrases(
    embedding_model_name='sentence-transformers/all-MiniLM-L6-v2',
    random_seed=22,
    top_n_words=20,
    n_gram_range=(1, 3),
    min_topic_size=32,
    nr_topics=None,
    n_neighbors=56,
    n_components=9,
    verbose=True
)



Screenshot 2024-07-15 at 5.24.30 PM.png

Screenshot 2024-07-15 at 5.24.54 PM.png
Screenshot 2024-07-15 at 5.25.21 PM.png

Clustering Model for Trade

# Initialize cluster object with default and optimization params
cluster_phrases2 = ClusteringKeyPhrases(
    embedding_model_name='sentence-transformers/all-MiniLM-L6-v2',
    random_seed=22,
    top_n_words=20,
    n_gram_range=(1, 3),
    min_topic_size=28,
    nr_topics=None,
    n_neighbors=56,
    n_components=9,
    verbose=True
)
Screenshot 2024-07-15 at 5.20.09 PM.png
Screenshot 2024-07-15 at 5.21.06 PM.png
Screenshot 2024-07-15 at 5.25.58 PM.png

Despite the differences between the L1 topics "Withdrawal" and "Trade," the clustering results reveal shared or highly similar L2 topic names. In fact, a single call transcript can contain multiple small topics that may originate from different L1 topics or be shared across multiple L1 topics.
Screenshot 2024-07-15 at 5.34.48 PM.png
The idea that each L1 topic has its own specific L2 topics isn’t quite accurate, as L2 topics can show up in different L1 topics within the training transcripts. While some L2 topics might be unique to a particular L1 topic, they shouldn’t generally be assigned to other L1 transcripts.
The quality of L1 topic labels might not be perfect since agents label L1 topics for transcripts while dealing with changing subtopics within a single conversation. The direction of a call can shift due to the nature of different business topics. For example, the L1 form topic is highly relevant to the L1 withdrawal topic, but they can also appear independently during a call.
One option is to build a separate clustering model for each of the 21 L1 topics. But there are two main concerns with this approach: first, managing 21 L1 topics and over 20 models can be difficult, with issues like version control, increased latency, and potential upload errors. Second, the same L2 topics might be named slightly differently or given different topic IDs in different models, causing confusion for analysts; this could lead to over 1,000 topic names and IDs, making them hard to manage.
To address these concerns, we decided to sample records from all possible L1 categories and create a relatively balanced dataset to train a single clustering model. While we might miss a few specific small topics unique to a single L1 category, this overall model can still identify new topics as we gather more records and retrain with new updated records. Having one foundational model that can predict most possible L2 topics for all L1 topics will facilitate future iterations and improvements. This approach ensures consistency and manageability while allowing for the addition of new topics as needed.

L2 Topic Model Optimization


Visualize Iteration Results for Optimization

image.png
Best parameters: [284, 2, 10]
Best objective value: -0.11773888766765594
Progress saved to progress_silhouette_2024-07-14.pkl

image.png
Best parameters: [284, 2, 46]
Best objective value: -0.1334598809480667
Progress saved to progress_silhouette_2024-07-15.pkl
image.png
Number of unique clusters vs. outlier percentage
Screenshot 2024-07-16 at 9.36.02 AM.png
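The progress files above come from an iterative search that minimizes the negative silhouette score. A minimal sketch of such a loop, assuming a random search over (n_neighbors, n_components, min_cluster_size); the actual optimizer and parameter order behind "[284, 2, 10]" may differ:

import pickle
import random
from hdbscan import HDBSCAN
from sklearn.metrics import silhouette_score
from umap import UMAP

# embeddings: (n_samples, 384) keyword-chunk embeddings from earlier steps.
def objective(embeddings, n_neighbors, n_components, min_cluster_size):
    reduced = UMAP(n_neighbors=n_neighbors, n_components=n_components,
                   random_state=22).fit_transform(embeddings)
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    mask = labels != -1  # score non-outlier points only
    if mask.sum() < 2 or len(set(labels[mask])) < 2:
        return 0.0  # degenerate clustering; treat as no improvement
    return -silhouette_score(reduced[mask], labels[mask])  # minimize

best_params, best_value = None, 0.0
for _ in range(30):  # iteration budget
    params = (random.randint(10, 300),   # n_neighbors
              random.randint(2, 10),     # n_components
              random.randint(10, 60))    # min_cluster_size
    value = objective(embeddings, *params)
    if value < best_value:
        best_params, best_value = params, value

with open("progress_silhouette.pkl", "wb") as f:
    pickle.dump({"best_params": best_params, "best_value": best_value}, f)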

Primary Model Trained with Optimal Params

Optimized model training with ClusteringKeyPhrases.fit_transform_df_phrases
Screenshot 2024-07-16 at 10.12.38 AM.png
44 topics (clusters 0, 1, 2, ..., 42, plus the -1 outlier topic)
Screenshot 2024-07-16 at 10.14.34 AM.png
Screenshot 2024-07-16 at 10.14.55 AM.png
Refined Name is a cleaned version of the default string names in the Name column.
The number of chunks in the -1 outlier cluster can be reduced by post-model processing (e.g., c-TF-IDF, similarity search).
While most clusters are meaningful, a few are not clearly so. This could be due to errors in the transcription model, the anonymization model, or text processing.
We can refine L2 clusters to be more meaningful or specific by collecting relevant chunk records and merging a newly trained model into this overall model.
Depending on the input text and random state, if a text chunk is not highly similar to any of the pretrained clusters, the final predictions might not be stable. Reinitializing the UMAP dimensionality reduction and HDBSCAN can alter predictions to some extent, potentially assigning different topic IDs to text chunks that do not closely match any pretrained topic.



Overall Model Trained with Optimal Params

In the transcript data, we found that L1 topics are quite broad, and there could be different specific topics within a single call. There isn't a strict boundary separating the subtopics under different L1 topics, which means clients and agents might discuss various subjects for different reasons. Because of this, our goal is to first build a foundational model that can support more general L2 topic predictions. The relationship between L1 and L2 topics is relatively loose—while they may be correlated, they do not necessarily follow a strict parent-subset structure.
Resample the dataset in the Sample Transcripts step to include all possible L1 topics in one overall training dataset.
Screenshot 2024-07-31 at 10.37.10 AM.png
Optimized clustering of specific L2 topics across all L1 topics.
Screenshot 2024-07-31 at 10.38.37 AM.png
Screenshot 2024-07-31 at 10.40.02 AM.png

Predict L2 with Trained Overall Model

L2 topic ids, representations and refined topic names:
The topic contract_number_agent may not be highly relevant for certain use cases. In such cases, we can delve into more detailed chunk-based topics. Alternatively, the end user can review the total number of occurrences of contract_number_agent within the detailed sequential chunk topics to identify other relevant topics and assess the significance of contract_number for this particular call.
Screenshot 2024-07-31 at 11.15.12 AM.png
For Level-2 topic IDs 5 and 32, we can merge them under one topic name, "carrier_specific_names".
For Level-2 topics 0, 19, and 41, we can rename them to "insurance general info".
Rename Level-2 topic 10 to "policy_number_pages".
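If the underlying model exposes a BERTopic-style API, the merges and renames above could be applied roughly as follows (a sketch; our internal ClusteringKeyPhrases wrapper may differ):

# docs: the keyword-chunk documents the model was trained on.
topic_model.merge_topics(docs, topics_to_merge=[[5, 32], [0, 19, 41]])
# Note: topic IDs may be remapped after merging, so re-check IDs before
# assigning custom labels such as "carrier_specific_names" and
# "insurance general info".
topic_model.set_topic_labels({10: "policy_number_pages"})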

Prediction Pipeline

Screenshot 2024-07-31 at 10.32.14 AM.png
Transcript-based predictions: We generate a final topic prediction for each text chunk within a single transcript. Additionally, we provide a sequence of topic IDs based on the order in which they appear over time.
Sequential Prediction: 4 → -1 → 4 → 19 → 2
Prediction Path:
4: Withdrawal, Form, Free
-1: Letter, Account Form, Email
4: Withdrawal, Form, Free
19: Insurance Policy, Services, Number
2: Tax Withholding, Federal, Taxes
Screenshot 2024-07-31 at 10.14.56 AM.png
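A minimal sketch of assembling the transcript-level sequential prediction from chunk-level predictions; the helper names and model API here are hypothetical:

# Produce a time-ordered topic path like 4 -> -1 -> 4 -> 19 -> 2.
def predict_transcript_path(transcript, model, topic_names):
    chunks = build_word_chunks(transcript)      # chunks in time order
    topic_ids, probs = model.transform(chunks)  # one topic id per chunk
    return [(tid, topic_names.get(tid, "outlier"), prob)
            for tid, prob in zip(topic_ids, probs)]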

Chunk-based predictions:
Screenshot 2024-08-01 at 11.29.41 AM.png

Prediction Examples

Contract number
Screenshot 2024-07-30 at 10.54.59 AM.png
Screenshot 2024-07-30 at 10.55.17 AM.png
Screenshot 2024-07-30 at 10.54.34 AM.png
Screenshot 2024-07-31 at 11.18.20 AM.png
