Call Categorization Phase 2
Zinnia L2 Clustering Roadmap
Definitions of L1 topics, L2 topics, and outliers
Level-1 topics: L1 topics are general, broader topics and ideally should be mutually exclusive and independent. In reality, we observed that some existing topics share similarities and may have hierarchical or sequential relationships. This is often because keywords and phrases in some L1 topics are shared, and some topics appear in the same conversations. Consequently, there may not be a clear distinction between two L1 topics, which can result in agents mixing them up when assigning L1 labels.
Level-2 topics: L2 topics are more specific topics generated by clustering algorithms. Level-2 topics exist within the context of one designated Level-1 topic. Level-2 topics are not labeled by agents and can be named automatically using the top-n most frequent words and phrases. The number of L2 topics within one L1 topic has not been determined but should not be too large.
Outliers: there could be transcripts with excessive irrelevant information, or transcripts related to an emerging L1 or L2 topic for which not enough records have been accumulated. Outliers can be identified and accumulated for later L1 prediction or L2 clustering.
Experiment Chunk-based L2 Clustering
Chunk-based Embedding
During development, we tested mid-size language embedding models that support longer context windows. However, embedding a single transcript could take 1 to 2 minutes, which is too slow for our use cases. We therefore shifted to efficient, fast sentence transformers for experiments, which typically have a maximum length limit of 256 to 512 tokens.
Most of our transcripts range from 200 to 2000 tokens. Using efficient sentence transformers, we can embed 1000 to 2000 transcripts in 1 to 2 minutes. In this section, we explore how to build chunks from a single transcript and cluster these chunk-based texts to generate topics for each transcript.
Experiment: chunk-based embedding, max_length=256, overlap=64, split based on token counts: 2000 sample records embedded in 2 min 2 s.
Experiment: chunk-based embedding, max_length=256, overlap=64, split based on word counts: 2000 sample records embedded in 1 min 4 s.
Even though using token counts to create chunks is more relevant to the model's actual input limit, the process of calculating and embedding with token counts is more complicated and slower than using word counts directly. Since one of our objectives is to reduce latency, we use word counts to build chunks.
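A minimal sketch of word-count chunking with overlap, mirroring the settings above (256-word chunks, 64-word overlap); the function name and defaults are illustrative, not the production implementation.
from typing import List

def build_word_chunks(text: str, chunk_words: int = 256, overlap_words: int = 64) -> List[str]:
    # Split a transcript into overlapping chunks based on word counts rather than token counts.
    words = text.split()
    if len(words) <= chunk_words:
        return [text]
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks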
Aggregate Chunk Embeddings
With chunk embeddings generated using the chunk word count and overlapping word count, we can now aggregate embeddings for each transcript. There are three classic aggregation methods; other approaches could be explored in the future.
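A minimal sketch of the "mean" aggregation used in the experiment below, assuming an all-MiniLM-L6-v2 sentence transformer (384-dimensional vectors); max pooling is shown only as one alternative, and the helper name is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def aggregate_transcript_embedding(chunks, method="mean"):
    # Embed each word chunk, then pool into a single 384-dimensional vector per transcript.
    chunk_embeddings = embedder.encode(chunks)       # shape: (n_chunks, 384)
    if method == "mean":
        return np.mean(chunk_embeddings, axis=0)
    if method == "max":
        return np.max(chunk_embeddings, axis=0)
    raise ValueError(f"Unsupported aggregation method: {method}")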
Exp-Aggregate Embedding + UMAP + HDBSCAN for Withdrawal Merged
We select 2000 example records from the merged Withdrawal topic to examine whether clustering aggregated embeddings can produce useful L2 topics.
Aggregate chunk-based embedding for each transcript using “mean”
df.update_topic_name.unique()
array(['Withdrawal Merged'], dtype=object)
Apply UMAP to reduce 384 dimensions to 5 dimensions
Use UMAP 5-dimension vectors for HDBSCAN clustering
Use min_topic_size values of 50, 100, and 200 (see the sketch below)
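A minimal sketch of the reduction-and-clustering steps above, assuming the umap-learn and hdbscan packages; transcript_embeddings stands for the aggregated vectors from the previous step, and the fixed seed is illustrative.
import umap
import hdbscan

# Reduce the 384-dimensional aggregated embeddings to 5 dimensions.
reducer = umap.UMAP(n_components=5, random_state=22)
reduced = reducer.fit_transform(transcript_embeddings)

# Cluster the 5-dimensional vectors; HDBSCAN's min_cluster_size plays the role of min_topic_size.
clusterer = hdbscan.HDBSCAN(min_cluster_size=100)    # also tried with 50 and 200
labels = clusterer.fit_predict(reduced)              # label -1 marks outliers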
It turns out that the algorithm identifies conjunctions, pronouns, definite articles, and prepositions as topic words. These words are fundamental to English sentence structure, but they are not meaningful to the business.
This method struggles to extract meaningful topics (keywords and key phrases) from entire text chunks because the relevant words we need might only constitute 2-5% of the total words, with the majority being conjunctions, pronouns, definite articles, and prepositions. We will shift our approach to extract relevant words first and then apply clustering.
Exp-Test Chunk Keywords Clustering
Another approach is to use fine-tuned keyword transformers to extract relevant keywords and key phrases from transcripts. Since the pretrained backend models of the KeyBERT library also have maximum length limits, we apply chunk-based extraction for keywords and phrases.
We continue to use all-MiniLM-L6-v2 as our backend model because it is small, much faster, and achieves performance relatively close to that of larger models.
Quick test: chunk keyword L2 clustering for Withdrawal Merged
Example functions for building word chunks and extracting keywords using KeyBERT. Hyperparameter values can be tuned in the future.
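A minimal sketch of such functions, reusing the build_word_chunks helper sketched earlier; the KeyBERT backend model and parameter values mirror the settings reported below, but the helper names are illustrative.
from keybert import KeyBERT

kw_model = KeyBERT(model="sentence-transformers/all-MiniLM-L6-v2")

def extract_chunk_keywords(text, chunk_words=256, overlap_words=6, top_n=5):
    # Extract the top-n keywords/key phrases from each word chunk and concatenate them.
    keywords = []
    for chunk in build_word_chunks(text, chunk_words, overlap_words):
        pairs = kw_model.extract_keywords(chunk, keyphrase_ngram_range=(1, 3), top_n=top_n)
        keywords.extend(phrase for phrase, _score in pairs)
    return " ".join(keywords)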
Processing transcripts takes longer here because it involves embedding the chunks and analyzing and extracting keywords and phrases.
Using a chunk word count of 256, an overlap word count of 6, the top-5 key n-grams for each chunk, and otherwise default parameter values, it takes 6 minutes and 30 seconds to complete chunk-based keyword extraction for 2000 transcripts.
We have identified misclassified tags from the deployed anonymization model for date_time and SSN. We can further refine the output by removing these inappropriately identified words and phrases from the concatenated keyword chunks.
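A minimal sketch of this cleanup step; the exact placeholder strings produced by the anonymization model are assumptions here.
import re

# Hypothetical placeholder tags; substitute the real tag strings emitted by the anonymization model.
MISCLASSIFIED_TAGS = ["date_time", "ssn"]
TAG_PATTERN = re.compile(r"\b(" + "|".join(MISCLASSIFIED_TAGS) + r")\b", flags=re.IGNORECASE)

def clean_keywords(keyword_text: str) -> str:
    # Drop misclassified anonymization tags and collapse the resulting extra whitespace.
    cleaned = TAG_PATTERN.sub(" ", keyword_text)
    return re.sub(r"\s+", " ", cleaned).strip()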
We then apply embedding and UMAP dimensionality reduction to the cleaned keywords from the 2000 transcripts.
These processing and clustering steps produce more reliable clusters for the Withdrawal Merged topic.
Exp-Add KeyBERT Hyperparameters
Trade-off: a few broad clusters that can be less meaningful (fewer outliers) vs. more granular clusters (more outliers)
Train the model and assign topic IDs to records in the L1 topic Withdrawal Merged.
Time-ordered topic ID predictions and their probabilities.
Exp-
Zinnia L1-L2 Notes and Planning
Level-1 and Level-2 scoping notes:
Special exploration for Queue Callback
Window size of small embedding models
Speed of large embedding models
Meaningless clustering results
Proportion of useful information
Model size and randomness:
Serialization with "safetensors" or "pytorch" keeps the saved model size to 20+ MB. Saving with safetensors or pytorch means that the embedding and dimensionality reduction models are not fully saved. We found that after reinitializing models saved with safetensors, UMAP always introduces small random changes, even when the random state is controlled. This leads to changes in predictions for text chunks with lower probabilities.
A full model can be saved with .pickle, but the resulting model is rather large (often > 500 MB) since all sub-models need to be saved. Explicit and specific version control is needed, as pickled models typically only run if the environment is exactly the same.
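A minimal sketch of the two save options, assuming the underlying topic model exposes a BERTopic-style save API; the paths and embedding model name are illustrative.
# Lightweight save (~20+ MB): embedding and UMAP sub-models are not fully preserved,
# so a reloaded model can show small random shifts in low-probability predictions.
topic_model.save(
    "withdrawal_l2_model",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Full save (often > 500 MB): every sub-model is pickled, so it only loads reliably
# when package versions match the training environment exactly.
topic_model.save("withdrawal_l2_model_full.pickle", serialization="pickle")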
L2 Topic Model Training
Clustering Model for Withdrawal
# Initialize cluster object with default and optimization params
cluster_phrases = ClusteringKeyPhrases(
embedding_model_name='sentence-transformers/all-MiniLM-L6-v2',
random_seed=22,
top_n_words=20,
n_gram_range=(1, 3),
min_topic_size=32,
nr_topics=None,
n_neighbors=56,
n_components=9,
verbose=True
)
Clustering Model for Trade
# Initialize cluster object with default and optimization params
cluster_phrases2 = ClusteringKeyPhrases(
embedding_model_name='sentence-transformers/all-MiniLM-L6-v2',
random_seed=22,
top_n_words=20,
n_gram_range=(1, 3),
min_topic_size=28,
nr_topics=None,
n_neighbors=56,
n_components=9,
verbose=True
)
Despite the differences between the L1 topics "Withdrawal" and "Trade," the clustering results reveal shared or highly similar L2 topic names. In fact, a single call transcript can contain multiple small topics that may originate from different L1 topics or be shared across multiple L1 topics.
The idea that each L1 topic has its own specific L2 topics isn’t quite accurate, as L2 topics can show up in different L1 topics within the training transcripts. While some L2 topics might be unique to a particular L1 topic, they shouldn’t generally be assigned to other L1 transcripts.
The quality of L1 topic labels might not be perfect since agents label L1 topics for transcripts while dealing with changing subtopics within a single conversation. The direction of a call can shift due to the nature of different business topics. For example, the L1 form topic is highly relevant to the L1 withdrawal topic, but they can also appear independently during a call.
One option is to build a separate clustering model for each of the 21 L1 topics. But there are two main concerns with this approach: first, managing 21 separate L1 topics and over 20 models can be difficult, with issues like version control, increased latency, and potential upload errors. Second, the same L2 topics might be named slightly differently or given different topic IDs in different models, causing confusion for analysts. This could lead to over 1,000 topic names and IDs, making them hard to manage.
To address these concerns, we decided to sample records from all possible L1 categories and create a relatively balanced dataset to train a single clustering model. While we might miss a few specific small topics unique to a single L1 category, this overall model can still identify new topics as we gather more records and retrain with new updated records. Having one foundational model that can predict most possible L2 topics for all L1 topics will facilitate future iterations and improvements. This approach ensures consistency and manageability while allowing for the addition of new topics as needed.
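A minimal sketch of the balanced sampling step, assuming the transcripts live in a pandas DataFrame with the update_topic_name column seen earlier; the per-topic cap and seed are illustrative values.
import pandas as pd

def sample_balanced(df: pd.DataFrame, topic_col: str = "update_topic_name",
                    n_per_topic: int = 500, seed: int = 22) -> pd.DataFrame:
    # Sample up to n_per_topic records from each L1 category so no single topic dominates training.
    return (
        df.groupby(topic_col, group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), n_per_topic), random_state=seed))
          .reset_index(drop=True)
    )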
L2 Topic Model Optimization
Visualize Iteration Results for Optimization
Best parameters: [284, 2, 10]
Best objective value: -0.11773888766765594
Progress saved to progress_silhouette_2024-07-14.pkl
Best parameters: [284, 2, 46]
Best objective value: -0.1334598809480667
Progress saved to progress_silhouette_2024-07-15.pkl
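A minimal sketch of the objective minimized in these runs (the negative silhouette score over non-outlier points) and of pickling progress per run as in the file names above; which three hyperparameters map to the logged parameter lists is an assumption here, with n_neighbors, n_components, and min_cluster_size used for illustration.
import pickle
from datetime import date

import hdbscan
import umap
from sklearn.metrics import silhouette_score

def objective(embeddings, n_neighbors, n_components, min_cluster_size):
    # Fit one UMAP + HDBSCAN configuration and return the negative silhouette score
    # (lower is better, matching the "Best objective value" logs above).
    reduced = umap.UMAP(n_neighbors=n_neighbors, n_components=n_components,
                        random_state=22).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    mask = labels != -1                      # score only non-outlier points
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2 or mask.sum() <= n_clusters:
        return 0.0                           # degenerate clustering, no usable score
    return -silhouette_score(reduced[mask], labels[mask])

def save_progress(progress, path=None):
    # Persist tried parameters and scores so long-running searches can be resumed.
    path = path or f"progress_silhouette_{date.today()}.pkl"
    with open(path, "wb") as f:
        pickle.dump(progress, f)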