Build Level-2 call categorization solution for Zinnia
Objective 2: Build one MVP operational process to generate Level 1 and 2 categories for new clients
Zinnia L2 Clustering Roadmap
Definitions of L1 topics, L2 topics, and outliers
Level-1 topics: L1 topics are broader, general topics that ideally should be mutually exclusive and independent. In reality, we observed that some existing topics share similarities and may have hierarchical or sequential relationships. This is often because some L1 topics share keywords and phrases, and some topics appear in the same conversations. Consequently, there may not be a clear distinction between two L1 topics, which can result in agents mixing them up when assigning L1 labels.
Level-2 topics: L2 topics are more specific topics generated by clustering algorithms within the context of one designated Level-1 topic. Level-2 topics are not labeled by agents and can be named automatically using the top-n most frequent words and phrases. The number of L2 topics within one L1 topic has not been determined, but it should not be too large.
Outliers: There may be transcripts with excessive irrelevant information, or transcripts related to an emerging L1 or L2 topic for which not enough records have been accumulated. Outliers can be identified and accumulated for later L1 prediction or L2 clustering.
Experiment Chunk-based L2 Clustering
Chunk-based Embedding
During development, we tested mid-sized language embedding models that support longer text windows. However, embedding a single transcript could take 1 to 2 minutes, which is too slow for our use cases. We therefore shifted to fast, efficient sentence transformers for these experiments, which typically have a maximum length limit of 256 to 512 tokens.
Most of our transcripts range from 200 to 2000 tokens. Using efficient sentence transformers, we can embed 1000 to 2000 transcripts in 1 to 2 minutes. In this section, we explore how to build chunks from a single transcript and cluster these chunk-based texts to generate topics for each transcript.
Embedding time for chunk-based experiments (max_length=256, overlap=64):

| Chunking strategy | 1000 sample records | 2000 sample records |
| --- | --- | --- |
| Split by token counts | 59 s | 2 min 2 s |
| Split by word counts | 32 s | 1 min 4 s |
Although chunking by token count is more faithful to the models' actual input limits, calculating token counts and embedding on that basis is more complicated and slower than using word counts directly. Since one of our objectives is to reduce latency, we build chunks using word counts.
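A minimal sketch of the word-count chunking and embedding step is shown below; the sentence-transformer model name and the exact chunk/overlap values are illustrative assumptions, not the production configuration.

```python
# Minimal sketch: overlapping word-count chunks + sentence-transformer embeddings.
# The model name and chunk/overlap sizes are illustrative assumptions.
from sentence_transformers import SentenceTransformer

def build_word_chunks(text: str, chunk_words: int = 256, overlap_words: int = 64) -> list[str]:
    """Split a transcript into overlapping chunks based on word counts."""
    words = text.split()
    step = chunk_words - overlap_words
    starts = range(0, max(len(words) - overlap_words, 1), step)
    return [" ".join(words[i:i + chunk_words]) for i in starts]

# A compact 384-dimensional sentence transformer (assumed; any similar model works)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_transcript(text: str):
    """Return one embedding per chunk for a single transcript."""
    return embedder.encode(build_word_chunks(text))  # shape: (n_chunks, 384)
```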
Aggregate Chunk Embeddings
With chunk embeddings generated using the chunk word count and overlap word count, we can now aggregate embeddings for each transcript. There are three classic aggregation methods; other approaches could be explored in the future.
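A minimal sketch of the aggregation step, assuming the per-chunk embeddings from above; "mean" is the method used in the experiment below, while max and the weighted mean shown here are illustrative alternatives.

```python
import numpy as np

def aggregate_chunk_embeddings(chunk_embeddings, method: str = "mean"):
    """Collapse per-chunk embeddings into a single transcript-level vector."""
    chunk_embeddings = np.asarray(chunk_embeddings)
    if method == "mean":
        return chunk_embeddings.mean(axis=0)
    if method == "max":
        return chunk_embeddings.max(axis=0)
    if method == "weighted_mean":
        # Illustrative weighting that emphasizes earlier chunks; any real
        # weighting scheme would need to be chosen empirically.
        weights = np.linspace(1.0, 0.5, num=len(chunk_embeddings))
        return np.average(chunk_embeddings, axis=0, weights=weights)
    raise ValueError(f"Unknown aggregation method: {method}")
```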
Exp-Aggregate Embedding + UMAP + HDBSCAN for Withdrawal Merged
We select 2000 example records from the merged Withdrawal topic to examine whether clustering the aggregated embeddings can produce useful L2 topics.
Aggregate the chunk-based embeddings for each transcript using "mean".
```python
df.update_topic_name.unique()
# array(['Withdrawal Merged'], dtype=object)
```
Apply UMAP to reduce the 384-dimensional embeddings to 5 dimensions.
Use the 5-dimensional UMAP vectors for HDBSCAN clustering.
Try min_topic_size values of 50, 100, and 200 (a minimal sketch of these steps follows below).
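A minimal sketch of the UMAP and HDBSCAN steps above, assuming `embeddings` holds the aggregated 384-dimensional transcript vectors; here min_topic_size is mapped onto HDBSCAN's min_cluster_size, and the remaining parameter values are illustrative.

```python
# Minimal sketch of the UMAP + HDBSCAN steps; `embeddings` is assumed to hold
# the aggregated 384-dimensional transcript vectors from the previous step.
import numpy as np
import umap
import hdbscan

reduced = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(embeddings)

for min_cluster_size in (50, 100, 200):  # mirrors the min_topic_size values tried above
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks outliers
    outlier_pct = float(np.mean(labels == -1)) * 100
    print(f"min_cluster_size={min_cluster_size}: {n_clusters} clusters, {outlier_pct:.1f}% outliers")
```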
It turns out that the algorithm identifies conjunctions, pronouns, definite articles, and prepositions as topic words. These words are fundamental to English sentence structure, but they are not meaningful to the business.
This method struggles to extract meaningful topics (keywords and key phrases) from entire text chunks because the relevant words we need might constitute only 2-5% of the total words, with the majority being conjunctions, pronouns, definite articles, and prepositions. We will shift our approach to extract relevant words first and then apply clustering.
Exp-Test Chunk Keywords Clustering
Another approach is to use fine-tuned keyword transformers to extract relevant keywords and key phrases from transcripts. Since the pretrained backend models of the KeyBERT library also have maximum length limits, we apply chunk-based extraction for keywords and phrases.
We use a compact sentence transformer as the backend model because it is small, much faster, and achieves performance relatively close to that of larger models.
Quick test for chunk keywords L2 clustering for Withdrawal Merged
Example functions for building word chunks and extracting keywords using KeyBERT are sketched below. Hyperparameter values can be tuned in the future.
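A hedged sketch of such functions; the backend model name and the default hyperparameter values are assumptions.

```python
# Hedged sketch of chunk-based keyword extraction with KeyBERT; the backend
# model name and hyperparameter defaults are assumptions.
from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")  # assumed compact backend

def extract_chunk_keywords(text: str, chunk_words: int = 256,
                           overlap_words: int = 6, top_n: int = 5) -> str:
    """Extract the top-n key n-grams from each overlapping word chunk and concatenate them."""
    words = text.split()
    step = chunk_words - overlap_words
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, max(len(words) - overlap_words, 1), step)]
    keywords = []
    for chunk in chunks:
        pairs = kw_model.extract_keywords(
            chunk,
            keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
            stop_words="english",
            top_n=top_n,
        )
        keywords.extend(kw for kw, _score in pairs)
    return " ".join(keywords)
```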
Processing transcripts takes longer here because each chunk must be embedded and analyzed to extract its keywords and phrases.
Using a chunk word count of 256, an overlap word count of 6, the top-5 key n-grams per chunk, and default values for the other parameters, chunk-based keyword extraction takes 6 minutes and 30 seconds for 2000 transcripts.
We have identified tags misclassified by the deployed anonymization model for date_time and SSN. We can further refine the output by removing inappropriately identified words and phrases from the concatenated keyword chunks.
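A hedged sketch of this cleanup step; the tag format is an assumption, and only the date_time and SSN tag names come from the observation above.

```python
import re

# Assumed tag format such as "<date_time>" / "<ssn>"; adjust the pattern to
# match the actual output of the deployed anonymization model.
MISCLASSIFIED_TAGS = re.compile(r"<\s*(date_time|ssn)\s*>", flags=re.IGNORECASE)

def clean_keyword_chunks(keyword_text: str) -> str:
    """Remove misclassified anonymization tags and collapse leftover whitespace."""
    cleaned = MISCLASSIFIED_TAGS.sub(" ", keyword_text)
    return re.sub(r"\s+", " ", cleaned).strip()
```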
We then apply embedding and UMAP dimensionality reduction to the cleaned keywords from the 2000 transcripts.
These processing and clustering methods generate more reliable clustering for Withdrawal Merged topics.
Exp-Add KeyBERT Hyperparameters
Trade-off: a few coarse, less meaningful clusters (fewer outliers) vs. more granular clusters (more outliers).
Train the model and assign topic IDs for records in the L1 topic Withdrawal Merged.
Time-ordered topic ID predictions and their probabilities.
Serialization with "safetensors" or "pytorch" keeps the saved model size to roughly 20+ MB.
Saving with safetensors or pytorch means that embedding and dimensionality reduction models are not fully saved.
We found that after reinitializing models saved with safetensors, UMAP always introduces small random changes, even when the random state is controlled. This leads to changes in predictions for text chunks with lower probabilities.
A full model can be saved with .pickle, but the resulting file is rather large (often > 500 MB) since all sub-models need to be saved.
Explicit and specific version control is needed, as pickled models typically only load if the environment is exactly the same.
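A hedged sketch of the two serialization options, assuming the topic model is a fitted BERTopic-style model (consistent with the safetensors/pytorch and min_topic_size terminology used here); the file paths are illustrative.

```python
from bertopic import BERTopic

# `topic_model` is assumed to be an already-fitted BERTopic instance.

# Lightweight save (~20+ MB): the embedding and dimensionality-reduction
# sub-models are not fully saved, so reloaded predictions can drift slightly
# for chunks with lower probabilities.
topic_model.save("l2_model_light", serialization="safetensors", save_ctfidf=True)

# Full save via pickle: reproducible, but large (often > 500 MB) and tied to
# the exact library versions of the training environment.
topic_model.save("l2_model_full.pickle", serialization="pickle")

reloaded = BERTopic.load("l2_model_light")
```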
L2 Topic Model Training
Clustering Model for Withdrawal
# Initialize cluster object with default and optimization params
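A hedged sketch of what such an initialization might look like, assuming a BERTopic-style pipeline; the component choices and parameter values are illustrative defaults, not the tuned production settings.

```python
# Hedged sketch: initialize the clustering pipeline with default and
# optimization-ready parameters (all values are illustrative).
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0,
                  metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=100, min_samples=10,
                        metric="euclidean", prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",   # assumed compact sentence transformer
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
)
```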
Despite the differences between the L1 topics "Withdrawal" and "Trade," the clustering results reveal shared or highly similar L2 topic names. In fact, a single call transcript can contain multiple small topics that may originate from different L1 topics or be shared across multiple L1 topics.
The idea that each L1 topic has its own specific L2 topics isn't quite accurate, as L2 topics can show up under different L1 topics within the training transcripts. While some L2 topics may be unique to a particular L1 topic, such topics generally should not be assigned to transcripts from other L1 topics.
The quality of L1 topic labels might not be perfect since agents label L1 topics for transcripts while dealing with changing subtopics within a single conversation. The direction of a call can shift due to the nature of different business topics. For example, the L1 form topic is highly relevant to the L1 withdrawal topic, but they can also appear independently during a call.
One option is to build 21 separate clustering models, one for each L1 topic. There are two main concerns with this method: first, managing 21 separate L1 topics and over 20 models is difficult, with issues like version control, increased latency, and potential upload errors. Second, the same L2 topic might be named slightly differently or given different topic IDs in different models, causing confusion for analysts. This could lead to over 1,000 topic names and IDs, making them hard to manage.
To address these concerns, we decided to sample records from all possible L1 categories and create a relatively balanced dataset to train a single clustering model. While we might miss a few specific small topics unique to a single L1 category, this overall model can still identify new topics as we gather more records and retrain with new updated records. Having one foundational model that can predict most possible L2 topics for all L1 topics will facilitate future iterations and improvements. This approach ensures consistency and manageability while allowing for the addition of new topics as needed.
L2 Topic Model Optimization
Visualize Iteration Results for Optimization
```
Best parameters: [284, 2, 10]
Best objective value: -0.11773888766765594
Progress saved to progress_silhouette_2024-07-14.pkl

Best parameters: [284, 2, 46]
Best objective value: -0.1334598809480667
Progress saved to progress_silhouette_2024-07-15.pkl
```
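A hedged sketch of how such a silhouette-driven search might be structured; the mapping of the three parameters, the search ranges, and the random-search strategy are assumptions, and `embeddings` is assumed to hold the cleaned keyword-chunk embeddings.

```python
import datetime
import pickle
import random

from hdbscan import HDBSCAN
from sklearn.metrics import silhouette_score
from umap import UMAP

def objective(n_neighbors: int, min_samples: int, min_cluster_size: int) -> float:
    """Negative silhouette score of the non-outlier points (lower is better)."""
    reduced = UMAP(n_components=5, n_neighbors=n_neighbors, random_state=42).fit_transform(embeddings)
    labels = HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples).fit_predict(reduced)
    mask = labels != -1
    if mask.sum() < 2 or len(set(labels[mask])) < 2:
        return 0.0  # degenerate clustering, treat as neutral
    return -silhouette_score(reduced[mask], labels[mask])

best_params, best_value, progress = None, float("inf"), []
for _ in range(30):  # number of trials is illustrative
    params = [random.randint(50, 300), random.randint(2, 50), random.randint(10, 200)]
    value = objective(*params)
    progress.append((params, value))
    if value < best_value:
        best_params, best_value = params, value

print("Best parameters:", best_params)
print("Best objective value:", best_value)
with open(f"progress_silhouette_{datetime.date.today()}.pkl", "wb") as f:
    pickle.dump(progress, f)
```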
Figure: number of unique clusters vs. outlier percentage.
Primary Model Trained with Optimal Params
Optimized model training with ClusteringKeyPhrases.fit_transform_df_phrases.
Refined Name is a cleaned version of the names derived from the default string names in the Name column.
The number of chunks in the -1 outlier cluster can be reduced by post-model processing (e.g., c-TF-IDF, similarity search).
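A hedged sketch of such post-processing, assuming a fitted BERTopic-style model; `docs` are the keyword-chunk texts and `topics` their original assignments.

```python
# Reassign -1 outlier chunks using c-TF-IDF similarity (or embedding similarity);
# `topic_model`, `docs`, and `topics` are assumed from the training step above.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
# new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")
topic_model.update_topics(docs, topics=new_topics)  # refresh topic representations
```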
While most clusters are meaningful, a few may not be clearly so. This could be due to errors in the text transcription model, the anonymization model, or the text processing.
We can refine L2 clusters to be more meaningful or specific by collecting relevant chunk records and merging a newly trained model with this overall model.
Depending on the input text and random state, if a text chunk is not highly similar to any of the pretrained clusters, the end predictions might not be stable. Reinitiating UMAP dimensionality reduction and HDBSCAN can alter the predictive outcomes to some extent, potentially leading to different assigned topic IDs for text chunks not closely matching any pretrained topics.
Overall Model Trained with Optimal Params
In the transcript data, we found that L1 topics are quite broad, and there could be different specific topics within a single call. There isn't a strict boundary separating the subtopics under different L1 topics, which means clients and agents might discuss various subjects for different reasons. Because of this, our goal is to first build a foundational model that can support more general L2 topic predictions. The relationship between L1 and L2 topics is relatively loose—while they may be correlated, they do not necessarily follow a strict parent-subset structure.
Resample the dataset in the Sample Transcripts step to include all possible L1 topics as one overall training dataset.
Optimized clustering of specific L2 topics across all L1 topics.
Predict L2 with Trained Overall Model
L2 topic ids, representations and refined topic names:
The topic contract_number_agent may not be highly relevant for certain use cases. In such cases, we can delve into more detailed chunk-based topics. Alternatively, the end user can review the total number of occurrences of contract_number_agent within the detailed sequential chunk topics to identify other relevant topics and assess the significance of contract_number for this particular call.
For Level-2 topic IDs 5 and 32, we can merge them under one topic name, "carrier_specific_names".
For Level-2 topics 0, 19, and 41, we can rename them as "insurance general info".
Rename Level-2 topic 10 to "policy_number_pages".
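A hedged sketch of these merge and rename steps, assuming a fitted BERTopic-style model; note that merging remaps topic IDs, so the IDs used in the labeling step below are hypothetical.

```python
# Merge similar L2 topics, then attach human-readable labels.
# `topic_model` and `docs` (the training keyword chunks) are assumed from above.
topic_model.merge_topics(docs, topics_to_merge=[[5, 32], [0, 19, 41]])
print(topic_model.get_topic_info().head(15))  # inspect the remapped topic IDs

topic_model.set_topic_labels({
    # Hypothetical post-merge IDs; replace with the IDs shown by get_topic_info()
    3: "carrier_specific_names",
    1: "insurance general info",
    10: "policy_number_pages",
})
```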
Prediction Pipeline
Transcript-based predictions: We generate a final topic prediction for each text chunk within a single transcript. Additionally, we provide a sequence of topic IDs based on the order in which they appear over time.
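A hedged end-to-end sketch of this prediction step, reusing the chunking and keyword-extraction approach sketched earlier with a fitted BERTopic-style model; the helper names and parameter values are assumptions.

```python
# Hedged sketch: per-chunk L2 predictions plus a time-ordered topic ID sequence
# for one transcript. `kw_model` and `topic_model` are assumed from earlier steps.
def predict_transcript_topics(transcript: str, chunk_words: int = 256,
                              overlap_words: int = 6, top_n: int = 5) -> dict:
    words = transcript.split()
    step = chunk_words - overlap_words
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, max(len(words) - overlap_words, 1), step)]
    # Mirror the training pre-processing: keep only extracted key n-grams per chunk
    keyword_chunks = [
        " ".join(kw for kw, _ in kw_model.extract_keywords(
            chunk, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=top_n))
        for chunk in chunks
    ]
    topic_ids, probabilities = topic_model.transform(keyword_chunks)
    if probabilities is None:
        probabilities = [None] * len(topic_ids)
    return {
        "chunk_predictions": list(zip(topic_ids, probabilities)),  # final prediction per chunk
        "topic_sequence": [int(t) for t in topic_ids],             # topic IDs in time order
    }
```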