General L1-L2 Topic modeling (WIP)

Build one MVP operational process to generate Level 1 and 2 categories for new clients
Set a predefined number of topic representations that are expected to appear in the documents.

Embedding Methods

Small Efficient Sentence Transformers

For sentence transformers, the maximum sequence length typically ranges from 128 to 512 tokens. Transcripts are relatively long texts, so to embed them using models like all-MiniLM-L6-v2 we need to handle this length constraint, as these models typically have a maximum token limit (often around 512 tokens). Common strategies:
Truncation: Truncate the text to the model's maximum token limit (e.g., 512 tokens). This is the simplest approach but may result in loss of important information.
Sliding Window: Move a window of the maximum token length across the text with some overlap. Embed each window separately and then aggregate the embeddings (e.g., by averaging).
Chunking: Alternatively, split the text into non-overlapping chunks of the maximum token length and embed each chunk separately.
Average Pooling: Take the average of all these embeddings.
Max Pooling: Take the maximum value for each dimension across all embeddings.
Concatenation: Concatenate the embeddings, but this may increase the dimensionality significantly.
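
A minimal sketch of the sliding-window approach with average pooling, assuming all-MiniLM-L6-v2 via the sentence-transformers library (the window and overlap sizes are illustrative):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_long_text(text: str, window: int = 256, overlap: int = 32) -> np.ndarray:
    # Tokenize once, then slide a window of `window` tokens with `overlap` shared tokens.
    token_ids = model.tokenizer.encode(text, add_special_tokens=False)
    step = window - overlap
    chunks = [
        model.tokenizer.decode(token_ids[start:start + window])
        for start in range(0, max(len(token_ids) - overlap, 1), step)
    ]
    # Embed each window separately, then aggregate by average pooling.
    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)

embedding = embed_long_text("example long transcript text ... " * 500)
print(embedding.shape)  # (384,) for all-MiniLM-L6-v2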
Word and token count distributions in 5 months of call-logger transcripts (500K+) [screenshots]

Distribution of token counts in one small sample [screenshot]

Dimensionality Reduction

Why do we need dimensionality reduction? High-dimensional embeddings can be troublesome for many clustering techniques, as it becomes harder to identify meaningful clusters: clusters grow more diffuse and less distinguishable, making them difficult to accurately identify and separate.
This is the curse of dimensionality, a phenomenon that occurs when dealing with high-dimensional data. The number of possible values grows exponentially with each added dimension, so finding all subspaces becomes increasingly complex, and the concept of distance between points becomes increasingly less precise.
Two well-known methods are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP; McInnes et al., 2018).
Dimensionality reduction won't perfectly capture high-dimensional data in a lower-dimensional representation; some information is always lost. There is a balance between reducing dimensionality and keeping as much information as possible.
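
As a quick baseline, PCA via scikit-learn (random data stands in for the real sentence embeddings; the 384-dim shape assumes a MiniLM-style model):

from sklearn.decomposition import PCA
import numpy as np

embeddings = np.random.rand(10_000, 384).astype(np.float32)  # placeholder for real embeddings
pca_model = PCA(n_components=5)
reduced_embeddings = pca_model.fit_transform(embeddings)
print(pca_model.explained_variance_ratio_.sum())  # variance retained by the 5 components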

UMAP

# Reduce dimensionality using UMAP
import umap

umap_model = umap.UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine',
                       verbose=True, random_state=22, n_jobs=-1)
# Note: umap-learn disables parallelism (overriding n_jobs) when random_state
# is set, to keep results reproducible.
reduced_embeddings = umap_model.fit_transform(embeddings)
UMAP running time on the local machine

Experiment | Number of sentences | Time
1          | 500k                | Did not finish (took very long)
2          | 280k                | Did not finish (took very long)
3          | 100k                | 1 min
4          | 80k                 | 50 s
5          | 60k                 | 38 s
6          | 50k                 | 19 s

Assumptions & Methods

Adapting Models and Methodologies Based on Data and Labels
No Topic Labels:
Approach: Utilize unsupervised learning to generate initial topics. However, human intervention is essential to refine and evaluate the final topic names.
Mix of L1 and L2 Topics:
Approach: Start by clarifying topic relationships and clustering records based on topic categories. For highly specific topics, use supervised methods or class-based TF-IDF (c-TF-IDF) to predict them directly. For general topics that require further specificity, use filter-based clustering to build on top of L1.
Partial L1 Topic Labels:
Approach: Implement semi-supervised learning to identify both existing and new L1 topics. When only some topics are pre-identified and new potential topics are needed, semi-supervised learning with BERTopic can discover them (see the sketch after this list).
Predefined Topics:
Approach: If current topics are satisfactory and there is no need for new topics, treat the task as a supervised topic modeling problem. This approach focuses only on generating predefined topics, avoiding the time and effort required to evaluate new categories.
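
A minimal sketch of the semi-supervised case, using 20 Newsgroups as stand-in data (with our transcripts, partial L1 labels would replace the newsgroup targets; -1 marks unlabeled documents):

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
docs = data.data[:1000]
# Pretend only the first 200 documents carry known L1 labels; -1 = unlabeled.
partial_labels = [int(label) if i < 200 else -1 for i, label in enumerate(data.target[:1000])]

topic_model = BERTopic(verbose=True)
# Passing y steers the dimensionality reduction toward the known labels,
# while unlabeled documents remain free to form new topics.
topics, probs = topic_model.fit_transform(docs, y=partial_labels)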

Deployment Workflow

Productizing an unsupervised HDBSCAN-based pipeline presents several challenges and requires thorough discussion to ensure successful deployment. A key consideration is whether we expect the model to generate new clusters over time. It is important to note that HDBSCAN assigns topic_id with some inherent randomness, and the default topic naming convention relies on c-TF-IDF extraction.
In BERTopic, the prediction_data parameter (on the underlying HDBSCAN model) controls whether the clusterer saves the data needed to make predictions on new documents after the initial training. The implications of setting prediction_data to True or False:
prediction_data = True:
The model will save the necessary data to make predictions on new documents after the initial training. It is useful when we have a dynamic dataset and need to classify new incoming data into the pre-defined topics.
prediction_data = False:
We will not be able to classify new documents into the existing topics unless we retrain the model.
Setting prediction_data = True in BERTopic allows us to classify new documents into the existing topics identified during the initial training. However, it does not enable the creation of new clusters or topics for these new documents; the model will only assign new documents to one of the existing topics based on the patterns it learned during training.
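
A minimal sketch of this setup, assuming a custom HDBSCAN model passed to BERTopic (parameter values and data are illustrative):

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from hdbscan import HDBSCAN

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data[:1000]
new_docs = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes")).data[:100]

# prediction_data=True makes HDBSCAN store what approximate_predict needs,
# which BERTopic's .transform relies on for new documents.
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

# New documents are assigned to existing topics only; no new clusters are created.
new_topics, new_probs = topic_model.transform(new_docs)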
If we need the capability to identify and create new clusters or topics for incoming data, we would need to retrain the model with the new data included. Alternatively, BERTopic provides a merge_models functionality that can combine models and effectively handle new data without retraining the entire model from scratch. Both workflows are outlined below.

Workflow 1: Retraining the BERTopic Model

Initial Training:
Train your BERTopic model on the initial dataset.
Save the model for future use.
Collect New Data:
Gather new documents or data that need to be classified or used to identify new topics.
Retrain Model:
Combine the new data with the initial dataset.
Retrain the BERTopic model on the combined dataset to identify both existing and new topics.
Save the retrained model for future use.
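
A sketch of Workflow 1, again with stand-in data (paths and slice sizes are illustrative):

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data
initial_docs, new_docs = data[:1000], data[1000:1200]

# Initial training
topic_model = BERTopic()
topic_model.fit(initial_docs)
topic_model.save("bertopic_initial")

# Retrain on the combined dataset so both existing and new topics can emerge
combined_docs = initial_docs + new_docs
retrained_model = BERTopic()
topics, probs = retrained_model.fit_transform(combined_docs)
retrained_model.save("bertopic_retrained")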

Workflow 2: Using BERTopic Merge Models Functionality

Initial Training:
Train your BERTopic model on the initial dataset.
Save the model for future use.
Train on New Data:
Train a separate BERTopic model on the new dataset.
Merge Models:
Use the BERTopic.merge_models function to combine the initial model and the new model.
This allows you to integrate the topics from both models, creating a unified topic space.
Use Merged Model:
The merged model can now be used for further predictions or analysis.
Save the merged model for future use.
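
A sketch of Workflow 2 using BERTopic.merge_models (available in bertopic >= 0.16; the min_similarity value is illustrative):

from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data
initial_docs, new_docs = data[:1000], data[1000:2000]

# Two models trained independently on the initial and the new data
model_initial = BERTopic().fit(initial_docs)
model_new = BERTopic().fit(new_docs)

# Topics from model_new that are not similar enough to any topic in
# model_initial are added, yielding a unified topic space without a full retrain.
merged_model = BERTopic.merge_models([model_initial, model_new], min_similarity=0.7)
merged_model.save("bertopic_merged")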
