Build one MVP operational process to generate Level 1 and 2 categories for new clients
Support specifying a predefined set of topic representations that are expected to appear in the documents.
Embedding Methods
Small Efficient Sentence Transformers
For sentence transformers, the maximum sequence length usually ranges from 128 to 512 tokens, while transcripts are relatively long texts. To embed long texts using models like all-MiniLM-L6-v2, we need to handle this length constraint, as these models typically have a maximum token limit (often around 512 tokens) and truncate anything beyond it. Several strategies are possible:
Truncation: Truncate the text to the model's maximum token limit (e.g., 512 tokens). This is the simplest approach but may result in loss of important information.
Sliding Window: Move a window of the maximum token length across the text with some overlap, embed each window separately, and then aggregate the embeddings (e.g., by averaging); see the sketch after this list.
Chunking: Alternatively, split the text into non-overlapping chunks of the maximum token length and embed each chunk separately.
The per-window or per-chunk embeddings can then be aggregated in several ways:
Average Pooling: Take the average of all chunk embeddings.
Max Pooling: Take the maximum value in each dimension across all chunk embeddings.
Concatenation: Concatenate the embeddings, though this increases the dimensionality significantly.
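As a rough illustration of the sliding-window strategy with average pooling, here is a minimal sketch assuming the sentence-transformers package; the window size, overlap, and choice of mean pooling are illustrative, and the helper name embed_long_text is ours, not part of the library.

```python
# Sketch: embed a long transcript by sliding a token window across it and mean-pooling
# the per-window embeddings. Window/overlap sizes are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = model.tokenizer  # the underlying Hugging Face tokenizer

def embed_long_text(text: str, window: int = 256, overlap: int = 32) -> np.ndarray:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    # Slide a window over the token sequence and decode each window back to text.
    chunks, start = [], 0
    while start < len(token_ids):
        chunks.append(tokenizer.decode(token_ids[start:start + window]))
        start += window - overlap
    # Embed every window and aggregate by averaging; max pooling would use
    # chunk_embeddings.max(axis=0) instead.
    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)

transcript_embedding = embed_long_text("... long call transcript ...")
```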
[Figure: word and token count distributions across 500K+ call-logger transcripts from a 5-month period.]
[Figure: distribution of token counts in one small sample.]
Dimensionality Reduction
Why do we need dimensionality reduction? High-dimensional embeddings can be troublesome for many clustering techniques: clusters become more diffuse and less distinguishable, making it difficult to accurately identify and separate them.
This is the curse of dimensionality, a phenomenon that occurs when dealing with high-dimensional data: the number of possible value combinations grows exponentially with each added dimension, examining all subspaces becomes increasingly complex, and the concept of distance between points becomes less and less precise.
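As a toy illustration of this effect (not from the original analysis), the snippet below compares the relative gap between the nearest and farthest random point as dimensionality grows; 384 is chosen only because it matches the all-MiniLM-L6-v2 embedding size.

```python
# Toy illustration of distance concentration: as dimensionality grows, the contrast
# between the nearest and farthest point shrinks, so distances become less informative.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 32, 384):
    points = rng.random((1000, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:4d}  relative contrast={contrast:.2f}")
```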
Two well-known methods are Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP; McInnes et al., 2018).
Dimensionality reduction won't perfectly capture high-dimensional data in a lower-dimensional representation; some information is always lost. There is a trade-off between reducing dimensionality and keeping as much information as possible.
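A minimal sketch of this reduction step, assuming the umap-learn package; the random embeddings stand in for real transcript embeddings, and n_neighbors, n_components, and min_dist are illustrative values rather than tuned settings.

```python
# Sketch: reduce 384-dim sentence embeddings to a few dimensions before clustering.
import numpy as np
import umap

# Stand-in for real transcript embeddings (e.g., the output of a sentence transformer).
embeddings = np.random.default_rng(0).random((1000, 384)).astype("float32")

reducer = umap.UMAP(
    n_neighbors=15,    # neighbourhood size; larger values preserve more global structure
    n_components=5,    # target dimensionality handed to the clustering step
    min_dist=0.0,      # pack points tightly, which tends to help density-based clustering
    metric="cosine",   # cosine distance suits sentence embeddings
    random_state=42,   # fix the seed for reproducible layouts
)
reduced_embeddings = reducer.fit_transform(embeddings)  # shape: (1000, 5)
```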
Adapting Models and Methodologies Based on Data and Labels
No Topic Labels:
Approach: Utilize unsupervised learning to generate initial topics. However, human intervention is essential to refine and evaluate the final topic names.
Mix of L1 and L2 Topics:
Approach: Start by clarifying topic relationships and clustering records based on topic categories. For highly specific topics, use supervised methods or class-based TF-IDF (c-TF-IDF) to predict them directly. For general topics that require further specificity, use filter-based clustering to build L2 topics on top of L1.
Partial L1 Topic Labels:
Approach: Implement semi-supervised learning to identify both existing and new L1 topics. When only some topics are pre-identified and new potential topics are still needed, semi-supervised BERTopic can be used to discover them (see the sketch after this list).
Predefined Topics:
Approach: If current topics are satisfactory and there is no need for new topics, treat the task as a supervised topic modeling problem. This approach focuses only on generating predefined topics, avoiding the time and effort required to evaluate new categories.
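For the partial-label scenario above, here is a minimal semi-supervised BERTopic sketch. It uses the 20 newsgroups corpus purely as a stand-in for transcripts so the example runs end to end; in BERTopic's semi-supervised mode, documents labeled -1 are free to form new topics while labeled documents steer the existing ones.

```python
# Sketch: semi-supervised BERTopic for the "partial L1 labels" scenario.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# Stand-in corpus so the sketch runs end to end; real usage would load call transcripts.
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
docs = data.data
labels = np.array(data.target)

# Simulate "partial L1 labels": keep labels for roughly half the documents, mark the rest unknown (-1).
rng = np.random.default_rng(42)
mask = rng.random(len(docs)) < 0.5
partial_labels = np.where(mask, labels, -1)

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs, y=partial_labels)
print(topic_model.get_topic_info().head())
```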
Deployment Workflow
Productizing an unsupervised HDBSCAN-based pipeline presents several challenges and requires thorough discussion to ensure successful deployment. A key consideration is whether we expect the model to generate new clusters over time. It is important to note that HDBSCAN assigns topic_id values with some inherent randomness, and the default topic naming convention relies on c-TF-IDF keyword extraction.
In BERTopic, the prediction_data parameter of the underlying HDBSCAN model controls whether the clusterer saves the data needed to make predictions on new documents after the initial training. The implications of setting prediction_data to True or False are as follows:
prediction_data = True:
The model saves the data necessary to make predictions on new documents after the initial training. This is useful when we have a dynamic dataset and need to classify new incoming data into the topics learned during training.
prediction_data = False:
No prediction data is saved, so new documents cannot be classified into the existing topics unless the model is retrained.
Setting prediction_data = True in BERTopic allows the model to classify new documents into the existing topics identified during the initial training. However, it does not enable the creation of new clusters or topics for these new documents; the model only assigns them to one of the existing topics based on the patterns learned during training.
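A minimal sketch of this setup, assuming the hdbscan and bertopic packages; docs and new_docs are placeholders for the training and incoming transcripts, and the clusterer settings are illustrative.

```python
# Sketch: fit HDBSCAN with prediction_data=True so BERTopic can assign new documents
# to the existing topics later via transform(); no new topics are created at that stage.
from hdbscan import HDBSCAN
from bertopic import BERTopic

hdbscan_model = HDBSCAN(
    min_cluster_size=50,            # illustrative; tune on real transcript volumes
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,           # saves the data needed to score new documents later
)

topic_model = BERTopic(hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)          # docs: initial training transcripts

# Later, map incoming transcripts onto the existing topic space.
new_topics, new_probs = topic_model.transform(new_docs)  # new_docs: newly collected transcripts
```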
If we need the capability to identify and create new clusters or topics for incoming data, the model has to incorporate that data. Two workflows are summarized below.
BERTopic provides a merge-models functionality (BERTopic.merge_models) that can be used to combine models and effectively handle new data without retraining the entire model from scratch.
Workflow 1: Retraining the BERTopic Model
Initial Training:
Train your BERTopic model on the initial dataset.
Save the model for future use.
Collect New Data:
Gather new documents or data that need to be classified or used to identify new topics.
Retrain Model:
Combine the new data with the initial dataset.
Retrain the BERTopic model on the combined dataset to identify both existing and new topics.
Save the retrained model for future use.
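A minimal sketch of Workflow 1, assuming BERTopic's fit and save APIs; initial_docs and new_docs are placeholders for the stored training transcripts and the newly collected ones.

```python
# Sketch of Workflow 1: retrain on the combined corpus so both existing and new topics can emerge.
from bertopic import BERTopic

# Initial training
topic_model = BERTopic()
topic_model.fit(initial_docs)            # initial_docs: placeholder for the first training set
topic_model.save("topic_model_v1")

# Later: combine old and new data and retrain from scratch
combined_docs = initial_docs + new_docs  # new_docs: placeholder for newly collected transcripts
topic_model_v2 = BERTopic()
topics, probs = topic_model_v2.fit_transform(combined_docs)
topic_model_v2.save("topic_model_v2")
```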
Workflow 2: Using BERTopic Merge Models Functionality
Initial Training:
Train your BERTopic model on the initial dataset.
Save the model for future use.
Train on New Data:
Train a separate BERTopic model on the new dataset.
Merge Models:
Use the BERTopic.merge_models function to combine the initial model and the new model.
This allows you to integrate the topics from both models, creating a unified topic space.
Use Merged Model:
The merged model can now be used for further predictions or analysis.
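A minimal sketch of Workflow 2 using BERTopic.merge_models (available in recent BERTopic versions); initial_docs and new_docs are placeholders, and the min_similarity value shown is illustrative.

```python
# Sketch of Workflow 2: train a second model on the new data only, then merge the two models.
# Topics from the new model that do not resemble an existing topic are added to the merged model.
from bertopic import BERTopic

model_initial = BERTopic().fit(initial_docs)   # model from the initial training run
model_new = BERTopic().fit(new_docs)           # model trained only on the new transcripts

# min_similarity controls how similar a new topic must be to an existing one
# before the two are treated as the same topic.
merged_model = BERTopic.merge_models([model_initial, model_new], min_similarity=0.7)

print(merged_model.get_topic_info())           # unified topic space from both models
merged_model.save("topic_model_merged")
```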