Roadmap
Ground Truth Topics
What call topics can be considered as representations of ground-truth?
During a call, an agent either chooses call topics from a dropdown list or inputs topics manually. Building on these agent-generated topics, we consider two types of topics:
- Refined topics: topics derived from agent-generated call topics and further refined with domain knowledge. These refined topics can be summarized from managers' expectations and domain knowledge, or linked to frequently occurring words, phrases, or sentences commonly found in historical transcripts.
- New topics: topics newly identified as relevant to the content of recent transcripts. These topics can only be extracted from corresponding words, phrases, or sentences in the most recent transcripts.
Evaluating these call topics is important. Previously refined topics and newly identified call topics are used as labels of the target variable for exploratory analysis and topic modeling, and external changes can impact the accuracy of topic assignment. Because call topics are influenced by subjective interpretations of transcripts and domain knowledge, we use validation and visualization techniques for iterative assessment and improvement of the chosen topics. Topics of either type that pass the validation process and demonstrate meaningful connections with transcripts in visualizations are assumed to be representations of ground-truth.
What call topics cannot be considered as representations of ground-truth?
Call topics that are nonsensical, filled with irrelevant noise, or unlikely to provide meaningful insights for the business team. A topic may also not be an accurate representation of ground-truth if most of its transcripts show inconsistent patterns.
What are the relationships between ground-truth topics, refined topics and new topics?
Utilizing ground-truth topics is crucial for evaluating topics generated by machine learning techniques and GPT prompt engineering. The relevant categories are:
- Refined topics: topics that were previously identified and refined by business teams.
- New topics: topics newly identified as meaningful to business teams.
- Representations of the ground-truth topics: refined or new topics that have subsequently been validated through evaluation.
The initial refined topics may not be fully complete; we found that some refined topics are too general and some are very constrained. Identifying and extracting new meaningful topics also requires domain expertise, subjective understanding, and judgment, especially when differentiating actual emergent topics from unexpected noise. Therefore, finalizing refined topics and identifying new topics requires iterative improvement with human involvement (e.g., domain knowledge, proper evaluation metrics, reliable visualizations) to guarantee the quality of the final outcomes.
Our development and experimental focus is mainly on accurately predicting two types of topics: (1) evaluate the quality of refined topics and build ML models with representations of ground-truth to assign topics to input transcripts; (2) among new topics that differ from previous representations of ground-truth, identify topics that represent emerging trends rather than meaningless noise. Specifically:
First, we evaluate the quality of and validate previously refined call topics. With validated topic labels, or representations of ground-truth, we utilize machine learning models to predict topics from input transcripts. Evaluating and validating refined topics ensures that data exploration, text mining, and model building proceed correctly. We also need to be cautious that recently added transcripts may include new emergent topics.
Secondly, we aim to deliver solutions that can generate new topics. Topic evaluation and refinement ensure that new emergent topics are valuable to business teams and not merely noise.
Objectives and Approaches
What are long-term objectives?
In the long run, we aim to develop a system that combines custom rules, embeddings, machine learning, and GPT to automatically generate accurate call topics through iterative improvements. This will involve creating a tailored machine learning model for topic prediction, implementing a validation layer that evaluates the quality of assigned topics together with human knowledge, and fine-tuning GPT to effectively identify both previously validated topics and newly emerging ones.
For short-term objectives, we are focusing on two aspects:
1.1 In the first aspect, we engage in data preprocessing and cleaning to assess previously identified topics using selected evaluation metrics and visualizations. The objective is to establish a validation layer that incorporates chosen visual methods and evaluation metrics for effective human evaluation.
1.2 Following this, we delve into experimenting with text embeddings, dimensionality reduction techniques, and soft-clustering algorithms. The aim is to build a machine learning system that leverages transcript data and previously refined topics to predict a range of topics for each transcript. The final output will enable us to rank topics for each transcript, allowing us to categorize them as primary, secondary, tertiary, and so on (a minimal sketch of this pipeline follows after this list).
2. Upon completing the first aspect, we work on prompt engineering and fine-tuning to ensure GPT produces relatively stable topics. Most GPT-generated results should be similar to previously refined topics.
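As a concrete illustration of item 1.2 above, the sketch below assumes a sentence-transformers encoder, PCA for dimensionality reduction, and a Gaussian mixture model for soft clustering. The model name, sample transcripts, and cluster count are placeholders, and the production pipeline may instead use components such as UMAP and HDBSCAN.

```python
# Minimal sketch: embeddings -> dimensionality reduction -> soft clustering -> ranked topics.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

transcripts = [
    "Caller asked how to withdraw funds from the annuity.",
    "Client requested a partial withdrawal form.",
    "Advisor asked to change the beneficiary on the contract.",
    "Client wants to update the beneficiary designation.",
]

# 1. Text embeddings (assumed encoder)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(transcripts)

# 2. Dimensionality reduction
reduced = PCA(n_components=2, random_state=42).fit_transform(embeddings)

# 3. Soft clustering: each transcript gets a probability for every cluster (topic)
gmm = GaussianMixture(n_components=2, random_state=42).fit(reduced)
memberships = gmm.predict_proba(reduced)

# 4. Rank clusters per transcript: primary, secondary, ...
for text, probs in zip(transcripts, memberships):
    ranking = np.argsort(probs)[::-1]
    print(text[:45], "-> ranked topics:", ranking.tolist())
```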
In summary, exploring various evaluation methods is essential to identify ground-truth topics for a set of transcripts. Machine learning solutions can be customized to provide stable and reliable outputs. GPT introduces the possibility of identifying new topics that reflect emerging trends. By assessing and integrating each component, we aim for more comprehensive and stable outcomes. Through continuous iterations, we plan to further refine the capabilities of both customized machine learning and GPT, aligning with our long-term objectives.
How do we plan to approach the problem?
Build topic modeling solution with agent-generated topics:
After preprocessing and cleaning transcripts, topics and other variables, we work on exploring visualization methods and evaluation metrics to estimate the quality of refined call topics. Our approach is to synthesize qualitative analysis, visualizations, and quantitative evaluation metrics to validate and refine topics for machine learning experiments. This step may incorporate additional variables and data sources, such as raw and simplified GPT-generated topics, to derive deeper insights and enhance visualization capabilities.
Having validated call topic labels, we further delve into the historical transcripts to learn the associations between the validated topics and relevant keywords, phrases, or sentences in the cleaned transcripts. We then experiment with and optimize machine learning models to categorize transcripts into topic clusters. Our objective is to develop a machine learning system capable of assigning multiple topics to each transcript and identifying topics that differ from previously known topics.
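A minimal sketch of this association-learning step, assuming validated topic labels are already attached to cleaned transcripts; the toy labels and the simplified class-based TF-IDF approach are illustrative stand-ins for whichever association method we finalize.

```python
# Concatenate the transcripts of each validated topic and use TF-IDF over these
# "topic documents" to surface the keywords most associated with each topic.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

labeled = [
    ("Caller asked how to withdraw funds from the annuity.", "Withdrawal"),
    ("Client requested a partial withdrawal form.", "Withdrawal"),
    ("Advisor asked to change the beneficiary on the contract.", "Beneficiary Change"),
]

topic_docs = defaultdict(list)
for text, topic in labeled:
    topic_docs[topic].append(text)

topics = list(topic_docs)
corpus = [" ".join(topic_docs[t]) for t in topics]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

for i, topic in enumerate(topics):
    weights = matrix[i].toarray().ravel()
    top_terms = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(topic, "->", top_terms)
```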
Optimize GPT-generated topics:
To ensure GPT produces outputs that are stable, reliable, and of high quality, we implement prompt engineering and fine-tuning of GPT's hyperparameters. A considerable portion of GPT's topics should correspond with our validated refined topics or representations of ground-truth.
Furthermore, GPT's newly generated topics must be clear and distinct to be practical. An output refinement layer can be used to evaluate the quality of GPT-generated results. With this output refinement layer and the evaluation functions developed for topic modeling, we can visualize and assess GPT's output. The refinement layer also helps detect and categorize meaningless and nonsensical patterns, using both qualitative and quantitative metrics. Outputs classified as noise are preserved for enhancing future noise-pattern detection. Conversely, outputs that differ from existing refined topics yet are coherent are considered potential new topics. These emergent topics can be reviewed and then annotated to refine machine learning methods.
High-level Tasks
Steps 1-5
In step 1, we extract and merge datasets for data exploration and visualization. We then build data preprocessing and cleaning functions for the following topic modeling experiments. One important part of this step is exploring visualization methods and evaluation metrics to assess topic-based transcripts. Subtasks include converting text data into different vector representations, building visualizations that aid qualitative analysis, and selecting metrics that measure the coherence and relevance of the topic clusters.
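One candidate evaluation metric for this step is topic coherence. The sketch below, using gensim's CoherenceModel with toy tokenized transcripts and topic word lists, is illustrative only; the actual metric set is still to be selected.

```python
# Minimal sketch: score candidate topic word lists against tokenized transcripts with c_v coherence.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

tokenized_transcripts = [
    ["withdrawal", "annuity", "funds", "request"],
    ["beneficiary", "change", "policy", "form"],
    ["withdrawal", "partial", "annuity", "form"],
]

# Candidate topics, e.g. top words per refined topic or per discovered cluster
candidate_topics = [
    ["withdrawal", "annuity", "funds"],
    ["beneficiary", "change", "policy"],
]

dictionary = Dictionary(tokenized_transcripts)
cm = CoherenceModel(topics=candidate_topics, texts=tokenized_transcripts,
                    dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cm.get_coherence())
```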
In step 2, we first focus on pattern discovery within transcripts that have assigned topics. We extract insights using both visualization and evaluation functions, and then carry out a series of experiments with machine learning solutions (e.g., BERTopic, Top2vec, LDA, Guided LDA, NMF). Our goal is to capture connections between the frequency of specific keywords, phrases, or structural elements within a given cluster or topic of transcripts, as well as to examine their relationship with associated variables such as call-related data and the refined outputs from GPT.
In steps 3 & 4, we focus on improving and refining the performance of ChatGPT and collaborating with the infrastructure team to conduct prompt engineering, fine-tune hyperparameters, and construct an output refinement layer.
In step 5, we analyze and coordinate the strengths and weaknesses of rule-based methods, ML models, and ChatGPT. We will collaborate with other teams to develop an integrated framework that brings together rule-based functions, ML predictive models, input prompts, the output refinement layer, and GPT-generated results. The objective is to effectively combine the unique advantages of each approach, ensuring both stability and optimal performance in topic generation.
Agent Topics Modeling (Step 1 & 2)
Code Review and Data Extraction
This step involves reviewing earlier notebooks and Python code created for Zowie, with the aim to identify, extract, and integrate datasets for subsequent exploration.
Data Preprocessing and Cleaning
This step covers preprocessing and cleaning the merged dataset for topic modeling, refining the dataset for tokenization, and processing transcripts for sentence transformers.
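A minimal sketch of the two cleaning paths mentioned above: an aggressive token-level path for bag-of-words models and a light path that preserves sentence structure for sentence transformers. The filler-word list and regex rules are placeholders for the project's actual cleaning rules.

```python
# Minimal preprocessing sketch, assuming transcripts arrive as raw strings.
import re

FILLERS = {"um", "uh", "like"}  # placeholder filler-word list

def clean_for_tokenization(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop short and filler tokens -> token list for LDA/NMF."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in FILLERS and len(tok) > 2]

def clean_for_sentence_transformer(text: str) -> str:
    """Light cleanup only: keep sentence structure intact for embedding models."""
    return re.sub(r"\s+", " ", text).strip()

raw = "Um, I would like to, uh, request a partial withdrawal   from my annuity."
print(clean_for_tokenization(raw))
print(clean_for_sentence_transformer(raw))
```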
Topics Visualization and Evaluation
This step includes exploring and building visualization methods for evaluating topic-based transcripts. It also involves developing metrics to assess the previous agent topics, refined topics, GPT topics, simplified GPT topics, and other topic modeling results.
Machine Learning Experiments
In this step, two distinct topic models are trained and optimized: BERTopic and Guided LDA. BERTopic utilizes transformers, dimensionality reduction, fuzzy/soft clustering, and topic representations. Guided LDA is based on the Dirichlet distribution and incorporates the use of pre-defined seed words.
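A minimal sketch of the BERTopic experiment under assumed parameters: the sample transcripts, seed word lists, and shrunken UMAP/HDBSCAN settings exist only so the snippet runs on a toy corpus, whereas real runs use hundreds of transcripts and tuned defaults. The Guided LDA experiment is analogous but applies the seed words to a document-term matrix instead of embeddings.

```python
# Minimal BERTopic sketch with soft topic probabilities and optional seeded topics.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

docs = [
    "Caller asked how to withdraw funds from the annuity.",
    "Client requested a partial withdrawal form.",
    "Advisor asked to change the beneficiary on the contract.",
    "Client wants to update the beneficiary designation.",
    "Caller asked about the required minimum distribution amount.",
    "Question about the RMD deadline for this year.",
]

topic_model = BERTopic(
    umap_model=UMAP(n_neighbors=2, n_components=2, random_state=42),   # toy-sized settings
    hdbscan_model=HDBSCAN(min_cluster_size=2, prediction_data=True),
    calculate_probabilities=True,                 # soft assignment: probability per topic
    seed_topic_list=[["withdrawal", "funds"],     # optional guided/seeded topics
                     ["beneficiary", "change"]],
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```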
For additional information, please refer to the details provided in the following link:
GPT Topics Generation (Step 3 & 4)
Step 3. GPT Prompt engineering — refined topics in historical transcripts
3.1 Set up the development environment to test GPT correctly: set up a development environment using the same or a similar model API for valid experimentation with various prompts. Ensure the transcript dataset exclusively contains historically refined topics, excluding any new ones.
3.2 Develop various prompts and fine-tune GPT's hyperparameters: explore a range of prompt inputs and adjust hyperparameter values in GPT to reduce randomness, with the goal of reproducing the refined topics exactly or closely. Note that the input prompts include these previously refined topics for selection purposes.
3.3 Evaluate GPT-generated topics and productize the most effective prompts and hyperparameter settings: utilize similarity metrics, coherence measures, and other measurements to evaluate how accurately GPT-generated topics match the previously assigned refined topics, and continuously optimize this matching accuracy by experimenting with different prompts and adjusting hyperparameter values.
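A minimal sketch of steps 3.1-3.3, assuming the OpenAI Python client (>= 1.0); the model name, prompt wording, and refined-topic list are placeholders rather than the production setup.

```python
# Prompt-engineering sketch: constrain GPT to the refined topic list, reduce randomness,
# then compare its answer with the previously assigned refined label.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REFINED_TOPICS = ["Withdrawal", "Beneficiary Change", "Required Minimum Distribution"]

def predict_topic(transcript: str) -> str:
    prompt = (
        "Choose the single best call topic for the transcript below.\n"
        f"Allowed topics: {', '.join(REFINED_TOPICS)}.\n"
        "Answer with the topic name only.\n\n"
        f"Transcript:\n{transcript}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",      # placeholder; use the approved internal model/API
        temperature=0,            # reduce randomness so repeated runs stay stable
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Step 3.3: compare the GPT output with the previously assigned refined topic
transcript = "Caller wants to take a partial withdrawal from the annuity."
print(predict_topic(transcript), "vs refined label:", "Withdrawal")
```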
Step 4. GPT and output refinement layer — new topics in recent transcripts
4.1 Extract recent transcripts that contain previously refined topics and new topics: as in step 3.1, set up an identical development environment. For a valid evaluation of newly generated topics, however, it is important to use recent transcript data that actually includes new emergent topics, and ideally these new emergent topics can be identified with the business team.
4.2 Develop the output refinement layer to classify outputs as previously refined, new, or noisy topics: build an output refinement layer that uses similarity metrics, coherence measures, and domain knowledge to categorize GPT-generated topics, classifying them as similar to refined topics, exact matches, or distinct topics that could be either noise or new emergent topics.
4.3 Test the output refinement layer and store the optimized rules and thresholds for production: experiment with rules and thresholds for the similarity metrics and coherence measures to more accurately distinguish between noisy topics and actual new emergent topics, and apply text mining techniques to extract patterns in noisy topics. Compared with noisy topics, new topics are expected to be more similar to previously refined topics.
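A minimal sketch of the output refinement layer in steps 4.2-4.3, assuming a sentence-transformers encoder; the thresholds and example topics are placeholders that would be tuned against labeled examples.

```python
# Classify each GPT-generated topic as an exact match, similar to a refined topic,
# a candidate new topic, or noise, using cosine similarity against refined-topic embeddings.
from sentence_transformers import SentenceTransformer, util

REFINED_TOPICS = ["Withdrawal", "Beneficiary Change", "Required Minimum Distribution"]
SIMILAR_THRESHOLD = 0.75   # assumed value; tune in step 4.3
NOISE_THRESHOLD = 0.30     # assumed value; below this, treat as noise

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
refined_emb = model.encode(REFINED_TOPICS, convert_to_tensor=True)

def refine(gpt_topic: str) -> str:
    if gpt_topic in REFINED_TOPICS:
        return "exact match"
    sim = util.cos_sim(model.encode(gpt_topic, convert_to_tensor=True), refined_emb)
    best = float(sim.max())
    if best >= SIMILAR_THRESHOLD:
        return "similar to refined topic"
    if best >= NOISE_THRESHOLD:
        return "candidate new topic (review with business team)"
    return "noise (store for noise-pattern detection)"

for t in ["Partial Withdrawal Request", "RMD", "asdkjh qwe", "Beneficiary Change"]:
    print(t, "->", refine(t))
```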
Step 5. Design an architecture in PROD to integrate rules, the ML model, and GPT outcomes
Prior to outlining the items in step 5, it's important to address the following questions:
- Do the previously refined topics encompass all aspects of the corresponding historical data, and are they valid for building ML models?
- Is GPT capable of consistently generating accurate topics that match or closely resemble the previously refined topics?
- Does GPT consistently generate new topics, with very little noise, that are relevant to business teams?
- Are rule-based functions accurately mapping transcripts to the refined topics?
- Does the optimized ML model accurately map transcripts to the refined topics?
- How do we make sure previously refined topics and unknown new topics are systematically generated through a refinement layer?
- How can we separately measure the matching accuracy of known refined topics and the matching accuracy of unknown new topics?
- How can we optimize matching accuracy by synergistically combining rules, the ML model, and GPT outcomes to complement each other?
(Work in progress for step 5)
Background
Call Center Datasets
pg-zinnia-data-production-v1
call-logger datasets
Call logger databases contain observations collected from the user interface used by agents during calls. There are 20+ carriers, and each carrier has one call logger table that has been synced into BigQuery. These call logger datasets contain client ID, phone number, the agent's typed notes, timestamps, contract, call transcript summary, etc. At this point in time, the transcripts themselves are not yet integrated into the call logger system.
Contract column: for most calls, there is a contract number that helps identify which policy was discussed during the call.
Session_ID is important for joining call_logger datasets with other datasets.
We can join the CallEntries dataset in each call_logger carrier database with Five9_Call_Log in the se2ivr database using Session_ID.
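An illustrative BigQuery version of this join; the exact project, dataset, and table paths (and column casing) are assumptions to be confirmed against the warehouse.

```python
# Join a carrier's CallEntries table with the se2ivr Five9_Call_Log on Session_ID.
from google.cloud import bigquery

client = bigquery.Client(project="pg-zinnia-data-production-v1")

query = """
SELECT ce.SessionID,
       ce.CallSummary,
       f9.AGENT_LAST_NAME
FROM `pg-zinnia-data-production-v1.call_logger_xyz.CallEntries` AS ce  -- carrier table path assumed
LEFT JOIN `pg-zinnia-data-production-v1.se2ivr.Five9_Call_Log` AS f9
  ON ce.SessionID = f9.SESSION_ID
LIMIT 100
"""
df = client.query(query).to_dataframe()
print(df.head())
```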
se2ivr Initial Voice Routing
Databases in Se2ivr can be considered as the metadata for calls, offering more abstract information. This includes details like the caller's name, phone number, the call's origin, and the agent handling the call, among other things.
Once a day, in the morning, the datasets in se2ivr are filled by a report coming out of FIVE9.
The FIVE9-ACD-Detail encompasses the automatic call distribution system, detailing the routing mechanisms and identifying which agent received each call.
The FIVE9-Call-Log serves as a repository for metadata related to calls, which includes information about the caller, their phone number, the origin of the call, and the identity of the agent who handled it.
five9 datasets
Five9 operates as the core servicing platform for the call center, acting as a downstream monitoring source system. It archives audio files of calls and provides a comprehensive summary of the agent's state, detailing the duration an agent spends in each state during a call.
These datasets contain the length of time a call rings, the talk time, and the duration for which a call is on hold, offering a detailed overview of call handling and agent activity. Call service agents, on the other hand, use Call Logger as their primary interface for interacting with Five9.
Zowie MongoDB datasets
To obtain the source data for Zowie, we can execute a data join by matching the 'CTI_CALL_NUMBER' field in Zowie with the 'Session_ID' field in the Call Logger datasets.
Zowie integrates with audio downloaded from the Five9 API. Zowie features two main datasets: an Insights Dataset and a Transcriptions Dataset.
The Transcriptions dataset utilizes diarization to distinguish between the agent and the client during conversations. It also includes other functionality, for example identifying events occurring a few seconds before each call. The transcription process is handled by WhisperX. Zowie is configured to synchronize this data every hour. Additionally, it pulls certain metadata from the se2ivr and call logger datasets, incorporating this information into its metadata column.
Example prompt: “Given the context and input list, what are the top 3 themes can be generated? ”
There are a small number of prompts in the dataset.
Join Production Datasets
pg-zinnia-data-production-v1
call_logger_all_call_entries
- CallTypeID — call topics selected from the fixed list.
- CallerTypeID — who is the caller?
- InitiatedDate/CreatedDate — the date a call started on.
- SessionID — one stream for one call, or for multiple calls if some of them were transferred; can be used as a join key for agent information. A call transfer might be related to a change of call topic, so it is better to assign topics to each call transcript even when they share the same SessionID. Agents assign/choose topics for each record, so we can do topic assignment per record rather than across the session ID, which is consistent with the GPT setup.
- carrier + CallEntryID (composite primary key) — within each carrier, CallEntryID should/may serve as a unique identifier for calls; between carriers, the same CallEntryID might be shared.
- CallSummary — a shorter version of the transcript: a GPT output summary produced with a prompt like "summarize the whole thing in 250 words or less." A BART score can be used to monitor summary performance.
Agent Topics & Role — call_logger_xyz
Join with the relevant call_logger_xyz table for each carrier (keyed by carrier and CallEntryID): left join on CallTypeID in call_logger_xyz to get CallTypes; CallerTypes can also be obtained from the relevant call_logger_xyz table.
Identify call_logger_all_call_entries as the base dataset to join with other tables in pg-zinnia-data-production-v1.
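An illustrative join of the base dataset with carrier-specific call_logger_xyz lookup tables; the dataset paths, the call_logger_xyz placeholder, and the lookup-table layout are assumptions to be verified in BigQuery.

```python
# Base dataset joined with a carrier's call_logger_xyz lookups for CallTypes and CallerTypes.
from google.cloud import bigquery

client = bigquery.Client(project="pg-zinnia-data-production-v1")

query = """
SELECT ce.CallEntryID,
       ce.SessionID,
       ct.CallType,
       cr.CallerType
FROM `pg-zinnia-data-production-v1.call_logger.call_logger_all_call_entries` AS ce  -- dataset path assumed
LEFT JOIN `pg-zinnia-data-production-v1.call_logger_xyz.CallTypes` AS ct
  ON ce.CallTypeID = ct.CallTypeID
LEFT JOIN `pg-zinnia-data-production-v1.call_logger_xyz.CallerTypes` AS cr
  ON ce.CallerTypeID = cr.CallerTypeID
"""
base_df = client.query(query).to_dataframe()
```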
Transcripts — call_transcriptions
- Join call_center_v1.call_transcriptions, which is sourced from Zowie_mongodb.transcriptions; the Zowie_mongodb.transcriptions table stores the actual transcripts.
- call_center_v1.call_transcriptions ctiCallNumber — same as SessionID.
- The call_center_v1.call_transcriptions transcription column contains the transcript text.
- Actual timestamps of the call: startTime, endTime.
- agentId is just the email address.
GPT Topics — call_insights
- Join call_center_v1.call_insights, which is sourced from Zowie_mongodb.
- Call Types: required minimum distribution — RMD, being all caps, can be encrypted and look like noise.
- Join using callId, which is like SessionID and is not unique to a call; callId is the same across multiple transferred calls.
- Cross join — it is hard to identify which insight objects were generated for which part of the call; there is no way to uniquely link them.
- When prompt_key is summary, the result column contains the actual text of the summary.
- Use callId in call_insights to join call_transcriptions; from call_transcriptions, take ctiCallNumber; use ctiCallNumber to join with SessionID in call_logger_all_call_entries; take result/summary from call_insights to join with CallSummary in call_logger_all_call_entries (CallSummary is also from a GPT summary).
- One issue is that callId is similar to SessionID, and neither is unique per call: we have duplicated callId and SessionID values for transferred calls, and there is no second key for call_insights.
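An illustrative version of the join chain described above, from call_insights to call_transcriptions to call_logger_all_call_entries; dataset paths are assumptions, and duplicated callId/SessionID rows for transferred calls still need to be handled downstream.

```python
# Join chain: call_insights (GPT summaries) -> call_transcriptions -> call_logger_all_call_entries.
from google.cloud import bigquery

client = bigquery.Client(project="pg-zinnia-data-production-v1")

query = """
SELECT ci.callId,
       ci.result      AS gpt_summary,
       tr.transcription,
       ce.CallSummary
FROM `pg-zinnia-data-production-v1.call_center_v1.call_insights` AS ci
JOIN `pg-zinnia-data-production-v1.call_center_v1.call_transcriptions` AS tr
  ON ci.callId = tr.callId
JOIN `pg-zinnia-data-production-v1.call_logger.call_logger_all_call_entries` AS ce  -- dataset path assumed
  ON tr.ctiCallNumber = ce.SessionID
WHERE ci.prompt_key = 'summary'
"""
joined_df = client.query(query).to_dataframe()
```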
Caller Role — CallerTypes
Join call_logger_xyz.CallerTypeID with CallerTypeID in call_logger_all_call_entries: left join on CallerTypeID in call_logger_xyz to get CallerTypes.
Call Agents & Policy Advisors
Use call_logger_all_call_entries to join with Se2ivr.FIVE9_Call_Log on SESSION_ID/Session_ID
- Se2ivr.FIVE9_Call_Log contains AGENT_LAST_NAME and DEST_AGENT_NAME.
- Se2ivr.FIVE9_Call_Log and FIVE9_ACD_Detail provide skill, service_level, advisor names, advisor department, etc.
- Policy advisors are the people who sell the policy to customers. Often, advisors call into the call center and are not the internal representatives or agents who take calls; call center agents can receive calls from both clients and policy advisors.
- The process of generating records in FIVE9 is heavily manual; sometimes the call center agents were supposed to generate a new session ID but did not.
Policy Information — Contract Number
The Contract column in call_logger_all_call_entries is the policy number pertaining to the call and is manually typed in by the caller. It is the key for joining other policy-related information.
- Annuity policy — lifecad_prod: T_LIPO_POLICY.PO_POL_NUM can be linked with the contract number.
- Life insurance policy — FAST database: FAST is not in the warehouse yet.
Previous Development Datasets
pg-zinnia-data-development-v1
Simplified GPT Topics
zowie_stop_gap_analytics: previous agent_call_type and newest_agent_call_type
- pg-zinnia-data-development-v1.zowie_stop_gap_analytics.gpt_call_topics
- zowie_stop_gap_analytics.call_types
Refined Agent Topics
Refined topics and previous agent topics: pg-zinnia-data-development-v1.zowie_stop_gap_analytics.call_types
- newest_agent_call_type contains the merged topics and the remaining previous agent topics from the Excel file.
- agent_call_type contains all original topics.