
Composite Scoring Engine for Agent Leaderboard


TL;DR A backend scoring engine that ingests call data, computes normalized per-metric and composite scores for agents, and outputs reliable, fair, and explainable rankings. The system enforces cohort fairness and anti-gaming safeguards and supports time-windowed recomputation, serving as the foundation for any leaderboard or analytics UI. No frontend or leaderboard UI is included in this scope.

Goals

Business Goals Overview: The following goals guide development and define the success criteria for the Composite Scoring Engine.

Deliver composite agent scores that are accurate, reliable, and updated in a timely manner, ensuring they can be confidently used in leaderboards and analytics.
Ensure fairness across agent cohorts (e.g., campaign, language, call type) to prevent bias and support equitable performance measurement.
Provide explainable, auditable scoring outputs to support coaching, compliance, and trust.
Minimize opportunities for gaming or manipulation of the scoring system, such as cherry-picking calls or artificially inflating metrics.
Maintain high operational reliability and timely recomputation for all supported time windows.

User Goals Overview: The following goals reflect the needs of Team Leaders, Agents, and Operations users who interact with or are impacted by the scoring outputs.

Enable Team Leaders and Operations to fairly compare agent performance, supporting effective coaching, recognition, and operational decision-making.
Allow agents to understand how their performance is measured and what drives their score.
Support data-driven coaching and recognition based on transparent, multi-metric evaluation.
Ensure agents are not penalized or advantaged due to factors outside their control (e.g., call mix, language).

Non-Goals Overview: The following items are explicitly out of scope for this project to maintain focus and clarity.

No leaderboard UI, wallboard, or agent-facing dashboard.
No coaching workflow, feedback, or notification features.
No direct integration with external CRM or HR systems (beyond data ingestion/export).
No real-time streaming or sub-minute recomputation.

User Stories

As a Data Engineer, I want to ingest call data and compute agent scores so that downstream systems receive consistent, reliable performance metrics.
As a Team Leader, I want to understand the components of each agent’s score, so that I can coach effectively and address fairness concerns.
As a Compliance Analyst, I want to audit how scores are calculated, so that I can verify no agent is unfairly advantaged or penalized.
As a Product Owner, I want to tune metric weights and cohort definitions, so that the scoring system aligns with evolving business goals.

Functional Requirements Overview: The following requirements define the critical system capabilities needed to deliver the goals outlined above.

Data Ingestion & Preprocessing (Priority: High)
Ingest call records from the primary call table.
Join with customer and provider type tables as needed.
Filter out ineligible calls (e.g., voicemails, wrong numbers, deleted).
Cohorting & Eligibility (Priority: High)
Assign each call and agent to a cohort based on campaign, direction, language, call type, and provider type.
Enforce minimum sample thresholds for inclusion.
Metric Computation (Priority: High)
Calculate per-agent, per-window metrics: conversion rate, objection handling, follow-up, responsiveness, talk time, revenue proxy, rapport, skill, etc.
Normalize metrics within cohort and time window.
Reliability & Shrinkage (Priority: High)
Apply reliability weighting to shrink scores toward cohort mean for low-sample agents.
Composite Score Calculation (Priority: High)
Compute a weighted sum of normalized metrics (see the scoring sketch after this list).
Apply default weights, tie-breakers, and caps.
Enforce minimum denominator gates.
Anti-Gaming & Fairness (Priority: Medium)
Exclude short calls, cap repeated dials, penalize cherry-picking, split revenue attribution, and freeze cohort membership per window (see the eligibility sketch after this list).
Time Windowing & Recompute (Priority: High)
Support daily, weekly, monthly, and quarterly windows.
Schedule recomputation and handle late-arriving data.
Service Interfaces (Priority: High)
Expose API/data contracts for per-agent, per-window scores and metrics.
Support batch export to data warehouse.
Explainability & Audit Logging (Priority: Medium)
Store per-agent, per-window artifacts: metric contributions, normalization baselines, weights, and exclusions.
Monitoring & Calibration (Priority: Medium)
Implement data quality checks, drift detection, job health monitoring, and calibration routines.
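
The call-level eligibility and anti-gaming rules above could take roughly the following shape. This is a minimal sketch only: the disposition labels, the 15-second short-call cutoff, and the repeat-dial cap are illustrative assumptions, and the exact cutoffs remain open questions (see below).

```python
import pandas as pd

INELIGIBLE_DISPOSITIONS = {"voicemail", "wrong_number", "deleted"}  # assumed disposition labels
SHORT_CALL_SECONDS = 15     # assumed cutoff; the exact value is an open question
MAX_DIALS_PER_CONTACT = 3   # assumed daily cap on repeated dials to the same contact


def eligible_calls(calls: pd.DataFrame) -> pd.DataFrame:
    """calls: one row per call with disposition, talk_seconds, agent_id,
    contact_id, call_date, and call_start columns (assumed schema)."""
    df = calls.copy()

    # Drop dispositions that should never count toward scoring.
    df = df[~df["disposition"].isin(INELIGIBLE_DISPOSITIONS)]

    # Exclude short calls so rapid hang-ups cannot inflate volume metrics.
    df = df[df["talk_seconds"] >= SHORT_CALL_SECONDS]

    # Cap repeated dials: keep only the first N calls per agent/contact/day.
    df = df.sort_values("call_start")
    df["dial_rank"] = df.groupby(["agent_id", "contact_id", "call_date"]).cumcount() + 1
    df = df[df["dial_rank"] <= MAX_DIALS_PER_CONTACT]

    return df.drop(columns="dial_rank")
```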
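The normalization, shrinkage, and composite steps could be sketched as a batch job along these lines, assuming one row per agent per window in pandas. The metric list, weights, shrinkage constant K, and minimum-call gate are placeholders, not the final configuration.

```python
import pandas as pd

METRICS = ["conversion_rate", "objection_handling", "follow_up_rate"]  # assumed subset of metrics
WEIGHTS = {"conversion_rate": 0.5, "objection_handling": 0.3, "follow_up_rate": 0.2}  # placeholder weights
K = 30          # assumed shrinkage constant: calls needed for ~50% weight on the agent's own data
MIN_CALLS = 20  # assumed minimum denominator gate


def score_window(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per agent per window with raw metric values,
    an n_calls sample size, and a cohort label (assumed schema)."""
    out = df.copy()

    for m in METRICS:
        # 1. Winsorize within cohort to limit outlier influence.
        lo = out.groupby("cohort")[m].transform(lambda s: s.quantile(0.01))
        hi = out.groupby("cohort")[m].transform(lambda s: s.quantile(0.99))
        clipped = out[m].clip(lower=lo, upper=hi)

        # 2. Z-score normalize within cohort and window.
        mean = clipped.groupby(out["cohort"]).transform("mean")
        std = clipped.groupby(out["cohort"]).transform("std").replace(0, 1.0).fillna(1.0)
        z = (clipped - mean) / std

        # 3. Reliability shrinkage: pull low-sample agents toward the cohort mean (z = 0).
        reliability = out["n_calls"] / (out["n_calls"] + K)
        out[m + "_norm"] = z * reliability

    # 4. Composite score: weighted sum of normalized metrics, capped to a fixed range.
    out["composite"] = sum(WEIGHTS[m] * out[m + "_norm"] for m in METRICS)
    out["composite"] = out["composite"].clip(-3, 3)

    # 5. Minimum denominator gate: agents below the sample threshold are not ranked.
    out["eligible"] = out["n_calls"] >= MIN_CALLS
    return out
```

The shrinkage factor n / (n + K) pulls low-sample agents toward their cohort mean, so a handful of lucky calls cannot vault an agent up the rankings.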

User Experience Overview: Although there is no direct user interface, the following outlines how system users and integrators interact with the backend service.

Entry Point & First-Time User Experience
Not applicable: This is a backend service with no direct user interface.
Data engineers and system integrators access the service via API or data exports.
Core Experience
Step 1: System ingests new call data from the database.
Validates data integrity and checks for nulls and out-of-range values.
Excludes ineligible calls (voicemail, wrong number, deleted).
Step 2: Assigns each call and agent to a cohort based on defined dimensions.
Ensures minimum sample size and talk time for eligibility.
Step 3: Computes per-agent, per-metric aggregates for the time window.
Applies normalization (z-score, percentile, or rank-based).
Handles winsorization and outlier trimming.
Step 4: Applies reliability shrinkage for low-sample agents.
Shrinks scores toward cohort mean based on sample size.
Step 5: Calculates composite score using weighted sum of normalized metrics.
Applies default weights, tie-breakers, and caps.
Step 6: Stores and exposes results via API and batch export.
Includes explainability artifacts and audit logs.
Step 7: Monitors job health, data drift, and calibration metrics.
Alerts on anomalies or failures.
Advanced Features & Edge Cases
Handles late-arriving data and supports idempotent recomputation (see the recompute sketch at the end of this section).
Freezes cohort membership per window to prevent cohort shopping.
Penalizes or excludes agents with excessive short calls or repeated dials.
UI/UX Highlights
Not applicable: No user interface in scope.
API responses are structured, well-documented, and versioned for downstream consumers.
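
As noted in the edge cases above, window recomputation should be idempotent so that late-arriving or corrected data can simply trigger a rerun. One possible shape is sketched below; the orchestration helpers (load_calls, freeze_cohorts, aggregate_metrics, delete_scores, insert_scores) are hypothetical placeholders, while eligible_calls and score_window refer to the sketches in the functional requirements above.

```python
from datetime import date


def recompute_window(window_start: date, window_end: date, window_type: str = "weekly") -> None:
    # Re-read every call in the window, so late-arriving or corrected rows are picked up.
    calls = load_calls(window_start, window_end)
    calls = eligible_calls(calls)                      # eligibility / anti-gaming filters
    cohorts = freeze_cohorts(calls, window_start)      # cohort membership frozen per window
    agent_metrics = aggregate_metrics(calls, cohorts)
    scores = score_window(agent_metrics)

    # Delete-then-insert (or upsert) keyed by (agent_id, window_start, window_type)
    # makes the job safe to rerun any number of times with identical results.
    delete_scores(window_start, window_type)
    insert_scores(scores, window_start, window_type)
```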

Narrative Overview: The following scenario illustrates how the Composite Scoring Engine addresses key problems in a typical call center environment.

In a high-volume call center, Team Leaders need to compare agent performance fairly and accurately, but raw metrics are often skewed by call mix, language, or campaign. Agents may try to game the system by cherry-picking easy calls or making repeated short dials. The Composite Scoring Engine ingests detailed call data, assigns each agent to a fair cohort, and computes a normalized, multi-metric score that reflects true performance. By applying reliability weighting, anti-gaming rules, and transparent normalization, the engine ensures that every agent’s score is both fair and explainable. Team Leaders can trust the results for coaching and recognition, while agents know that their efforts are measured on a level playing field. The business benefits from improved morale, reduced gaming, and a direct link between agent behavior and outcomes.

Success Metrics Overview: The following metrics will be tracked to assess the effectiveness and impact of the scoring engine.

Score Stability: < 5% week-over-week variance for agents with stable call volumes.
Predictive Lift: Composite score correlates with downstream conversions/revenue at least 20% better than legacy metrics.
Fairness: < 2% score bias across major cohort dimensions (e.g., language, direction); a possible check is sketched after this list.
Gaming Reduction: 50% decrease in short-call or repeated-dial incidents post-launch.
Coverage: > 90% of active agents receive an eligible composite score each window.
Operational Reliability: > 99% on-time recompute success rate.
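
The fairness and stability targets could be verified with checks along the following lines. The bias definition, column names, and relative-change formulation are assumptions; only the 2% and 5% thresholds come from this section.

```python
import pandas as pd


def cohort_bias(scores: pd.DataFrame, dimension: str) -> float:
    """Max deviation of any group's mean composite from the overall mean,
    as a fraction of the observed composite range (one possible bias definition)."""
    overall = scores["composite"].mean()
    group_means = scores.groupby(dimension)["composite"].mean()
    score_range = scores["composite"].max() - scores["composite"].min()
    return float((group_means - overall).abs().max() / score_range)


def week_over_week_change(prev: pd.DataFrame, curr: pd.DataFrame) -> pd.Series:
    """Relative change in composite per agent between consecutive weekly windows."""
    merged = prev.merge(curr, on="agent_id", suffixes=("_prev", "_curr"))
    return (merged["composite_curr"] - merged["composite_prev"]).abs() / merged["composite_prev"].abs()


# Example acceptance checks against the targets above (hypothetical variable names):
# assert cohort_bias(scores, "language") < 0.02
# assert week_over_week_change(last_week, this_week).median() < 0.05
```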

User-Centric Metrics

% of Team Leaders who report trust in the scoring system (via survey).
% of agents with access to explainability artifacts (for audit).

Business Metrics

Increase in conversion rates or revenue attributed to improved scoring.
Reduction in manual score adjustments or disputes.

Technical Metrics

API uptime > 99.9%.
Data processing error rate < 0.1%.

Tracking Plan

Number of agents scored per window.
Number of calls excluded and reasons.
API request/response counts and latencies.
Calibration drift and data quality alerts.
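
The drift and data-quality items in this tracking plan might translate into checks such as the following; thresholds and column names are illustrative assumptions.

```python
import pandas as pd


def data_quality_report(calls: pd.DataFrame) -> dict:
    """Basic checks on a window's call data before scoring (assumed columns)."""
    return {
        "null_agent_ids": int(calls["agent_id"].isna().sum()),
        "negative_talk_seconds": int((calls["talk_seconds"] < 0).sum()),
        "rows": int(len(calls)),
    }


def metric_drift(current: pd.Series, baseline: pd.Series, threshold: float = 0.25) -> bool:
    """Flag drift if the current window's mean moves more than `threshold`
    (as a fraction) away from the rolling baseline mean."""
    base = baseline.mean()
    if base == 0:
        return False
    return abs(current.mean() - base) / abs(base) > threshold
```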

Technical Considerations Overview: The following points highlight key technical factors, risks, and requirements that must be addressed during implementation.

Technical Needs

Batch data processing pipeline for metric computation and normalization.
Cohort assignment logic and eligibility enforcement.
Composite score calculation module with configurable weights.
API endpoints for score retrieval and batch export (see the record sketch below).
Audit log and explainability artifact storage.
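
One possible shape for the per-agent, per-window record exposed by the API and batch export is sketched below; field names and types are assumptions meant to illustrate the contract, not a finalized schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentWindowScore:
    agent_id: str
    window_start: str                   # ISO date, e.g. "2024-01-01"
    window_type: str                    # "daily" | "weekly" | "monthly" | "quarterly"
    cohort: str                         # cohort label frozen for this window
    composite: float                    # final composite score
    eligible: bool                      # False if the agent failed minimum-sample gates
    n_calls: int
    # Explainability artifacts: weighted contribution of each normalized metric.
    metric_contributions: Dict[str, float] = field(default_factory=dict)
    exclusions: List[str] = field(default_factory=list)    # e.g. ["short_calls_excluded: 12"]
    weights_version: str = "v1"                             # which weight configuration produced the score
```

A score-retrieval endpoint or warehouse export would then return a list of such records, filterable by window and cohort and paginated for large result sets.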

Integration Points

Primary call data table (with all relevant fields).
Customers and provider_types tables for joins.
Downstream leaderboard or analytics systems (API/data export consumers).
Data warehouse for batch exports.

Data Storage & Privacy

All PII (from_phone, to_phone, customer_id) encrypted at rest and in transit.
Access controls by role (e.g., only authorized users can access per-agent scores).
Audit trails for all data access and score computation.
Compliance with SOC 2 and HIPAA as required.

Scalability & Performance

Must support recomputation for 10,000+ agents and millions of calls per window.
Batch jobs complete within 1 hour of window close.
API supports pagination and filtering for large result sets.

Potential Challenges

Ensuring fairness and reliability with highly imbalanced cohorts.
Handling late-arriving or corrected call data.
Preventing and detecting new forms of gaming as agents adapt.
Maintaining explainability as metric definitions evolve.

Milestones & Sequencing Overview: This section outlines the major phases, deliverables, and dependencies for the project timeline.

Project Estimate

Medium: 2–4 weeks

Team Size & Composition

Small Team: 1–2 total people (Product/Data Engineer, with part-time QA support)

Suggested Phases

Phase 1: MVP Metrics, Normalization, Composite, API (1–2 weeks)
Key Deliverables: Data ingestion, cohorting, metric computation, normalization, composite score calculation, API for per-agent scores.
Dependencies: Access to call data tables, initial metric definitions.
Phase 2: Anti-Gaming and Shrinkage (0.5–1 week)
Key Deliverables: Short-call exclusion, repeated dial caps, reliability shrinkage, cohort freezing.
Dependencies: Phase 1 completion.
Phase 3: Explainability and Monitoring (0.5–1 week)
Key Deliverables: Per-agent explainability artifacts, audit logs, monitoring and calibration routines.
Dependencies: Phase 2 completion.

Open Questions & Decisions Needed Overview: The following unresolved topics require input or decisions from stakeholders to ensure project success.

What is the default short-call cutoff (e.g., 12s, 15s)?
What are the exact default weights for each metric in the composite score?
How should revenue attribution be handled for multi-touch calls?
Are there additional cohort dimensions (e.g., campaign, region) to include?
Is CSAT or a proxy available for inclusion as a metric?
What override permissions should Team Leaders have for weights or cohort definitions?
What is the retention policy for audit logs and explainability artifacts?
