SCA Data Quality Framework


Maturation Curve

1) Awareness - The team is aware of DQ issues and collects a list of them manually through exploration, project work, or reporting errors raised by users.
2) Reactive (this is where we are) - Automated alerting based on data critical or business critical data quality checks that fire after data is already in production. There is a risk that stakeholders are impacted and the data team is always working behind the eight ball, but the automation helps the team catch and address these issues before stakeholders raise them.
3) Proactive - The data team is able to prevent data critical failures from hitting production, while also alerting stakeholders to anomalies, bugs, and other negative trends through business critical checks.

Types of Checks

Data Critical

These errors should be breaking errors for our data pipeline. Data impacted by them should be proactively prevented from reaching production, as they pose a serious risk to the integrity of our data and reporting. A plain SQL sketch of two of these checks follows the list below.
Completeness
Freshness
Grain integrity
Null handling
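As a rough illustration of what a data critical check looks like in plain Snowflake SQL, the sketch below covers grain integrity and null handling. The table and column names (analytics.sca.payments_daily, payment_date, merchant_token) are hypothetical placeholders rather than our actual schema; in practice these checks run through Anomalo rather than hand-written queries.

```sql
-- Grain-integrity check (hypothetical table and grain): the table should have
-- exactly one row per (payment_date, merchant_token); any rows returned are failures.
SELECT
    payment_date,
    merchant_token,
    COUNT(*) AS row_count
FROM analytics.sca.payments_daily
GROUP BY payment_date, merchant_token
HAVING COUNT(*) > 1;

-- Null-handling check: the grain columns should never be null.
SELECT COUNT(*) AS null_key_rows
FROM analytics.sca.payments_daily
WHERE payment_date IS NULL
   OR merchant_token IS NULL;
```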

Business Critical

These errors are what we think of when we say "exception reporting". They are normally related to oddities in upstream data sources, funky or outdated logic, or unconventional system design and business processes. We should be aware of and warned about these issues, with the end goal of being able to quickly diagnose and triage them. A SQL sketch of two of these checks follows the list below.
Expected values
Column comparisons
Dimension volume
Anomaly detection
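The sketch below illustrates two business critical checks against the same hypothetical schema: an expected-values check and a simple dimension-volume comparison against a trailing average. The status values and the 50% threshold are placeholders chosen only for illustration.

```sql
-- Expected-values check: flag any status values outside the set we expect.
SELECT transaction_status, COUNT(*) AS unexpected_rows
FROM analytics.sca.payments_daily
WHERE transaction_status NOT IN ('COMPLETED', 'REFUNDED', 'VOIDED')
GROUP BY transaction_status;

-- Dimension-volume check: warn when today's row count for a region falls more
-- than 50% below its trailing 7-day average.
WITH daily_counts AS (
    SELECT region, payment_date, COUNT(*) AS rows_loaded
    FROM analytics.sca.payments_daily
    GROUP BY region, payment_date
),
with_trailing AS (
    SELECT region, payment_date, rows_loaded,
           AVG(rows_loaded) OVER (
               PARTITION BY region
               ORDER BY payment_date
               ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
           ) AS trailing_avg
    FROM daily_counts
)
SELECT region, payment_date, rows_loaded, trailing_avg
FROM with_trailing
WHERE payment_date = CURRENT_DATE
  AND rows_loaded < 0.5 * trailing_avg;
```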

DQ Tools

Slack

Notifies our team of Squarewave ETL failures and other data quality failures raised by Anomalo
Allows stakeholders to reach our team when they find DQ issues in our reporting or Snowflake tables

Anomalo

What is Anomalo?
Anomalo is a tool that data teams at Block leverage to run various data quality checks on their tables. It can check daily freshness, expected values, and table granularity, and it supports custom checks that teams define.
SCA Anomalo Configuration
The SCA team currently has all ETLs configured for fundamental daily checks under the label.
These daily checks include the following (a plain SQL illustration of the freshness and volume logic appears after the list):
Data Freshness
Data Volume
Missing Data
Table Anomalies
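Anomalo runs these fundamental checks out of the box; purely to illustrate the underlying logic, the freshness and volume ideas expressed as plain SQL against a hypothetical table might look like the following (the table name and the 1,000-row threshold are placeholders):

```sql
-- Freshness check: flag the table when the newest loaded date is more than one day old.
SELECT MAX(payment_date) AS latest_load_date
FROM analytics.sca.payments_daily
HAVING MAX(payment_date) < DATEADD('day', -1, CURRENT_DATE);

-- Data-volume check: flag when yesterday's load contains suspiciously few rows.
SELECT COUNT(*) AS rows_loaded
FROM analytics.sca.payments_daily
WHERE payment_date = DATEADD('day', -1, CURRENT_DATE)
HAVING COUNT(*) < 1000;  -- threshold is a placeholder, not a tuned value
```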
The next evolution for using Anomalo is to introduce validation rules. Validation rules can cover the following (a sketch of the last style appears after the list):
Checking every value of a column
Checking a relationship between multiple columns
Comparing multiple tables or SQL outputs
Checking column names and data types
Checking that a column defined in custom SQL logic is always true
Checking that a custom SQL query returns no bad data
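As an example of the last style, the query below sketches a "returns no bad data" rule for a hypothetical refund-versus-payment invariant. How the rule gets registered is configured in Anomalo itself; this is only the kind of SQL such a rule would wrap, with placeholder table and column names.

```sql
-- Validation rule in the "custom SQL returns no bad data" style: refunds should
-- never exceed the original payment amount, so any rows returned represent bad data.
SELECT payment_token, payment_amount, refund_amount
FROM analytics.sca.payments_daily
WHERE refund_amount > payment_amount;
```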
The final evolution for using Anomalo is to develop key metrics. Metrics can be defined using a variety of pre-defined aggregates.
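For example, a key metric such as daily gross payment volume per region could be tracked with a pre-defined aggregate (e.g., a sum). The plain SQL below only illustrates the underlying calculation against the same hypothetical table; in Anomalo the metric would be configured rather than hand-written.

```sql
-- A key metric expressed as plain SQL: daily gross payment volume per region
-- over the last 30 days (table and column names are placeholders).
SELECT
    payment_date,
    region,
    SUM(payment_amount) AS gross_payment_volume
FROM analytics.sca.payments_daily
WHERE payment_date >= DATEADD('day', -30, CURRENT_DATE)
GROUP BY payment_date, region
ORDER BY payment_date, region;
```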
For more reading on Anomalo's functionality and possibilities, see their documentation.

JIRA

Any data quality bugs should be raised in JIRA as bug tickets in the SCA team project (Create JIRA ticket).


Next Steps on Our Road to Proactive

Identify the critical, warning, and contextual alerts, warnings, and metrics our team wants to enable
Prioritize the critical and warning level alerts to solidify the foundation of our framework
Build out and test the contextual business alerts and metrics we’d like to create, exercising the full capacity of Anomalo
Explore avenues to prevent bad production data that fails critical tests from reaching stakeholders and reports

