1) Awareness - The team is aware of DQ issues and collects them manually through data exploration, project work, or reporting errors raised by users. Good example is
2) Reactive (this is where we are) - Automated alerts fire from data critical or business critical data quality checks after the data is already in production. Stakeholders may still be impacted and the data team is always playing catch-up, but the automation helps the team catch and address issues before stakeholders report them.
3) Proactive - The data team prevents data critical failures from reaching production, and uses business critical checks to alert stakeholders to anomalies, bugs, and other negative trends.
Types of Checks
Data Critical
These errors should break our data pipeline. Data affected by them should be blocked from reaching production because it poses a serious risk to the integrity of our data and reporting.
Completeness
Freshness
Grain integrity
Null handling
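As a rough illustration of how the four data critical checks above could gate a load, here is a minimal sketch in plain Python. It is not how Anomalo or our ETLs implement them. The column names (order_id, amount, loaded_at), the thresholds, and every function name are hypothetical and chosen only for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def check_completeness(rows, expected_min_rows):
    """Pass if the load delivered at least the expected number of rows."""
    return len(rows) >= expected_min_rows


def check_freshness(rows, ts_column, now, max_age):
    """Pass if the newest timestamp is within max_age of now."""
    if not rows:
        return False
    newest = max(row[ts_column] for row in rows)
    return now - newest <= max_age


def check_grain(rows, key_columns):
    """Pass if no combination of the key columns appears more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return all(n == 1 for n in counts.values())


def check_not_null(rows, required_columns):
    """Pass if every required column is populated on every row."""
    return all(row.get(c) is not None for row in rows for c in required_columns)


def run_data_critical_checks(rows, now):
    """Return the names of failed checks; an empty list means safe to publish."""
    results = {
        "completeness": check_completeness(rows, expected_min_rows=2),
        "freshness": check_freshness(rows, "loaded_at", now, timedelta(days=1)),
        "grain_integrity": check_grain(rows, ["order_id"]),
        "null_handling": check_not_null(rows, ["order_id", "amount"]),
    }
    return [name for name, passed in results.items() if not passed]


# Hypothetical sample data: the third row duplicates order_id 2 and has a null amount.
now = datetime(2024, 1, 2, 12, 0)
good_rows = [
    {"order_id": 1, "amount": 10.0, "loaded_at": datetime(2024, 1, 2, 6, 0)},
    {"order_id": 2, "amount": 5.5, "loaded_at": datetime(2024, 1, 2, 6, 0)},
]
bad_rows = good_rows + [
    {"order_id": 2, "amount": None, "loaded_at": datetime(2024, 1, 2, 6, 0)},
]

failures = run_data_critical_checks(bad_rows, now)
if failures:
    print("Blocking publish, failed checks:", failures)
```

In a proactive setup, a non-empty failure list would stop the pipeline before the table is published. In a reactive setup, the same checks run after the fact and only raise an alert.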
Business Critical
These errors are what we think of when we say "exception reporting". They are normally related to oddities from upstream data sources, funky/outdated logic, or unconventional system design/business processes. We should be warned of these issues so that we can quickly diagnose and triage them.
allows stakeholders to reach our team in the event that they find DQ issues in our reporting or Snowflake tables
Anomalo
What is Anomalo?
Anomalo is a tool that data teams at Block use to run data quality checks on their tables. Anomalo can check daily freshness, expected values, and table granularity, and it also supports custom checks.
SCA Anomalo Configuration
The SCA team currently has all ETLs configured for fundamental daily checks under the
The next evolution for using Anomalo is to introduce validation rules. Validation rules can cover:
Check every value of a column
Check a relationship between multiple columns
Compare multiple tables or SQL outputs
Check column names and data types
Check that a column defined in custom SQL logic is always true
Check that a custom SQL query returns no bad data
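To make the "custom SQL query returns no bad data" pattern concrete, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for the warehouse. It does not use Anomalo's API. The orders table, its columns, and the two rules are hypothetical. The idea is that each rule is a query written to select only invalid rows, so an empty result means the rule passes.

```python
import sqlite3


def find_bad_rows(conn, bad_data_sql):
    """Run a query written to select only invalid rows; an empty result passes."""
    return conn.execute(bad_data_sql).fetchall()


# Hypothetical table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, gross REAL, discount REAL, net REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 100.0, 10.0, 90.0),
        (2, 50.0, 0.0, 50.0),
        (3, 20.0, 5.0, 20.0),  # net should be 15.0, so this row violates the rule
    ],
)

# Relationship between multiple columns: net should equal gross minus discount.
RELATIONSHIP_RULE = """
    SELECT order_id FROM orders
    WHERE ABS(net - (gross - discount)) > 0.005
"""

# Every value of a column: discount must never be negative.
COLUMN_RULE = "SELECT order_id FROM orders WHERE discount < 0"

relationship_violations = find_bad_rows(conn, RELATIONSHIP_RULE)
column_violations = find_bad_rows(conn, COLUMN_RULE)

print("relationship rule violations:", relationship_violations)
print("column rule violations:", column_violations)
```

Writing each rule as a query for bad rows, rather than a query that returns true or false, means a failure also returns the offending keys, which shortens triage.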
The final evolution for using Anomalo is to develop key metrics. Metrics can be defined using a variety of pre-defined aggregates such as: (Photo from this