Dagster Deep Dive on Data Quality
August 6, 2024
Dimensions of data quality:
- Timeliness
  - data is ready within a certain time frame
- Validity
  - data values conform to an accepted format
- Completeness
  - data is fully populated in attributes and records
- Consistency
  - data is aligned across systems and sources
- Accuracy
  - data values are aligned with a source of truth
- Uniqueness
  - data is free of duplicate values
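As a rough illustration of the six dimensions above, here is a minimal sketch in plain Python. It is not taken from the talk: the sample records, field names (`id`, `email`, `amount`, `loaded_at`), the deadline, and the "other system" ID set are all hypothetical, and each function simply returns the rows that fail one dimension.

```python
import re
from datetime import datetime

# Hypothetical sample data; all field names and values are assumptions.
records = [
    {"id": 1, "email": "a@example.com", "amount": 10.0,
     "loaded_at": datetime(2024, 8, 6, 1, 0)},
    {"id": 2, "email": "not-an-email", "amount": None,
     "loaded_at": datetime(2024, 8, 6, 1, 5)},
    {"id": 2, "email": "b@example.com", "amount": 7.5,
     "loaded_at": datetime(2024, 8, 6, 9, 0)},
]
source_of_truth = {1: 10.0, 2: 7.5}   # hypothetical reference amounts by id
other_system_ids = {1}                # hypothetical ids known to another system
deadline = datetime(2024, 8, 6, 6, 0)  # hypothetical "ready by" time

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def timeliness(rows, ready_by):
    """Rows that arrived after the agreed time."""
    return [r for r in rows if r["loaded_at"] > ready_by]


def validity(rows):
    """Rows whose email does not match the accepted format."""
    return [r for r in rows if not EMAIL_RE.match(r.get("email") or "")]


def completeness(rows, required):
    """Rows with a required attribute missing or null."""
    return [r for r in rows if any(r.get(f) is None for f in required)]


def consistency(rows, known_ids):
    """Rows whose id is unknown to the other system."""
    return [r for r in rows if r["id"] not in known_ids]


def accuracy(rows, truth):
    """Rows whose non-null amount disagrees with the source of truth."""
    return [r for r in rows
            if r["amount"] is not None and truth.get(r["id"]) != r["amount"]]


def uniqueness(rows, key):
    """Rows that repeat a key value already seen earlier."""
    seen, duplicates = set(), []
    for r in rows:
        if r[key] in seen:
            duplicates.append(r)
        seen.add(r[key])
    return duplicates


# Count failing rows per dimension.
report = {
    "timeliness": len(timeliness(records, deadline)),
    "validity": len(validity(records)),
    "completeness": len(completeness(records, ("id", "email", "amount"))),
    "consistency": len(consistency(records, other_system_ids)),
    "accuracy": len(accuracy(records, source_of_truth)),
    "uniqueness": len(uniqueness(records, "id")),
}
print(report)
```

In practice each of these checks would usually be expressed in one of the validation tools rather than hand-written; the point of the sketch is only that each dimension maps to a concrete, testable predicate.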
Data validation tools:
- Soda Core - https://github.com/sodadata/soda-core
- Great Expectations
- Deequ - https://github.com/awslabs/deequ
- dbt tests
- Dagster asset checks
Common challenges:
- Managing data quality across distributed teams
- Retroactively enforcing standards and dealing with legacy systems
- Upfront developer cost of following data quality best practices
- Establishing ownership of data
Notes:
- Validation should occur at every stage of the data lifecycle (orchestration is a natural home for this)
- Platform owners and governance teams should establish frameworks that promote data quality enforcement and validation
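To make the first note concrete, the following is a small sketch of validation attached to every stage of a pipeline. It uses plain Python rather than any orchestrator's API: `run_pipeline`, the stage names, and the check functions are hypothetical names invented for this example, and a real orchestrator such as Dagster would provide its own mechanism for the same idea.

```python
from typing import Callable

Rows = list  # a list of dict records
Check = Callable[[Rows], bool]


def run_pipeline(rows, stages):
    """Run each (name, transform, checks) stage in order.

    Every check runs on the stage's output, and the pipeline stops at the
    first failed check so bad data does not reach downstream stages.
    """
    for name, transform, checks in stages:
        rows = transform(rows)
        for check in checks:
            if not check(rows):
                raise ValueError(
                    f"check {check.__name__!r} failed after stage {name!r}"
                )
    return rows


# Hypothetical checks.
def non_empty(rows):
    return len(rows) > 0


def no_null_amounts(rows):
    return all(r["amount"] is not None for r in rows)


def total_non_negative(rows):
    return all(r["total"] >= 0 for r in rows)


# Hypothetical stages.
def ingest(rows):
    return list(rows)


def clean(rows):
    return [r for r in rows if r["amount"] is not None]


def aggregate(rows):
    return [{"total": sum(r["amount"] for r in rows)}]


raw = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 5.0},
]

result = run_pipeline(raw, [
    ("ingest", ingest, [non_empty]),
    ("clean", clean, [non_empty, no_null_amounts]),
    ("aggregate", aggregate, [total_non_negative]),
])
print(result)
```

Keeping the checks next to the stage definitions is one way a platform team could make validation the default path: a stage cannot be registered without stating, even as an empty list, what is checked after it.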