Given a QA pipeline where a pointblank agent verifies that a periodically refreshed and appended dataset conforms to a set of expectations, what would be the most idiomatic and elegant approach to detecting drift and visualising differences between versions of the dataset?
Assumptions:

- The dataset is one rectangular table.
- The dataset is updated periodically: some records may change (upstream edits), some may be deleted (upstream QA), and new records will almost always be appended (which changes assumptions such as valid date ranges).
Naive approach:

1. After each update to the dataset, run the same pointblank validation pipeline, sourcing some parameters from an external file (e.g. variable names and types, valid ranges/values).
2. Save the interrogated report, the descriptive stats from scan_data(), and the current time (e.g. as RDS).
3. When re-running, load all previously saved reports, then parse each report/scan to extract the metric to be compared. In effect, something like pointblank::read_disk_multiagent(), but also covering pointblank::scan_data(). A sketch of this approach follows below.
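To make the naive approach concrete, here is a minimal sketch. The spec file, its `date_col`/`date_min`/`date_max` fields, and the `qa_snapshots` directory are hypothetical stand-ins, not pointblank conventions:

```r
library(pointblank)

# Externally sourced expectations: variable names/types, valid ranges, etc.
spec <- readRDS("column_spec.rds")

agent <-
  create_agent(tbl = tbl, tbl_name = "my_dataset") %>%
  col_vals_between(
    columns = spec$date_col,
    left = spec$date_min,
    right = spec$date_max
  ) %>%
  interrogate()

# Persist the interrogated agent; affix_datetime() stamps the file name so
# read_disk_multiagent() can later collect the whole history.
x_write_disk(agent, filename = affix_datetime("agent"), path = "qa_snapshots")

# scan_data() returns an HTML report object, so save easily parsed summary
# stats and metadata alongside it rather than scraping the report later.
saveRDS(
  list(
    timestamp = Sys.time(),
    n_rows    = nrow(tbl),
    scan      = scan_data(tbl),
    stats     = lapply(Filter(is.numeric, tbl), summary)
  ),
  file.path("qa_snapshots", affix_datetime("meta.rds"))
)
```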
Useful comparisons:

- A qcc control chart of the number of records (y) over the time of each data update (the timestamp of the pointblank report/scan), indicating whether the latest update added significantly more or fewer records than expected from past updates (assuming roughly linear growth of the original data; otherwise a simple dot/bar plot). See the qcc sketch after this list.
- For each variable, compare values, ranges, and distributions; unusual changes could indicate errors in upstream data extraction. A per-column sketch also follows below.
- On the complex end, detecting drift in distributions, which can affect predictive models ("explanation shift").
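For the record-count comparison, a sketch using the qcc package, assuming each update wrote a metadata list with `timestamp` and `n_rows` as in the snippet above (it needs a handful of past updates before the control limits are meaningful):

```r
library(qcc)

meta_files <- list.files("qa_snapshots", pattern = "^meta.*\\.rds$", full.names = TRUE)
snapshots  <- lapply(meta_files, readRDS)

ts     <- do.call(c, lapply(snapshots, `[[`, "timestamp"))
n_rows <- vapply(snapshots, `[[`, numeric(1), "n_rows")

ord <- order(ts)
records_added <- diff(n_rows[ord])  # records gained (or lost) per update

# Individuals control chart: points outside the control limits mark updates
# that added significantly more or fewer records than past updates suggest.
qcc(
  records_added,
  type   = "xbar.one",
  labels = format(ts[ord][-1], "%Y-%m-%d")
)
```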
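For the per-variable comparison, a hedged sketch in base R: a two-sample Kolmogorov-Smirnov test for numeric columns and simple level churn for everything else. `old` and `new` stand for any two saved versions of the table:

```r
# Compare the columns shared by two versions of the dataset.
compare_columns <- function(old, new) {
  shared <- intersect(names(old), names(new))
  res <- lapply(shared, function(col) {
    if (is.numeric(old[[col]]) && is.numeric(new[[col]])) {
      # Two-sample KS test: a small p-value suggests the distribution shifted.
      p <- suppressWarnings(stats::ks.test(old[[col]], new[[col]])$p.value)
      data.frame(column = col, check = "ks_p_value", value = p)
    } else {
      # For non-numeric columns, count levels that appeared or disappeared.
      churn <- length(setdiff(unique(new[[col]]), unique(old[[col]]))) +
        length(setdiff(unique(old[[col]]), unique(new[[col]])))
      data.frame(column = col, check = "level_churn", value = churn)
    }
  })
  do.call(rbind, res)
}
```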
So in short: is there some existing approach to, or interest in, a pointblank::read_disk_multiagent() equivalent for pointblank::scan_data()?
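For the agent half this already works today: the saved agents can be collected into a multiagent and rendered as a single report across versions (the file pattern matches the sketch above):

```r
library(pointblank)

multiagent <- read_disk_multiagent(pattern = "^agent", path = "qa_snapshots")
get_multiagent_report(multiagent, display_mode = "wide")
```

The missing piece is an equivalent collector for saved scan_data() results.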