Given a QA pipeline where a pointblank agent verifies that a periodically refreshed and appended dataset conforms to a set of expectations, what would be the most idiomatic and elegant approach to detecting drift and visualising differences between versions of the dataset?
Assumptions:

- The dataset is one rectangular table.
- The dataset is updated periodically: some records may change (upstream edits), some may be deleted (upstream QA), and new records will almost always be appended (which changes assumptions such as valid date ranges).
Naive approach:

1. After each update to the dataset, run the same pointblank validation pipeline, sourcing some parameters from an external file (e.g. variable names and types, valid ranges/values).
2. Save the interrogated report, the descriptive stats from scan_data(), and the current time (e.g. as RDS).
3. When re-running, load all previously saved reports, then parse each report/scan to extract the metric to be compared. In effect, something like pointblank::read_disk_multiagent(), but also covering pointblank::scan_data(). A sketch of this approach follows below.
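To make the naive approach concrete, here is a minimal sketch. The spec file, its `date_col`/`date_min`/`date_max` fields, and the `qa_snapshots` directory are hypothetical stand-ins, not pointblank conventions:

```r
library(pointblank)

# Externally sourced expectations: variable names/types, valid ranges, etc.
spec <- readRDS("column_spec.rds")

agent <-
  create_agent(tbl = tbl, tbl_name = "my_dataset") %>%
  col_vals_between(
    columns = spec$date_col,
    left = spec$date_min,
    right = spec$date_max
  ) %>%
  interrogate()

# Persist the interrogated agent; affix_datetime() stamps the file name so
# read_disk_multiagent() can later collect the whole history.
x_write_disk(agent, filename = affix_datetime("agent"), path = "qa_snapshots")

# scan_data() returns an HTML report object, so save easily parsed summary
# stats and metadata alongside it rather than scraping the report later.
saveRDS(
  list(
    timestamp = Sys.time(),
    n_rows    = nrow(tbl),
    scan      = scan_data(tbl),
    stats     = lapply(Filter(is.numeric, tbl), summary)
  ),
  file.path("qa_snapshots", affix_datetime("meta.rds"))
)
```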
Useful comparisons:

- A qcc control chart of the number of records (y) over the time of each data update (the timestamp of the pointblank report/scan), indicating whether the latest update added significantly more or fewer records than expected from past updates (assuming roughly linear growth of the original data; otherwise a simple dot/bar plot). See the qcc sketch after this list.
- For each variable, compare values, ranges, and distributions; unusual changes could indicate errors in upstream data extraction. A per-column sketch also follows below.
- On the complex end, detecting drift in distributions, which can affect predictive models ("explanation shift").
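For the record-count comparison, a sketch using the qcc package, assuming each update wrote a metadata list with `timestamp` and `n_rows` as in the snippet above (it needs a handful of past updates before the control limits are meaningful):

```r
library(qcc)

meta_files <- list.files("qa_snapshots", pattern = "^meta.*\\.rds$", full.names = TRUE)
snapshots  <- lapply(meta_files, readRDS)

ts     <- do.call(c, lapply(snapshots, `[[`, "timestamp"))
n_rows <- vapply(snapshots, `[[`, numeric(1), "n_rows")

ord <- order(ts)
records_added <- diff(n_rows[ord])  # records gained (or lost) per update

# Individuals control chart: points outside the control limits mark updates
# that added significantly more or fewer records than past updates suggest.
qcc(
  records_added,
  type   = "xbar.one",
  labels = format(ts[ord][-1], "%Y-%m-%d")
)
```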
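For the per-variable comparison, a hedged sketch in base R: a two-sample Kolmogorov-Smirnov test for numeric columns and simple level churn for everything else. `old` and `new` stand for any two saved versions of the table:

```r
# Compare the columns shared by two versions of the dataset.
compare_columns <- function(old, new) {
  shared <- intersect(names(old), names(new))
  res <- lapply(shared, function(col) {
    if (is.numeric(old[[col]]) && is.numeric(new[[col]])) {
      # Two-sample KS test: a small p-value suggests the distribution shifted.
      p <- suppressWarnings(stats::ks.test(old[[col]], new[[col]])$p.value)
      data.frame(column = col, check = "ks_p_value", value = p)
    } else {
      # For non-numeric columns, count levels that appeared or disappeared.
      churn <- length(setdiff(unique(new[[col]]), unique(old[[col]]))) +
        length(setdiff(unique(old[[col]]), unique(new[[col]])))
      data.frame(column = col, check = "level_churn", value = churn)
    }
  })
  do.call(rbind, res)
}
```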
So in short: is there some existing approach to, or interest in, a pointblank::read_disk_multiagent() equivalent for pointblank::scan_data()?
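For the agent half this already works today: the saved agents can be collected into a multiagent and rendered as a single report across versions (the file pattern matches the sketch above):

```r
library(pointblank)

multiagent <- read_disk_multiagent(pattern = "^agent", path = "qa_snapshots")
get_multiagent_report(multiagent, display_mode = "wide")
```

The missing piece is an equivalent collector for saved scan_data() results.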