General Data Checks
Want to check a few general but important things about your data? Panda Patrol comes pre-built with a few checks that run on your data. The best part? It only takes one function call to run these checks.
To run a variety of pre-built checks on your data, simply call the check function. This function takes the following parameters:
- df: pd.DataFrame - The dataframe to run the checks on.
- patrol_group_name: str - Name of the patrol group. See Patrol Groups and Patrols for more information.
When called, this function will run a variety of checks on your data and show those results in the dashboard.
The following example uses Prefect for data pipeline orchestration.
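from io import StringIO

import pandas as pd
import requests
from prefect import flow, task

from panda_patrol.checks import check


@task(retries=3)
def fetch_data():
    """
    Fetch the Titanic dataset using requests.
    """
    url = (
        "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
    )
    response = requests.get(url)
    # Raise on HTTP errors so that the task's retries can take effect.
    response.raise_for_status()
    data = pd.read_csv(StringIO(response.text))
    # Run the pre-built checks and report them under the "Titanic Dataset" patrol group.
    check(data, "Titanic Dataset")
    return data


@flow(name="Titanic", log_prints=True)
def run_titanic_analysis():
    """
    Fetch and transform the Titanic dataset.
    """
    fetch_data()


if __name__ == "__main__":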
    run_titanic_analysis.serve(name="run_titanic_analysis")

The library uses a series of heuristics to decide whether a check should be applied to the data. If the heuristic is met, the check is applied. You can see the heuristics here.

The first run of a check creates the benchmark, and subsequent runs compare the data against it. For example, if a check creates min and max benchmarks, the first run records them and subsequent runs compare the min and max values of the data against those benchmarks. See Parameters for more information on how these benchmarks can be adjusted.

At a high level, the checks are:
- Accuracy: Runs if the column is made of integers or floats. If so, it checks if the values are within a certain range.
- Completeness: Always runs. Checks if the column has exceeded a certain threshold of missing values.
- Duplicates: Always runs. Checks if the column has exceeded a certain threshold of duplicate values.
- Enums: Runs if the column is made of a few unique values. If so, it checks if the number of these unique values remains the same.
- Freshness: Runs if the column is a datetime. If so, it checks that the latest date is within n days of the current date.
- Volume: Runs if the dataframe contains one or more rows. If so, it checks if the number of rows is less than or equal to 1.5 times the number of rows in the benchmark.
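
To make the benchmark idea concrete, here is a minimal sketch in plain pandas of the "first run creates the benchmark, later runs compare against it" pattern, applied to the Accuracy and Volume checks described above. This is not Panda Patrol's implementation: the function names `build_benchmark` and `compare_to_benchmark`, the `volume_factor` parameter, and the sample data are all hypothetical and exist only for illustration.

```python
import pandas as pd


def build_benchmark(df: pd.DataFrame) -> dict:
    """First run: record the row count and the min/max of each numeric column."""
    numeric = df.select_dtypes(include="number")
    return {
        "rows": len(df),
        "ranges": {
            col: (numeric[col].min(), numeric[col].max()) for col in numeric.columns
        },
    }


def compare_to_benchmark(
    df: pd.DataFrame, benchmark: dict, volume_factor: float = 1.5
) -> dict:
    """Later runs: compare new data against the stored benchmark (hypothetical helper)."""
    results = {}
    # Volume: the row count must be at most volume_factor times the benchmark row count.
    results["volume"] = len(df) <= volume_factor * benchmark["rows"]
    # Accuracy: every non-missing value must fall inside the benchmark min/max range.
    for col, (low, high) in benchmark["ranges"].items():
        values = df[col].dropna()
        results[f"accuracy:{col}"] = bool(values.between(low, high).all())
    return results


# First run: this data defines the benchmark.
first_run = pd.DataFrame({"age": [22, 38, 26], "fare": [7.25, 71.28, 7.92]})
benchmark = build_benchmark(first_run)

# Second run: the age column contains values outside the benchmark range.
second_run = pd.DataFrame({"age": [35, 54, 120], "fare": [8.05, 51.86, 21.07]})
report = compare_to_benchmark(second_run, benchmark)
print(report)
```

In this sketch the second run passes the volume check (3 rows against a limit of 4.5) and the fare range check, but fails the age range check because 54 and 120 exceed the benchmark maximum of 38. The real library handles this for you and reports the outcome in the dashboard.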