Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notebook Examples for DP + GE: expect_profile_numeric_columns_percent #625

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
"\n",
"- **Expectation Used:** [expect_profile_numeric_columns_diff_between_inclusive_threshold_range](https://github.com/great-expectations/great_expectations/blob/develop/contrib/capitalone_dataprofiler_expectations/capitalone_dataprofiler_expectations/expectations/expect_profile_numeric_columns_diff_between_inclusive_threshold_range.py)\n",
"\n",
"- **Expectation Description:** This expectation will check the use-specified difference between use-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.\n",
"- **Expectation Description:** This expectation will check the user-specified difference between user-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.\n",
"\n",
"- **Example Details:** In this example a data owner has a dataset containing salary information on individuals in the data science field ranging from **2020** to **2022**. The data owner has been tasked with checking the historical trends of salary growth across the industry with the numbers they have. They want to use this expectation to determine if the median growth of salaries at their company are within the same range of median growth as the companies they have data on."
],
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,303 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"## Overview\n",
"\n",
"- **Expectation Used:** [expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range](https://github.com/great-expectations/great_expectations/blob/develop/contrib/capitalone_dataprofiler_expectations/capitalone_dataprofiler_expectations/expectations/expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range.py)\n",
"\n",
"- **Expectation Description:** This expectation will check the user-specified percent difference between user-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.\n",
"\n",
"- **Example Details:** In this example a data owner has a dataset containing salary information on individuals in the data science field ranging from **2020** to **2022**. The data owner wants to use their data to compare the salary growth from **2020** to **2022** with the inflation that the world experienced during the pandemic. They want to use this expectation to determine if the median growth of salaries in the industry are approximately around the same percent as pandemic inflation numbers."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Imports"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import os\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Great expectations imports\n",
"import great_expectations as ge\n",
"from capitalone_dataprofiler_expectations.expectations. \\\n",
" expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range \\\n",
" import ExpectProfileNumericColumnsPercentDiffBetweenExclusiveThresholdRange\n",
"from great_expectations.self_check.util import build_pandas_validator_with_data\n",
"\n",
"# Data Profiler import\n",
"import dataprofiler as dp"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Setup\n",
"Below we are going to import a dataset from the Data Profile testing suite. This csv holds information on the salaries of individuals in the data science field from all over the world."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"context = ge.get_context()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"data_path = \"../../dataprofiler/tests/data/csv/ds_salaries.csv\"\n",
"data = dp.Data(data_path).data\n",
"data"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"For this expectation we are going to compare the diff median `salary_in_usd` between `work_year` **2020** and **2022**. Below we are gathering the different years that are recorded in this dataset."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"data.sort_values(by=\"work_year\", axis=0, inplace=True)\n",
"years = data[\"work_year\"].unique().tolist()\n",
"years"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Now that we have the years, we will capture all records from each year in their own dataframes so that we can process them separately."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"individual_dataframes = []\n",
"for year in years:\n",
" current_year_df = data.loc[data[\"work_year\"]==year]\n",
" current_year_df = current_year_df.drop(\"work_year\", axis=1)\n",
" individual_dataframes.append(current_year_df)\n",
"year_2020_salary_df = individual_dataframes[0]\n",
"year_2022_salary_df = individual_dataframes[2]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Now we will create a report on the first individual dataframe which corresponds to the year **2020**, then we will output the median `salary_in_usd` from this dataframe."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"profiler_options = dp.ProfilerOptions()\n",
"profiler_options.set({\"data_labeler.is_enabled\": False})\n",
"\n",
"profile = dp.Profiler(year_2020_salary_df, len(year_2020_salary_df), options=profiler_options)\n",
"profile.save(filepath='year_2020_salary_profile.pkl')\n",
"report = profile.report(report_options={\"output_format\": \"compact\"})"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Let's take a look at the output `median` metric from the `salary_in_usd` column as found in the report"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"report['data_stats'][6]['statistics']['median']"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"As of August 2022, inflation rates since **2020** were **15%**. The data owner wanted to set up an expectation that the percent difference in the `median` metric for the `salary_in_usd` column is greater than **13%** but less than **17%**."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"validator = build_pandas_validator_with_data(year_2022_salary_df)\n",
"results = validator.expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range(\n",
" profile_path='year_2020_salary_profile.pkl',\n",
" limit_check_report_keys={\n",
" \"salary_in_usd\": {\n",
" \"median\": {\"lower\": .13, \"upper\": .17},\n",
" },\n",
" }\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Results\n",
"From the output below, the data owner can see that the expectation has an unexpected value. The result shows that the diff between the two profiles is significantly more than the upperbound at about **0.58** or **58%**. The data that the data owner has indicates significantly higher salary growth than inflation over the last two years."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"results"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}