diff --git a/examples/great_expecations/ge_expect_profile_numeric_columns_diff_in_range.ipynb b/examples/great_expecations/ge_expect_profile_numeric_columns_diff_in_range.ipynb index d2399dad2..8fc41fb53 100644 --- a/examples/great_expecations/ge_expect_profile_numeric_columns_diff_in_range.ipynb +++ b/examples/great_expecations/ge_expect_profile_numeric_columns_diff_in_range.ipynb @@ -7,7 +7,7 @@ "\n", "- **Expectation Used:** [expect_profile_numeric_columns_diff_between_inclusive_threshold_range](https://github.com/great-expectations/great_expectations/blob/develop/contrib/capitalone_dataprofiler_expectations/capitalone_dataprofiler_expectations/expectations/expect_profile_numeric_columns_diff_between_inclusive_threshold_range.py)\n", "\n", - "- **Expectation Description:** This expectation will check the use-specified difference between use-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.\n", + "- **Expectation Description:** This expectation will check the user-specified difference between user-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.\n", "\n", "- **Example Details:** In this example a data owner has a dataset containing salary information on individuals in the data science field ranging from **2020** to **2022**. The data owner has been tasked with checking the historical trends of salary growth across the industry with the numbers they have. They want to use this expectation to determine if the median growth of salaries at their company are within the same range of median growth as the companies they have data on." ], diff --git a/examples/great_expecations/ge_expect_profile_numeric_columns_percent_diff_in_range.ipynb b/examples/great_expecations/ge_expect_profile_numeric_columns_percent_diff_in_range.ipynb new file mode 100644 index 000000000..7b0d2ff19 --- /dev/null +++ b/examples/great_expecations/ge_expect_profile_numeric_columns_percent_diff_in_range.ipynb @@ -0,0 +1,303 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "## Overview\n", + "\n", + "- **Expectation Used:** [expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range](https://github.com/great-expectations/great_expectations/blob/develop/contrib/capitalone_dataprofiler_expectations/capitalone_dataprofiler_expectations/expectations/expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range.py)\n", + "\n", + "- **Expectation Description:** This expectation will check the user-specified percent difference between user-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.\n", + "\n", + "- **Example Details:** In this example a data owner has a dataset containing salary information on individuals in the data science field ranging from **2020** to **2022**. The data owner wants to use their data to compare the salary growth from **2020** to **2022** with the inflation that the world experienced during the pandemic. They want to use this expectation to determine if the median growth of salaries in the industry are approximately around the same percent as pandemic inflation numbers." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### Imports" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "# Great expectations imports\n", + "import great_expectations as ge\n", + "from capitalone_dataprofiler_expectations.expectations. \\\n", + " expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range \\\n", + " import ExpectProfileNumericColumnsPercentDiffBetweenExclusiveThresholdRange\n", + "from great_expectations.self_check.util import build_pandas_validator_with_data\n", + "\n", + "# Data Profiler import\n", + "import dataprofiler as dp" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### Setup\n", + "Below we are going to import a dataset from the Data Profile testing suite. This csv holds information on the salaries of individuals in the data science field from all over the world." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "context = ge.get_context()" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "data_path = \"../../dataprofiler/tests/data/csv/ds_salaries.csv\"\n", + "data = dp.Data(data_path).data\n", + "data" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "For this expectation we are going to compare the diff median `salary_in_usd` between `work_year` **2020** and **2022**. Below we are gathering the different years that are recorded in this dataset." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "data.sort_values(by=\"work_year\", axis=0, inplace=True)\n", + "years = data[\"work_year\"].unique().tolist()\n", + "years" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Now that we have the years, we will capture all records from each year in their own dataframes so that we can process them separately." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "individual_dataframes = []\n", + "for year in years:\n", + " current_year_df = data.loc[data[\"work_year\"]==year]\n", + " current_year_df = current_year_df.drop(\"work_year\", axis=1)\n", + " individual_dataframes.append(current_year_df)\n", + "year_2020_salary_df = individual_dataframes[0]\n", + "year_2022_salary_df = individual_dataframes[2]" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Now we will create a report on the first individual dataframe which corresponds to the year **2020**, then we will output the median `salary_in_usd` from this dataframe." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "profiler_options = dp.ProfilerOptions()\n", + "profiler_options.set({\"data_labeler.is_enabled\": False})\n", + "\n", + "profile = dp.Profiler(year_2020_salary_df, len(year_2020_salary_df), options=profiler_options)\n", + "profile.save(filepath='year_2020_salary_profile.pkl')\n", + "report = profile.report(report_options={\"output_format\": \"compact\"})" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Let's take a look at the output `median` metric from the `salary_in_usd` column as found in the report" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "report['data_stats'][6]['statistics']['median']" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "As of August 2022, inflation rates since **2020** were **15%**. The data owner wanted to set up an expectation that the percent difference in the `median` metric for the `salary_in_usd` column is greater than **13%** but less than **17%**." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "validator = build_pandas_validator_with_data(year_2022_salary_df)\n", + "results = validator.expect_profile_numeric_columns_percent_diff_between_exclusive_threshold_range(\n", + " profile_path='year_2020_salary_profile.pkl',\n", + " limit_check_report_keys={\n", + " \"salary_in_usd\": {\n", + " \"median\": {\"lower\": .13, \"upper\": .17},\n", + " },\n", + " }\n", + ")" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "### Results\n", + "From the output below, the data owner can see that the expectation has an unexpected value. The result shows that the diff between the two profiles is significantly more than the upperbound at about **0.58** or **58%**. The data that the data owner has indicates significantly higher salary growth than inflation over the last two years." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "results" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file