feat(tutorials): add main tutorial #20
Conversation
tutorials/main.ipynb
Outdated
"List of warnings sorted by priority:\n",
"\t[DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns. (Priority 1: heavy impact expected)\n",
"\t[PREDEFINED VALUED MISSING VALUES] Found 1960 vmvs in the dataset. (Priority 2: usage allowed, limited human intelligibility)\n",
"\t[FLATLINES] Found 4627 flatline events with a minimun length of 5 among the columns {'relationship', 'workclass2', 'workclass', 'marital-status', 'capital-gain', 'capital-loss', 'sex', 'education-num', 'income', 'education', 'native-country', 'occupation', 'hours-per-week', 'race'}. (Priority 2: usage allowed, limited human intelligibility)\n",
I am guessing a good tweak for the VMV engine is to make the flatline detector mute on categorical variables by default (leveraging dtypes). Notice that here we are raising a warning for flatlines even in the sex category 😆.
Generalizing further, we might need something that detects the dataset type, so that we don't even look for flatline events in tabular datasets.
P.S.: I was actually working on a setter method for the dataset type at some point, but I don't think I ever implemented it; it will probably be useful in our attempt to integrate all the engines 🤔
I'd add a requirement to the flatlines VMV test that a TimeIndex must exist in the dataset (this would define the dataset as time-series). Something in between options 1 and 2 below seems feasible for now.
Options, by degree of effort:
1. If no time index exists, skip flatlines.
2. Define a DatasetType Enum (Tabular, Timeseries) and a method to infer the dataset type.
3. Option 2, plus add the DatasetType as an argument of the engines to determine which tests will run.
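The options above could be sketched roughly like this (a minimal sketch only; `DatasetType`, `infer_dataset_type`, and `run_flatlines` are hypothetical names for illustration, not existing ydata_quality API):

```python
from enum import Enum

import pandas as pd


class DatasetType(Enum):
    TABULAR = "tabular"
    TIMESERIES = "timeseries"


def infer_dataset_type(df: pd.DataFrame) -> DatasetType:
    """Infer the dataset type from the dataframe index (option 2).

    A datetime-like index is taken as evidence of a time-series dataset;
    anything else is treated as tabular.
    """
    if isinstance(df.index, (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)):
        return DatasetType.TIMESERIES
    return DatasetType.TABULAR


def run_flatlines(df: pd.DataFrame, dataset_type: DatasetType):
    """Option 3: the engine receives the DatasetType and decides whether to run."""
    if dataset_type is not DatasetType.TIMESERIES:
        return None  # skip flatline detection on tabular data
    return "flatline results"  # placeholder for the real detection logic
```

This way the aggregator infers the type once and each engine gates itself on it, instead of every engine re-deriving the same heuristic.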
I am thinking the same, and actually that at this point we might just go for option 3.
At this point it should be fair to assume that the index column of the passed dataframe is to be used as the index.
I had already worked on a method to detect valid time-series index types. Positive detection could be used as our heuristic to detect time-series datasets and to drive the logic of the engines and the aggregator engine.
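A detection method along those lines might look like this (a sketch under assumptions: the function name and the string-coercion heuristic are mine, not the actual implementation):

```python
import pandas as pd


def has_valid_timeseries_index(df: pd.DataFrame) -> bool:
    """Heuristic check for a valid time-series index.

    Positive detection: the index is already datetime-like, or its values
    can be coerced to datetimes. Plain numeric indexes (e.g. the default
    RangeIndex) are rejected, since coercing integers to epoch timestamps
    would produce false positives.
    """
    if isinstance(df.index, (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)):
        return True
    if pd.api.types.is_numeric_dtype(df.index):
        return False
    try:
        pd.to_datetime(df.index.to_series(), errors="raise")
        return True
    except (ValueError, TypeError):
        return False
```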
"\n",
"df = pd.read_csv(f'../datasets/transformed/census_10k_v3.csv') # load data\n",
"dq = DataQuality(df=df) # create the main class that holds all quality modules\n",
"results = dq.evaluate() # run the tests"
In evaluate of DataQuality, should we print out the engines that get skipped by default due to missing args?
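One way evaluate could surface this (purely illustrative; the engine names, the required-argument mapping, and the class shape are hypothetical, not the real DataQuality internals):

```python
class DataQualitySketch:
    """Toy stand-in for DataQuality that reports skipped engines."""

    def __init__(self, df, label=None, time_index=None):
        self.df = df
        self._args = {"label": label, "time_index": time_index}
        # engine name -> (callable, optional args it requires)
        self._engines = {
            "duplicates": (lambda: "ok", []),
            "labelling": (lambda: "ok", ["label"]),
            "flatlines": (lambda: "ok", ["time_index"]),
        }

    def evaluate(self):
        results, skipped = {}, []
        for name, (run, required) in self._engines.items():
            missing = [arg for arg in required if self._args.get(arg) is None]
            if missing:
                skipped.append(name)
                print(f"Skipping engine '{name}': missing args {missing}")
                continue
            results[name] = run()
        return results, skipped
```

With no label or time_index passed, evaluate would run only the duplicates engine and print a skip message for the other two.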
Those messages are being printed, but I'm hiding them with the %%capture magic, because we still have a lot of non-relevant messages being printed at this time. Whenever that is fixed, we can remove the %%capture and the print messages for skipped engines will appear.
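For reference, %%capture swallows a notebook cell's stdout/stderr; the same effect in plain Python, for illustration (the skip message below is just an example string):

```python
import io
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    # everything printed here is captured instead of shown,
    # analogous to running a notebook cell under %%capture
    print("Skipping engine 'labelling': missing args ['label']")

captured = buffer.getvalue()  # the hidden text, retrievable on demand
```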
Great job! This looks awesome, just don't forget to update the transformed dataset 👍🏼 🚀
Added the main tutorial introducing the ydata_quality package.