Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(tutorials): add main tutorial #20

Merged
merged 3 commits into from
Sep 15, 2021
Merged

Conversation

UrbanoFonseca
Copy link
Contributor

Added main tutorial for introduction to ydata_quality package.

@UrbanoFonseca UrbanoFonseca added the feature A new feature label Sep 14, 2021
@UrbanoFonseca UrbanoFonseca self-assigned this Sep 14, 2021
tutorials/main.ipynb Outdated Show resolved Hide resolved
tutorials/main.ipynb Outdated Show resolved Hide resolved
"List of warnings sorted by priority:\n",
"\t[DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns. (Priority 1: heavy impact expected)\n",
"\t[PREDEFINED VALUED MISSING VALUES] Found 1960 vmvs in the dataset. (Priority 2: usage allowed, limited human intelligibility)\n",
"\t[FLATLINES] Found 4627 flatline events with a minimun length of 5 among the columns {'relationship', 'workclass2', 'workclass', 'marital-status', 'capital-gain', 'capital-loss', 'sex', 'education-num', 'income', 'education', 'native-country', 'occupation', 'hours-per-week', 'race'}. (Priority 2: usage allowed, limited human intelligibility)\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am guessing a good tweak for the VMV engine is for us to make the flatline detector mute on categorical variables by default (levering dtypes). Notice here we are raising warning for flatlines in the sex category even 😆.
Generalizing further we might need something that detects the dataset type, so that we don't even look for flatline events in tabular datasets.

P.S: I was actually working on a setter method for the dataset type at some point but did not implement I think, probably will be useful in our integration attempt of all engines 🤔

Copy link
Contributor Author

@UrbanoFonseca UrbanoFonseca Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a requirement to the flatlines VMV test that a TimeIndex must exist in the dataset (this would define the dataset as time-series). Something in-between options 1 and 2 below seems feasible for now

Options per degree of effort

  1. If no time index exists, skip flatlines
  2. Define a DatasetType Enum (Tabular, Timeseries), a method to infer the dataset type and
  3. Option 2 & Add the DatasetType as an argument of the engines to determine which tests will run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking the same and actually that at this point we might just go for 3.
At this point should be fair to assume that the index column of the passed dataframe is to be used as index.
I had already worked on a method to detect valid timeseries index types. Positive detection could be used as our heuristic to detect timeseries datasets and conduct the logic of the engines and the aggregator engine.

tutorials/main.ipynb Outdated Show resolved Hide resolved
"\n",
"df = pd.read_csv(f'../datasets/transformed/census_10k_v3.csv') # load data\n",
"dq = DataQuality(df=df) # create the main class that holds all quality modules\n",
"results = dq.evaluate() # run the tests"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At evaluate of DataQuality should we print out engines getting skipped by default due to missing args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those messages are being printed but I'm hiding them with the %%capture magic, because we still have a lot of non-relevant messages being printed at this time. Whenever that is fixed, we can remove the %%capture and the print messages of skipped engines will appear

tutorials/main.ipynb Outdated Show resolved Hide resolved
tutorials/main.ipynb Outdated Show resolved Hide resolved
Copy link
Contributor

@jfsantos-ds jfsantos-ds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! This looks awesome, just don't forget to update the transformed dataset 👍🏼 🚀

tutorials/main.ipynb Outdated Show resolved Hide resolved
@UrbanoFonseca UrbanoFonseca marked this pull request as ready for review September 15, 2021 10:20
@UrbanoFonseca UrbanoFonseca merged commit 0761ea5 into master Sep 15, 2021
@portellaa portellaa deleted the feat/main-tutorial branch September 23, 2021 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants