feat(tutorials): add main tutorial #20
Conversation
tutorials/main.ipynb
Outdated
"List of warnings sorted by priority:\n",
"\t[DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns. (Priority 1: heavy impact expected)\n",
"\t[PREDEFINED VALUED MISSING VALUES] Found 1960 vmvs in the dataset. (Priority 2: usage allowed, limited human intelligibility)\n",
"\t[FLATLINES] Found 4627 flatline events with a minimun length of 5 among the columns {'relationship', 'workclass2', 'workclass', 'marital-status', 'capital-gain', 'capital-loss', 'sex', 'education-num', 'income', 'education', 'native-country', 'occupation', 'hours-per-week', 'race'}. (Priority 2: usage allowed, limited human intelligibility)\n",
I am guessing a good tweak for the VMV engine is to make the flatline detector mute on categorical variables by default (leveraging dtypes). Notice that here we are raising a warning for flatlines even in the sex category 😆.
Generalizing further, we might need something that detects the dataset type, so that we don't even look for flatline events in tabular datasets.
P.S.: I was actually working on a setter method for the dataset type at some point, but I don't think I ever implemented it; it will probably be useful in our attempt to integrate all the engines 🤔
I'd add a requirement to the flatlines VMV test that a TimeIndex must exist in the dataset (this would define the dataset as time-series). Something in between options 1 and 2 below seems feasible for now.
Options, by degree of effort:
1. If no time index exists, skip flatlines.
2. Define a DatasetType Enum (Tabular, Timeseries) and a method to infer the dataset type.
3. Option 2, plus add the DatasetType as an argument of the engines to determine which tests will run.
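The options above could be sketched roughly like this (a minimal sketch only; `DatasetType`, `infer_dataset_type`, and `run_flatlines` are hypothetical names for illustration, not existing ydata_quality API):

```python
from enum import Enum

import pandas as pd


class DatasetType(Enum):
    TABULAR = "tabular"
    TIMESERIES = "timeseries"


def infer_dataset_type(df: pd.DataFrame) -> DatasetType:
    """Infer the dataset type from the dataframe index (option 2).

    A datetime-like index is taken as evidence of a time-series dataset;
    anything else is treated as tabular.
    """
    if isinstance(df.index, (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)):
        return DatasetType.TIMESERIES
    return DatasetType.TABULAR


def run_flatlines(df: pd.DataFrame, dataset_type: DatasetType):
    """Option 3: the engine receives the DatasetType and decides whether to run."""
    if dataset_type is not DatasetType.TIMESERIES:
        return None  # skip flatline detection on tabular data
    return "flatline results"  # placeholder for the real detection logic
```

This way the aggregator infers the type once and each engine gates itself on it, instead of every engine re-deriving the same heuristic.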
I am thinking the same, and actually that at this point we might just go for option 3.
At this point it should be fair to assume that the index column of the passed dataframe is to be used as the index.
I had already worked on a method to detect valid time-series index types. Positive detection could be used as our heuristic to detect time-series datasets and to drive the logic of the engines and the aggregator engine.
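A detection method along those lines might look like this (a sketch under assumptions: the function name and the string-coercion heuristic are mine, not the actual implementation):

```python
import pandas as pd


def has_valid_timeseries_index(df: pd.DataFrame) -> bool:
    """Heuristic check for a valid time-series index.

    Positive detection: the index is already datetime-like, or its values
    can be coerced to datetimes. Plain numeric indexes (e.g. the default
    RangeIndex) are rejected, since coercing integers to epoch timestamps
    would produce false positives.
    """
    if isinstance(df.index, (pd.DatetimeIndex, pd.PeriodIndex, pd.TimedeltaIndex)):
        return True
    if pd.api.types.is_numeric_dtype(df.index):
        return False
    try:
        pd.to_datetime(df.index.to_series(), errors="raise")
        return True
    except (ValueError, TypeError):
        return False
```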
"\n",
"df = pd.read_csv(f'../datasets/transformed/census_10k_v3.csv') # load data\n",
"dq = DataQuality(df=df) # create the main class that holds all quality modules\n",
"results = dq.evaluate() # run the tests"
In evaluate of DataQuality, should we print out the engines that get skipped by default due to missing args?
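One way evaluate could surface this (purely illustrative; the engine names, the required-argument mapping, and the class shape are hypothetical, not the real DataQuality internals):

```python
class DataQualitySketch:
    """Toy stand-in for DataQuality that reports skipped engines."""

    def __init__(self, df, label=None, time_index=None):
        self.df = df
        self._args = {"label": label, "time_index": time_index}
        # engine name -> (callable, optional args it requires)
        self._engines = {
            "duplicates": (lambda: "ok", []),
            "labelling": (lambda: "ok", ["label"]),
            "flatlines": (lambda: "ok", ["time_index"]),
        }

    def evaluate(self):
        results, skipped = {}, []
        for name, (run, required) in self._engines.items():
            missing = [arg for arg in required if self._args.get(arg) is None]
            if missing:
                skipped.append(name)
                print(f"Skipping engine '{name}': missing args {missing}")
                continue
            results[name] = run()
        return results, skipped
```

With no label or time_index passed, evaluate would run only the duplicates engine and print a skip message for the other two.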
Those messages are being printed, but I'm hiding them with the %%capture magic, because we still have a lot of non-relevant messages being printed at this time. Whenever that is fixed, we can remove the %%capture and the print messages for skipped engines will appear.
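For reference, %%capture swallows a notebook cell's stdout/stderr; the same effect in plain Python, for illustration (the skip message below is just an example string):

```python
import io
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    # everything printed here is captured instead of shown,
    # analogous to running a notebook cell under %%capture
    print("Skipping engine 'labelling': missing args ['label']")

captured = buffer.getvalue()  # the hidden text, retrievable on demand
```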
Great job! This looks awesome, just don't forget to update the transformed dataset 👍🏼 🚀
Added the main tutorial introducing the ydata_quality package.