Possible data leakage #33

Closed
arogozhnikov opened this issue Jan 14, 2016 · 3 comments

@arogozhnikov

Hi, szilard!
Thanks for your benchmarks. I think you found an interesting dataset for comparison.

HOWEVER

The departure time present in the data is the exact time when the aircraft takes off.
Thus, by analyzing the flights from airport X to airport Y by carrier Z, one can establish at which time an aircraft should take off to be on time (and that's what deep trees do, I believe).

At least, I could easily see such patterns in data.

It doesn't seem very useful to predict whether an aircraft departs on time given that you already know this information.
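
To illustrate the kind of pattern described above, here is a minimal sketch on made-up rows. It assumes DepTime is an integer in hhmm form; the function names, the toy flights, and the 15-minute tolerance are all hypothetical and are not taken from the benchmark code.

```python
from collections import defaultdict

def to_minutes(hhmm):
    """Convert an hhmm integer such as 1435 to minutes after midnight."""
    return (hhmm // 100) * 60 + hhmm % 100

def learn_on_time_slots(train_rows):
    """Collect, per (origin, dest, carrier), the departure times of on-time flights."""
    slots = defaultdict(set)
    for origin, dest, carrier, dep_time, delayed in train_rows:
        if not delayed:
            slots[(origin, dest, carrier)].add(to_minutes(dep_time))
    return slots

def guess_delayed(slots, origin, dest, carrier, dep_time, tolerance=15):
    """Guess 'delayed' when the actual departure is far from every known on-time slot.

    Returns None when the route was never seen on time in training.
    The tolerance (in minutes) is an arbitrary assumption.
    """
    known = slots.get((origin, dest, carrier))
    if not known:
        return None
    t = to_minutes(dep_time)
    return all(abs(t - k) > tolerance for k in known)

# Toy training rows: (origin, dest, carrier, DepTime, delayed)
train = [
    ("X", "Y", "Z", 800, False),
    ("X", "Y", "Z", 805, False),
    ("X", "Y", "Z", 910, True),
]
slots = learn_on_time_slots(train)

close_to_schedule = guess_delayed(slots, "X", "Y", "Z", 803)
far_from_schedule = guess_delayed(slots, "X", "Y", "Z", 855)
unknown_route = guess_delayed(slots, "A", "B", "C", 803)
```

A tree-based model does not run this rule explicitly, but with enough depth it can split on the same route and departure-time combinations.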

So my suggestion is either to replace DepTime with PlannedDepTime (if you know how to get this information) or to set DepTime = DepTime // 200. The latter reduces the possibility of exploiting this information, while the altered feature still gives approximate information about the flight schedule.
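
A minimal sketch of the second option, assuming DepTime is stored as an integer in hhmm form (for example 1435 for 14:35). Under that assumption, integer division by 200 maps each value to a two-hour block. The helper name `coarsen_dep_time` is hypothetical.

```python
def coarsen_dep_time(dep_time: int) -> int:
    """Map an hhmm departure time to a coarse bucket via integer division by 200."""
    return dep_time // 200

# A few illustrative hhmm values and their buckets.
examples = [5, 759, 800, 1435, 2359]
buckets = {t: coarsen_dep_time(t) for t in examples}
```

Departures at 07:59 and 08:00 fall into different buckets (3 and 4), but anything between 08:00 and 09:59 shares bucket 4, so the exact take-off minute is no longer visible to the model.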

@szilard szilard changed the title [Fatal] Answer is in the data Possible data leakage Jan 14, 2016
@szilard
Owner

szilard commented Jan 14, 2016

Thanks for the feedback. Indeed, that's possible data leakage (at least partial).

The main goal of this project is to compare the scalability+speed+accuracy of various implementations of the same algos, so this should not matter for this goal.

It might be a problem for the comparison of e.g. GBM and DL, though I don't think either of them could exploit this without some feature engineering.

It might be worth running e.g. RF/GBM/DL with DepTime replaced with PlannedDepTime and comparing the AUCs. I might do that later on, but you are free to do that now if you want.
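
For reference, a minimal pure-Python sketch of how two sets of model scores could be compared by AUC. The labels and scores below are made up for illustration and are not benchmark results; in practice a library routine would normally be used instead of this quadratic-time helper.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = delayed) and scores from two hypothetical models.
labels = [1, 0, 1, 0]
scores_with_dep_time = [0.9, 0.1, 0.8, 0.2]
scores_with_planned = [0.6, 0.4, 0.3, 0.5]

auc_dep_time = auc(labels, scores_with_dep_time)
auc_planned = auc(labels, scores_with_planned)
```

A large drop in AUC after swapping the feature would suggest that the original model was relying on the leaked information.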

@arogozhnikov
Author

> It might be worth running e.g. RF/GBM/DL with DepTime replaced with PlannedDepTime and comparing the AUCs. I might do that later on, but you are free to do that now if you want.

Agreed.
At the moment I am testing different LibFM implementations on this data. I'll try to compare RF/GBDT on PlannedDepTime when I'm done.

Also, I don't see DL in the benchmarks on the flight data. Are the results bad, or have you not had time to test it?

@szilard
Owner

szilard commented Jan 15, 2016

Re: RF/GBDT on PlannedDepTime sounds great, thanks.

Re: DL. I started to do something, but I have not added results to the README yet. For the last few weeks I could not work on this, but I hope to get back to it soon. Anyway, here are some preliminary results with H2O and mxnet, but I'm planning to look at the other tools as well:
#28

@szilard szilard closed this as completed May 27, 2016