Possible data leakage #33
Thanks for the feedback. Indeed, that's a possible data leak (at least a partial one). The main goal of this project is to compare the scalability, speed, and accuracy of various implementations of the same algorithms, so it should not matter for that goal. It might be a problem, though, for comparisons across algorithms, e.g. GBM vs. DL, although I don't think either of them could exploit it without some feature engineering. It might be worth running e.g. RF/GBM/DL with DepTime replaced by PlannedDepTime and comparing the AUCs. I might do that later on, but you are free to do it now if you want.
Agreed. Also, I don't see DL in the benchmarks on the flight data. Were the results bad, or have you not had time to test it?
Re: RF/GBDT on PlannedDepTime: sounds great, thanks. Re: DL: I started working on it, but I have not added the results to the README yet. I could not work on this for the last few weeks, but I hope to get back to it soon. Anyway, here are some preliminary results with H2O and mxnet; I'm planning to look at the other tools as well:
Hi, szilard!
Thanks for your benchmarks; I think you found an interesting dataset for comparison.
HOWEVER
The departure time present in the data is the exact time when the aircraft takes off.
Thus, by analyzing the flights from airport X to airport Y by carrier Z, one can establish at what time an aircraft should take off to be on time (and that, I believe, is what deep trees do).
At least, I could easily see such patterns in the data.
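To make the pattern concrete, here is a minimal sketch of how such a leak could be exploited. Everything in it is an assumption for illustration: the toy rows are made up, each route/carrier group is assumed to have a single daily flight, the scheduled time is guessed as the most frequent actual departure time in the group, and a flight counts as delayed if it leaves 15 or more minutes after that guess. The helper names (`to_minutes`, `infer_schedule`, `predict_delayed`) are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (origin, dest, carrier, actual DepTime as HHMM, delayed label)
flights = [
    ("SFO", "JFK", "AA", 800, 0),
    ("SFO", "JFK", "AA", 800, 0),
    ("SFO", "JFK", "AA", 802, 0),
    ("SFO", "JFK", "AA", 845, 1),
    ("SFO", "JFK", "AA", 930, 1),
    ("ORD", "LAX", "UA", 1410, 0),
    ("ORD", "LAX", "UA", 1410, 0),
    ("ORD", "LAX", "UA", 1415, 0),
    ("ORD", "LAX", "UA", 1500, 1),
]


def to_minutes(hhmm):
    """Convert an HHMM integer such as 1435 to minutes after midnight."""
    return (hhmm // 100) * 60 + hhmm % 100


def infer_schedule(rows):
    """Guess the scheduled time per (origin, dest, carrier) as the most
    frequent actual departure time. Assumes one daily flight per group."""
    counts = defaultdict(Counter)
    for origin, dest, carrier, dep_time, _ in rows:
        counts[(origin, dest, carrier)][dep_time] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def predict_delayed(rows, schedule, threshold_min=15):
    """Flag a flight as delayed if it left at least threshold_min minutes
    after the guessed schedule. Uses no feature other than DepTime."""
    preds = []
    for origin, dest, carrier, dep_time, _ in rows:
        planned = schedule[(origin, dest, carrier)]
        late_by = to_minutes(dep_time) - to_minutes(planned)
        preds.append(1 if late_by >= threshold_min else 0)
    return preds


schedule = infer_schedule(flights)
predictions = predict_delayed(flights, schedule)
labels = [row[4] for row in flights]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(schedule)
print(accuracy)
```

On real data a route usually has several flights per day, so the grouping would need to be finer, but a deep tree splitting on origin, destination, carrier and DepTime can approximate the same rule.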
It doesn't seem very useful to predict whether an aircraft departs on time given that you already know this information.
So my suggestion is either to replace DepTime with PlannedDepTime (if you know how to get this information) or to set DepTime = DepTime // 200. The latter reduces the possibility of exploiting this information, while the altered feature still gives approximate information about the flight schedule.
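A small sketch of the second option, assuming DepTime is stored as an HHMM integer (e.g. 745 for 07:45). Under that assumption, integer division by 200 maps each time to one of twelve two-hour bins. The function name `coarsen_dep_time` is hypothetical.

```python
def coarsen_dep_time(dep_time_hhmm):
    """Map an HHMM departure time to a two-hour bin index (0..11 for 0000..2359)."""
    return dep_time_hhmm // 200


examples = [5, 159, 200, 745, 759, 800, 2359]
bins = [coarsen_dep_time(t) for t in examples]
print(list(zip(examples, bins)))
```

Note that 745 and 759 land in the same bin while 800 lands in the next one, so a short delay that crosses a bin boundary is still visible. The binning reduces the leak rather than removing it.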