Possible data leakage #33

arogozhnikov · 2016-01-14T12:56:04Z

Hi, szilard!
Thanks for your benchmarks. I think you found an interesting dataset for comparison.

HOWEVER

The departure time present in the data is the exact time when the aircraft takes off.
Thus, by analyzing the flights from airport X to airport Y by carrier Z, one can establish at what time a flight should take off to be on time (and that's what deep trees do, I believe).

At least, I could easily see such patterns in the data.
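
To illustrate the kind of pattern meant here, below is a minimal sketch on made-up records (not taken from the benchmark data). It assumes DepTime is an hhmm-encoded integer, that the earliest observed takeoff in each (origin, destination, carrier) group approximates the schedule, and that "delayed" means more than 15 minutes late. The function names and the threshold are hypothetical, chosen only for illustration.

```python
def to_minutes(hhmm):
    """Convert an hhmm-encoded integer such as 1435 to minutes after midnight."""
    return (hhmm // 100) * 60 + hhmm % 100


def infer_schedule(flights):
    """Guess each group's scheduled time as its earliest observed takeoff (an assumption)."""
    earliest = {}
    for origin, dest, carrier, dep_time in flights:
        key = (origin, dest, carrier)
        minutes = to_minutes(dep_time)
        earliest[key] = min(minutes, earliest.get(key, minutes))
    return earliest


def guess_delayed(flight, earliest, threshold=15):
    """Flag a flight as delayed if it left more than `threshold` minutes after the guess."""
    origin, dest, carrier, dep_time = flight
    return to_minutes(dep_time) - earliest[(origin, dest, carrier)] > threshold


# Hypothetical records: (origin, destination, carrier, actual DepTime in hhmm).
flights = [
    ("SFO", "LAX", "AA", 800),
    ("SFO", "LAX", "AA", 803),
    ("SFO", "LAX", "AA", 847),
    ("JFK", "ORD", "UA", 1430),
    ("JFK", "ORD", "UA", 1512),
]

schedule = infer_schedule(flights)
guesses = [guess_delayed(f, schedule) for f in flights]
print(guesses)
```

This toy version breaks down as soon as a carrier flies the same route several times a day, but a deep tree can split on DepTime finely enough to recover the individual schedule slots, which is the concern.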

It doesn't seem very useful to predict whether an aircraft departs on time when you already know the exact takeoff time.

So, my suggestion is either to replace DepTime with PlannedDepTime (if you know how to get this information) or to set DepTime = DepTime // 200. The coarsened feature makes the exact takeoff time harder to exploit, while still giving approximate information about the flight schedule.
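
A minimal sketch of the second option, assuming DepTime is stored as an hhmm-encoded integer (e.g. 1435 for 14:35). Under that assumption, integer division by 200 maps each time to one of twelve two-hour bins. The function name `coarsen_dep_time` is hypothetical.

```python
def coarsen_dep_time(dep_time_hhmm):
    """Map an hhmm-encoded departure time to a two-hour bin index (0 to 11)."""
    return dep_time_hhmm // 200


# Hypothetical departure times in hhmm encoding.
examples = [5, 759, 800, 959, 1435, 2359]
bins = [coarsen_dep_time(t) for t in examples]
print(bins)
```

Flights leaving at 08:00 and 09:59 land in the same bin, so a model can no longer tell a punctual takeoff from one that is a few minutes late.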

szilard · 2016-01-14T23:47:47Z

Thanks for the feedback. Indeed, that's possible data leakage (at least partial).

The main goal of this project is to compare the scalability+speed+accuracy of various implementations of the same algos, so this should not matter for this goal.

It might be a problem, however, for the comparison of e.g. GBM and DL, although I don't think either of them could exploit this without some feature engineering.

It might be worth trying to run e.g. RF/GBM/DL with DepTime replaced with PlannedDepTime and compare the AUCs. I might do that later on, but you are free to do that now if you want.

arogozhnikov · 2016-01-15T13:59:07Z

> It might be worth trying to run e.g. RF/GBM/DL with DepTime replaced with PlannedDepTime and compare the AUCs. I might do that later on, but you are free to do that now if you want.

Agree.
At the moment I am testing different LibFM implementations on this data. I'll try to compare RF/GBDT on PlannedDepTime when I'm done.

Also, I don't see DL in the benchmarks on the flight data. Are the results bad, or have you not had time to test?

szilard · 2016-01-15T14:55:52Z

Re: RF/GBDT on PlannedDepTime sounds great, thanks.

Re: DL. I started to do something, but I have not added results to the README yet. For the last few weeks I could not work on this, but I hope to get back to it soon. Anyway, here are some preliminary results with H2O and mxnet, but I'm planning to look at the other tools as well:
#28

szilard changed the title ~~[Fatal] Answer is in the data~~ Possible data leakage Jan 14, 2016

szilard closed this as completed May 27, 2016
