Overlap in dataset splits #28

jjcmoon · 2020-07-20T13:21:20Z

When looking at the results of make data in a clean clone of the repo, I see a small overlap in the NL descriptions of the train and test datasets (and likewise between train and dev). After investigating, it appears that an NL description can have multiple corresponding bash commands, and these pairs can be placed in different splits. The code in data/scripts/split_data.py seems to address this the wrong way around: it checks whether identical bash commands are placed in different splits. That would be appropriate for Bash2NL, but not for the other direction, NL to Bash.
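To illustrate what I mean, here is a minimal sketch of a split that groups by NL description instead of by command. It is not the code in data/scripts/split_data.py: the function name, the (nl, cmd) tuple format, the lowercase/strip normalization and the 80/10/10 ratios are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_description(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (nl, cmd) pairs so that every pair sharing an NL description
    ends up in the same split. Hypothetical helper, not the repo's API."""
    # Group all pairs under a normalized description key (assumed normalization).
    groups = defaultdict(list)
    for nl, cmd in pairs:
        groups[nl.strip().lower()].append((nl, cmd))

    # Shuffle the description keys, not the individual pairs.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(len(keys) * ratios[0])
    n_dev = int(len(keys) * ratios[1])
    buckets = {
        "train": keys[:n_train],
        "dev": keys[n_train:n_train + n_dev],
        "test": keys[n_train + n_dev:],
    }
    # Expand each bucket of keys back into its (nl, cmd) pairs.
    return {
        name: [pair for key in bucket for pair in groups[key]]
        for name, bucket in buckets.items()
    }
```

Note that this only prevents description overlap. Keeping identical commands in one split at the same time would require grouping pairs that share either a description or a command, which this sketch does not attempt.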

Since the number of descriptions with multiple commands is fairly small, the overlap is small too, so fixing it should lower the reported performance only slightly (my rough guess is around 1%; I have not tested this). But I figured you might still want to be aware of it.
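The size of the overlap can be measured rather than guessed. A small sketch, assuming the NL descriptions of each split have already been loaded into plain lists of strings (the function name is made up for this example, and loading the files is left out):

```python
def overlap_fraction(train_nl, test_nl):
    """Fraction of test examples whose NL description also occurs in train.
    Hypothetical helper; inputs are lists of description strings."""
    if not test_nl:
        return 0.0
    train_set = set(train_nl)
    # Count test rows (not unique descriptions) that also appear in train.
    hits = sum(1 for nl in test_nl if nl in train_set)
    return hits / len(test_nl)
```

The same call with the dev descriptions in place of the test descriptions gives the train/dev overlap.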
