Add Kaggle test splits #2675

Merged
merged 2 commits into from
Oct 25, 2022

Conversation

abidwael
Contributor

Adds missing test splits from a number of Kaggle datasets.

@abidwael abidwael requested a review from dantreiman October 19, 2022 08:50
github-actions bot commented Oct 19, 2022

Unit Test Results

6 files ±0, 6 suites ±0, duration 3h 46m 1s ⏱️ (+16m 56s)
3 504 tests (±0): 3 383 passed ✔️ (±0), 79 skipped 💤 (±0), 42 failed (±0)
10 456 runs (-56): 10 170 passed ✔️ (-43), 244 skipped 💤 (-13), 42 failed (±0)

For more details on these failures, see this check.

Results for commit 5c27609. ± Comparison against base commit d365d84.

♻️ This comment has been updated with latest results.

Contributor

@dantreiman dantreiman left a comment

I think we might want to define a different category for these "test" sets, because these are the test sets for submission to Kaggle and they don't have truth labels included. Thus we can't use them as a "test" split in Ludwig.

Note that train.csv has a loss column (representing the monetary loss from the insurance claim), while test.csv does not.

Contributor

@dantreiman dantreiman left a comment

I'm concerned that calling splits without a label column "test" might mislead users, since in Ludwig's process the test set is a held-out labeled set. We should probably introduce a new category to support unlabeled data for inference or contest submissions.
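
To make the distinction concrete, here is a minimal sketch of the kind of check being discussed: a split only qualifies as a labeled evaluation set if its header contains the label column. This is not Ludwig's API. The helper names `has_label_column` and `split_kind`, the `"labeled"`/`"unlabeled"` category names, and every column other than `loss` are hypothetical and used only for illustration.

```python
import csv
import io


def has_label_column(csv_text: str, label: str) -> bool:
    """Return True if the CSV header row contains the label column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return label in header


def split_kind(csv_text: str, label: str) -> str:
    """Classify a split as 'labeled' or 'unlabeled' (hypothetical category names)."""
    return "labeled" if has_label_column(csv_text, label) else "unlabeled"


# Toy stand-ins for train.csv and test.csv. Only the `loss` column comes from
# the discussion above; the other columns and all values are made up.
train_csv = "id,cat1,cont1,loss\n1,A,0.5,100.0\n"
test_csv = "id,cat1,cont1\n4,A,0.3\n"

print(split_kind(train_csv, "loss"))
print(split_kind(test_csv, "loss"))
```

Under these assumptions, a file classified as unlabeled could still be used for inference or a contest submission, but not as a held-out test split.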

@abidwael
Contributor Author

abidwael commented Oct 20, 2022

Good catch @dantreiman! I removed the test files from some of the other datasets that had them. We can add functionality for submitting to Kaggle in the future.

@abidwael abidwael requested a review from dantreiman October 20, 2022 17:15
@abidwael abidwael changed the title from "Add test splits for tabular datasets" to "Remove Kaggle test splits" Oct 20, 2022
@abidwael abidwael force-pushed the add-benchmarking-datasets branch from 91e2e33 to 5c27609 October 25, 2022 07:32
@abidwael abidwael changed the title from "Remove Kaggle test splits" to "Add Kaggle test splits" Oct 25, 2022
@abidwael abidwael merged commit d49b4d7 into master Oct 25, 2022
@abidwael abidwael deleted the add-benchmarking-datasets branch October 25, 2022 20:50
2 participants