Enable specifying custom columns and values for splitting. #1823

Closed
justinxzhao opened this issue Mar 16, 2022 · 0 comments

Currently, Ludwig's dataset splitting offers only the following options:

  • Splitting is disabled completely.
  • The dataset is split randomly into train, validation, and test sets, according to preprocessing.split_probabilities.
  • The dataset is split according to a special metadata column in the data, named split, with a fixed set of special values that map rows to each set: 0: train, 1: validation, 2: test.

We should extend this API to enable users to customize splitting by other columns and values.

preprocessing:
    split:
        force_split: false  # (moved)
        split_probabilities: [0.7, 0.1, 0.2]  # (moved)
        split_column: split  # (new) Name of the column that should be used for splitting
        train_values: [0]  # (new) Values in the split_column that should be associated with the training split
        validation_values: [1]  # (new) Values in the split_column that should be associated with the validation split
        test_values: [2]  # (new) Values in the split_column that should be associated with the test split
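
To make the intended semantics of the new options concrete, here is a minimal sketch in plain Python of how split_column, train_values, validation_values, and test_values could assign rows to splits. This is not Ludwig code: the function name split_by_column, the list-of-dicts row format, and the choice to raise an error on unmapped or doubly assigned values are all assumptions made for illustration. The defaults mirror the existing 0/1/2 convention on a column named split.

```python
from typing import Any, Dict, Hashable, Iterable, List, Sequence

Row = Dict[str, Any]


def split_by_column(
    rows: Iterable[Row],
    split_column: str = "split",
    train_values: Sequence[Hashable] = (0,),
    validation_values: Sequence[Hashable] = (1,),
    test_values: Sequence[Hashable] = (2,),
) -> Dict[str, List[Row]]:
    """Hypothetical helper: assign each row to a split by its split_column value."""
    # Build a value -> split-name lookup, rejecting values claimed by two splits.
    value_to_split: Dict[Hashable, str] = {}
    for name, values in (
        ("train", train_values),
        ("validation", validation_values),
        ("test", test_values),
    ):
        for value in values:
            if value in value_to_split:
                raise ValueError(
                    f"value {value!r} is assigned to both "
                    f"{value_to_split[value]!r} and {name!r}"
                )
            value_to_split[value] = name

    splits: Dict[str, List[Row]] = {"train": [], "validation": [], "test": []}
    for row in rows:
        value = row[split_column]
        # Assumption: an unmapped value is an error rather than a silently dropped row.
        if value not in value_to_split:
            raise ValueError(f"value {value!r} in column {split_column!r} has no split")
        splits[value_to_split[value]].append(row)
    return splits


# Usage: split on a custom column ("fold") with custom values.
rows = [
    {"text": "a", "fold": "A"},
    {"text": "b", "fold": "B"},
    {"text": "c", "fold": "C"},
    {"text": "d", "fold": "A"},
]
splits = split_by_column(
    rows,
    split_column="fold",
    train_values=["A"],
    validation_values=["B"],
    test_values=["C"],
)
```

Whether unmapped values should raise, be dropped, or fall back to random splitting is an open design question that the proposal above does not settle.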

Note: we may want to revisit this API if/when we support multiple test sets.
