Enable specifying custom columns and values for splitting. #1823

justinxzhao · 2022-03-16T20:31:35Z

Currently, Ludwig's dataset splitting supports only the following options:

Splitting is disabled completely.
The dataset is split randomly into train, validation, and test, according to preprocessing.split_probabilities.
The dataset is split according to a special metadata column in the data, split, with a fixed set of values that map rows to splits: 0 for train, 1 for validation, 2 for test.
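
For reference, the two existing behaviors described above can be sketched roughly as follows. This is only an illustration in plain Python, not Ludwig's actual implementation: the function names, the list-of-dicts row representation, and the seed parameter are all assumptions made for the example.

```python
import random

SPLIT_NAMES = ("train", "validation", "test")


def random_split(rows, split_probabilities=(0.7, 0.1, 0.2), seed=0):
    """Assign each row to a split at random, weighted by split_probabilities.

    Hypothetical helper: `rows` is a list of dicts, one dict per example.
    """
    rng = random.Random(seed)
    splits = {name: [] for name in SPLIT_NAMES}
    for row in rows:
        # Draw one split name per row, using the probabilities as weights.
        name = rng.choices(SPLIT_NAMES, weights=split_probabilities)[0]
        splits[name].append(row)
    return splits


def fixed_column_split(rows, column="split"):
    """Split on a metadata column holding 0 (train), 1 (validation), 2 (test)."""
    splits = {name: [] for name in SPLIT_NAMES}
    for row in rows:
        # The column value is used directly as an index into SPLIT_NAMES.
        splits[SPLIT_NAMES[row[column]]].append(row)
    return splits
```

The fixed mapping in fixed_column_split is the limitation this issue is about: both the column name and the values 0, 1, 2 are baked in.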

We should extend this API to enable users to customize splitting by other columns and values.

preprocessing:
    split:
        (moved) force_split: false
        (moved) split_probabilities: [0.7, 0.1, 0.2]
        (new) split_column: split # Name of column that should be used for splitting
        (new) train_values: [0] # Values in the split_column that should be associated with the training split
        (new) validation_values: [1] # Values in the split_column that should be associated with the validation split
        (new) test_values: [2] # Values in the split_column that should be associated with the test split

Note: we may want to revisit this API if/when we support multiple test sets.
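
To make the intended semantics of the proposed keys concrete, here is a minimal sketch of how split_column, train_values, validation_values, and test_values could be interpreted. It is an assumption-laden illustration rather than a proposed implementation: the function name and the row representation are hypothetical, and the handling of overlapping values (an error) and of unlisted values (dropped) is a guess, since the proposal above does not specify either.

```python
def split_by_column_values(rows, split_column, train_values,
                           validation_values, test_values):
    """Route each row to a split based on its value in split_column.

    Hypothetical helper: `rows` is a list of dicts, one dict per example.
    """
    value_sets = {
        "train": set(train_values),
        "validation": set(validation_values),
        "test": set(test_values),
    }

    # Assumption: a value listed for more than one split is a config error.
    names = list(value_sets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = value_sets[first] & value_sets[second]
            if overlap:
                raise ValueError(
                    f"values {sorted(overlap, key=repr)} appear in both "
                    f"{first}_values and {second}_values"
                )

    splits = {name: [] for name in value_sets}
    for row in rows:
        value = row[split_column]
        for name, values in value_sets.items():
            if value in values:
                splits[name].append(row)
                break
        # Assumption: rows whose value is not listed anywhere are dropped.
    return splits
```

With the defaults from the config above (split_column: split, train_values: [0], validation_values: [1], test_values: [2]), this reduces to the current fixed-column behavior.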


This was referenced Jun 10, 2022

RFC: Reorganize split configuration to support time-based split strategy #2129

Closed

Restructured split config and added datetime splitting #2132

Merged

mhabedank closed this as not planned on Oct 17, 2024
