
Classification & Regression multilabel #815

Merged · 14 commits merged into main from psi/multilabel · Aug 12, 2024

Conversation

@psinger (Collaborator) commented Aug 7, 2024

In this PR we add multi-target support for both classification and regression.

The following code adaptations have been made:

  • For classification and regression, the answer column can now be a selection of multiple targets, changing the setting to a tuple
  • Multi-label will only work with BCE loss
  • For plotting and visualizations, the target columns are concatenated as a string
  • The same applies for predictions
  • Classification predictions are now consistently post-processed in postprocess_output instead of in the individual metrics
  • The validation csv file now contains the probabilities instead of the hard predictions
  • The validation pickle file now contains logits, probabilities and predictions
  • For regression, we set the regression head size by the number of answer columns (see the sketch below)
  • For classification, num_classes still needs to be set; it would potentially be easier to derive it automatically as well, which would also allow getting rid of all the related error checks
  • Added metric tests for regression and adjusted them for classification
  • Adjusted integration tests

When reviewing, the main potential for bugs is with respect to post-processing, loss, and metrics.
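To make the head sizing, loss, and post-processing behavior concrete, below is a minimal sketch of what the multi-label setup amounts to, assuming a PyTorch backbone that yields pooled hidden states; the column names, hidden size, and threshold are illustrative assumptions, not the PR's actual code:

import torch
import torch.nn as nn

# Hypothetical multi-target setting; column names are made up for illustration.
answer_column = ("toxic", "obscene", "insult")

hidden_size = 768  # assumed backbone hidden size
# Head size is derived from the number of answer columns (for classification,
# num_classes still has to be set explicitly in the actual config).
head = nn.Linear(hidden_size, len(answer_column))
loss_fn = nn.BCEWithLogitsLoss()  # multi-label only works with BCE

pooled = torch.randn(4, hidden_size)  # batch of 4 pooled representations
targets = torch.randint(0, 2, (4, len(answer_column))).float()

logits = head(pooled)
loss = loss_fn(logits, targets)

probabilities = torch.sigmoid(logits)        # what ends up in the validation csv
predictions = (probabilities > 0.5).float()  # hard predictions, kept in the pickle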

Closes #805 ([FEATURE] Multilabel classification)

@psinger marked this pull request as ready for review on August 8, 2024 at 19:30
preds = []
for col in np.arange(len(cfg.dataset.answer_column)):
    preds.append(
        np.round(output["predictions"][:, col].cpu().numpy(), 3).astype(str)
    )

@pascal-pfeiffer (Collaborator):

specific reason for the rounding here?

@psinger (Collaborator, Author):

yeah this is shown in the visualizations and with lots of digits it's not great to read

@pascal-pfeiffer (Collaborator) commented Aug 12, 2024:

Then, why not truncate in the visualization? The current implementation also rounds the downloadable predictions.

Actually, this even impacts metric calculation.

@psinger (Collaborator, Author) commented Aug 12, 2024:

It doesn't, because output["predicted_text"] is not used there - and truncating in the vis is difficult because it supports all kinds of texts.

@pascal-pfeiffer (Collaborator):

Ah, right, that was changed in the PR. Still, it is odd for the exported dataframe to have rounded values.

@psinger (Collaborator, Author):

Hmh, debatable. Would be tricky to add something different there. The pickle should be used anyway for exact values.
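For what it's worth, consuming the exact values would then look roughly like this; the file names below are hypothetical, but the keys follow the logits/probabilities/predictions layout described in the PR summary:

import pickle

import pandas as pd

# Hypothetical file names. Per the PR description, the validation pickle holds
# logits, probabilities and predictions; the csv holds the rounded display values.
with open("validation_raw_predictions.pkl", "rb") as f:
    raw = pickle.load(f)

exact_probabilities = raw["probabilities"]                # full precision
rounded_view = pd.read_csv("validation_predictions.csv")  # human-readable, rounded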

@pascal-pfeiffer (Collaborator) left a review:

All looks very clean and works well for the usual cases, thank you!

I think we might want to improve a bit on the documentation and on error handling, especially for CLI users.

The required change to use a list/tuple format for the answer column is a bit unfortunate, as the error message is rather cryptic for CLI users when using previously well-working yamls:

Traceback (most recent call last):
  File "/home/xxx/h2o-llmstudio/train.py", line 722, in <module>
    run(cfg=cfg)
  File "/home/xxx/h2o-llmstudio/train.py", line 530, in run
    train_dataset = get_train_dataset(train_df=train_df, cfg=cfg)
  File "/home/xxx/h2o-llmstudio/llm_studio/src/utils/data_utils.py", line 396, in get_train_dataset
    train_dataset: Dataset = cfg.dataset.dataset_class(
  File "/home/xxx/h2o-llmstudio/llm_studio/src/datasets/text_causal_classification_ds.py", line 19, in __init__
    check_for_non_int_answers(cfg, df)
  File "/home/xxx/h2o-llmstudio/llm_studio/src/datasets/text_causal_classification_ds.py", line 104, in check_for_non_int_answers
    x for x in df[column].values if not is_castable_to_int(x)
  File "/home/xxx/.local/share/virtualenvs/h2o-llmstudio-tT3gHl3a/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/xxx/.local/share/virtualenvs/h2o-llmstudio-tT3gHl3a/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'w'

We should handle that better, as it is an expected source of errors. The other improvements to the documentation can be handled in subsequent iterations.
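The KeyError: 'w' presumably comes from the old string setting being iterated character by character. As a hedged sketch of how the config could be normalized up front (the helper name is made up; this is not the repo's actual code):

def normalize_answer_column(answer_column):
    # Hypothetical helper: accept legacy string configs and fail early with a
    # readable message instead of a per-character KeyError downstream.
    if isinstance(answer_column, str):
        # Older yamls used a plain string; wrap it into the new tuple format.
        return (answer_column,)
    if isinstance(answer_column, (list, tuple)):
        return tuple(answer_column)
    raise ValueError(
        "dataset.answer_column must be a column name or a list of column "
        f"names, got {type(answer_column).__name__!r}"
    )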

Comment on lines +15 to +16
answer_column:
- binary_label
@pascal-pfeiffer (Collaborator):

This required change is a bit unfortunate, as the error message is rather cryptic for CLI users when using previously well-working yamls (same traceback as above).

Resolved (outdated) review threads:
documentation/docs/tooltips/experiments/_answer-column.mdx
tests/integration/test_causal_regression_modeling_cfg.yaml
@pascal-pfeiffer (Collaborator) left a review:

Thank you for the quick changes, lgtm!

@psinger merged commit aff8044 into main on Aug 12, 2024 (4 checks passed)
@psinger deleted the psi/multilabel branch on August 12, 2024 at 15:37