additional logging to get maximum token length of a sequence in the dataset #1066

winglian · 2024-01-08T23:52:23Z

No description provided.

src/axolotl/utils/trainer.py

NanoCode012 · 2024-01-10T04:05:42Z

+            max_input_len = np.max(get_dataset_lengths(train_dataset))
+            LOG.debug(f"max_input_len: {max_input_len}", main_process_only=True)
+
+        train_dataset = train_dataset.filter(drop_long, num_proc=cfg.dataset_processes)


Is it possible to count how many samples are dropped?

Yeah, let's slate that for another feature/PR. This change only reorders the steps so the max token length is determined before anything longer is dropped.
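For reference, the two points in this thread (measure the maximum length before filtering, and count what the filter removes) can be sketched in plain Python. This is only an illustration and not the code in `src/axolotl/utils/trainer.py`: `drop_long_with_stats` is a hypothetical helper, samples are modelled as plain lists of token ids, and the cutoff `len(sample) <= sequence_len` is an assumption about what `drop_long` checks.

```python
from typing import Iterable, List, Sequence, Tuple


def drop_long_with_stats(
    samples: Iterable[Sequence[int]], sequence_len: int
) -> Tuple[List[Sequence[int]], int, int]:
    """Hypothetical helper: return (kept_samples, max_input_len, num_dropped)."""
    samples = list(samples)

    # Measure first. If this ran after the filter, the reported maximum
    # could never exceed sequence_len and would hide oversized samples.
    max_input_len = max((len(s) for s in samples), default=0)

    # Assumed cutoff: keep samples that fit within sequence_len tokens.
    kept = [s for s in samples if len(s) <= sequence_len]

    # The dropped count is the size difference before and after filtering.
    num_dropped = len(samples) - len(kept)
    return kept, max_input_len, num_dropped


if __name__ == "__main__":
    data = [[1, 2, 3], [1], [1, 2, 3, 4, 5]]
    kept, max_len, dropped = drop_long_with_stats(data, sequence_len=3)
    print(f"max_input_len: {max_len}, dropped: {dropped}, kept: {len(kept)}")
```

With a real dataset object, the same count could likely be obtained by comparing `len(train_dataset)` before and after the `filter` call, but that is left for the follow-up PR mentioned above.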

winglian added 2 commits January 8, 2024 15:54

additional logging to get maximum token length of a sequence in the dataset

b88534b

fix ordering to properly determine the max_len of tokens before dropping anything longer

86882cd

winglian requested a review from NanoCode012 January 9, 2024 01:33

NanoCode012 reviewed Jan 9, 2024

winglian requested a review from NanoCode012 January 10, 2024 04:00

NanoCode012 reviewed Jan 10, 2024

NanoCode012 approved these changes Jan 10, 2024

winglian merged commit 2f2582e into main Jan 10, 2024
6 checks passed

winglian deleted the log-token-len branch January 10, 2024 05:49

winglian mentioned this pull request Jan 22, 2024

Vram fix attempt #1164

Merged
