Efficiently get the length of the tokenized docs #1063
Merged
In the current implementation, the entire dataset has to be read in order to calculate the length of each document, which can be quite time consuming for large datasets, even though a `length` column is already created during preprocessing. Instead, if the `length` column exists in the dataset, simply load it, which can be much faster.
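
A minimal sketch of the idea, assuming a Hugging Face `datasets.Dataset`-style object; the function name and the `input_ids` field are illustrative assumptions, not the exact code in this PR:

```python
from datasets import Dataset

def get_doc_lengths(dataset: Dataset) -> list[int]:
    """Return the token length of every document in the dataset."""
    if "length" in dataset.column_names:
        # Fast path: preprocessing already stored per-document lengths,
        # so read just that one column instead of scanning every
        # tokenized document.
        return dataset["length"]
    # Slow path: fall back to reading each tokenized document and
    # counting its tokens.
    return [len(ids) for ids in dataset["input_ids"]]

ds = Dataset.from_dict({"input_ids": [[1, 2, 3], [4, 5]], "length": [3, 2]})
print(get_doc_lengths(ds))  # [3, 2], read directly from the column
```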