Fix(tokenizing): Use multi-core #293

NanoCode012 · 2023-07-19T01:50:52Z

This speeds up tokenization by using multiple cores and adds a progress bar.
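For readers who want the general idea, here is a minimal sketch of multi-core tokenization with progress reporting, using only the Python standard library. It is not the code from this PR: `toy_tokenize` and `tokenize_all` are hypothetical names, and the whitespace tokenizer is a stand-in for a real one.

```python
import multiprocessing as mp
import os
import sys


def toy_tokenize(text):
    """Stand-in tokenizer (hypothetical): lowercase and split on whitespace."""
    return {"input_ids": text.lower().split()}


def tokenize_all(texts, num_proc=None, chunksize=64):
    """Tokenize every text in a list, spreading work over num_proc processes."""
    num_proc = num_proc or os.cpu_count() or 1
    if num_proc == 1:
        # Serial fallback: same result, no worker processes.
        return [toy_tokenize(t) for t in texts]

    results = []
    total = len(texts)
    with mp.Pool(processes=num_proc) as pool:
        # imap keeps the input order and yields results incrementally,
        # so progress can be reported while the workers are still running.
        for done, item in enumerate(pool.imap(toy_tokenize, texts, chunksize), 1):
            results.append(item)
            if done % 1000 == 0 or done == total:
                print(f"\rtokenized {done}/{total}", end="", file=sys.stderr)
    print(file=sys.stderr)
    return results
```

On platforms that start worker processes with spawn (Windows, and macOS by default), call `tokenize_all` with `num_proc > 1` from under an `if __name__ == "__main__":` guard.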

Benchmark:

dataset	old	new
teknium/GPT4-LLM-Cleaned	~70s	< 10s
WizardLM/WizardLM_evol_instruct_V2_196k	~5 min	~55s

Note: these are rough timings from informal measurement, not a strict benchmark.

Future PRs could improve speed further by:

Using batching (requires changing the class so it no longer uses iter).
Fixing the Dataset.from_list and shuffle step (do we need to save to a list at all? Saving directly to a dataset would cut this step out).
Improving ConstantIterList.
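As a rough illustration of the batching idea in the list above, the sketch below tokenizes rows a batch at a time instead of one call per row. The names `batched`, `toy_tokenize_batch`, and `tokenize_in_batches` are hypothetical, and the class change mentioned above is not shown.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def toy_tokenize_batch(texts):
    """Hypothetical batch tokenizer: one call handles a whole list of texts."""
    return [text.lower().split() for text in texts]


def tokenize_in_batches(texts, batch_size=1000):
    """Tokenize texts batch by batch, preserving the input order."""
    tokenized = []
    for batch in batched(texts, batch_size):
        tokenized.extend(toy_tokenize_batch(batch))
    return tokenized
```

Batching mainly pays off when the tokenizer has a per-call overhead that can be shared across many rows.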

Full credit to: neverendingtoast

winglian

🔥

NanoCode012 · 2023-07-19T02:07:36Z

One note: this removes the error that was raised when the dataset is empty. We can simply add that check to the data.py file instead.
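A minimal sketch of such a check, assuming the dataset supports `len()`. The function name `ensure_not_empty` and the error message are hypothetical, not taken from data.py.

```python
def ensure_not_empty(dataset, name="dataset"):
    """Raise early if the dataset has no rows; otherwise return it unchanged."""
    if len(dataset) == 0:
        raise ValueError(f"{name} is empty")
    return dataset
```

Failing at load time gives a clearer error than letting an empty dataset reach training.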

feat: use multi-core

45ac7c4

winglian approved these changes Jul 19, 2023

NanoCode012 merged commit 28fd429 into axolotl-ai-cloud:main Jul 19, 2023
3 checks passed

NanoCode012 deleted the fix/tokenize-speed branch July 19, 2023 02:02

winglian added the enhancement label Jul 22, 2023

mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023

Merge pull request axolotl-ai-cloud#293 from NanoCode012/fix/tokenize…

d29c002

…-speed Fix(tokenizing): Use multi-core
