Do tokenization preprocessing in a process pool #57

talolard · 2020-03-22T14:28:39Z

For larger datasets, running tokenization on a single core is wasteful when many cores are available. I'd suggest either wrapping the relevant function in a process pool, or passing the pool in as an argument and calling Pool.map.

Happy to open a PR if it's a good fit for the repo.

vampire/scripts/preprocess_data.py

Line 26 in 2613609

with tqdm(open(data_path, "r"), desc=f"loading {data_path}") as f:
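Roughly what I have in mind, as a minimal sketch rather than a drop-in patch. The names tokenize_line and tokenize_lines, the num_workers and chunksize parameters, and the whitespace tokenizer are all hypothetical stand-ins; the real script would plug in whatever tokenizer it already uses.

```python
from multiprocessing import Pool
from typing import List


def tokenize_line(line: str) -> str:
    # Hypothetical stand-in tokenizer: lowercase and split on whitespace.
    # It must live at module level so worker processes can pickle it.
    return " ".join(line.strip().lower().split())


def tokenize_lines(lines: List[str], num_workers: int = 1,
                   chunksize: int = 1000) -> List[str]:
    # Keep the single-core path so the default behaviour does not change.
    if num_workers <= 1:
        return [tokenize_line(line) for line in lines]
    # Pool.map returns results in input order, so output lines still
    # line up with the input file. chunksize batches lines per task to
    # cut down on inter-process overhead.
    with Pool(processes=num_workers) as pool:
        return pool.map(tokenize_line, lines, chunksize=chunksize)


if __name__ == "__main__":
    sample = ["Hello   World", "Tokenize ME  in parallel"]
    print(tokenize_lines(sample, num_workers=2, chunksize=1))
```

Defaulting num_workers to 1 keeps existing behaviour, so multi-core tokenization would be opt-in via a flag.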

talolard mentioned this issue Mar 22, 2020

feat(tokenization): Added flag to run tokenization on multiple cores #58

Open
