Add preprocess the pile AI example #82
Conversation
@scott-routledge2 could you test with the full data on the platform and collect performance data? You can install the Hugging Face dependencies using:
What is meant by "full data"? Is there a different dataset name or split?
"monology/pile-uncopyrighted" is the full dataset, which is already in the code. Maybe remove the "revision" argument if it doesn't include all the data.
Removed.
Results from running 1/30 files (5,899,215 examples) on a single node (r6i.16xlarge instance):
- Total e2e script time: 1817.96s (with the dataset already cached).
- Gave up on running pandas since it ran for 2+ hours, so I tried a smaller dataset first: on 100,000 examples, Bodo is ~10x faster than pandas.
- Second run with cache=True: Pandas:
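For context, the end-to-end timing reported above can be collected with a simple wall-clock wrapper around the pipeline. This is a minimal sketch, not the PR's actual benchmarking code; the `preprocess` function and the tiny DataFrame are placeholders:

```python
import time

import pandas as pd

def preprocess(df):
    # Placeholder for the real pipeline (dedup + tokenize + write).
    return df.drop_duplicates()

# Tiny stand-in dataset; the real run used millions of examples.
df = pd.DataFrame({"text": ["a"] * 1000 + ["b"] * 1000})

start = time.perf_counter()
out = preprocess(df)
elapsed = time.perf_counter() - start
print(f"Total e2e script time: {elapsed:.4f}s, rows kept: {len(out)}")
```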
LGTM!
```python
df = df.drop_duplicates(subset=["text_hash"])
df = df.drop("text_hash", axis=1)
processed_data = df.apply(tokenize_data, axis=1)
processed_data.to_json(out_file, orient="records", lines=True)
```
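The snippet above deduplicates on a precomputed hash column and then tokenizes each row. A self-contained sketch of that pattern is below; the `tokenize_data` stand-in and the hashing step are illustrative assumptions, not the PR's actual implementation (which presumably uses a Hugging Face tokenizer):

```python
import hashlib

import pandas as pd

def tokenize_data(row):
    # Hypothetical stand-in for the example's tokenizer.
    return {"tokens": row["text"].split()}

df = pd.DataFrame({"text": ["hello world", "foo bar", "hello world"]})

# Hash each document so duplicates can be dropped cheaply on a small key.
df["text_hash"] = df["text"].map(
    lambda t: hashlib.sha256(t.encode()).hexdigest()
)
df = df.drop_duplicates(subset=["text_hash"])
df = df.drop("text_hash", axis=1)

processed = df.apply(tokenize_data, axis=1)
print(len(processed))  # 2 (one duplicate row removed)
```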
For the full dataset it might make sense to write Parquet files instead of a single JSONL file.
This is from the original example (I assumed the downstream system expects JSONL). Do we write separate JSONL files per core?
Makes sense. And no, it looks like there is a single JSONL file at the end.
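If the output ever needed to be split (e.g. one file per core, as asked above), the write step could emit one JSONL file per partition instead of a single file. This is a sketch of that idea, not part of the PR; the file naming and chunking are illustrative:

```python
import math
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"text": [f"doc {i}" for i in range(10)]})

out_dir = tempfile.mkdtemp()
n_parts = 3  # hypothetical partition count (e.g. number of cores)
rows_per_part = math.ceil(len(df) / n_parts)

# Write one JSONL file per partition, mirroring a per-core writer.
for i in range(n_parts):
    part = df.iloc[i * rows_per_part:(i + 1) * rows_per_part]
    part.to_json(
        os.path.join(out_dir, f"part-{i:05d}.jsonl"),
        orient="records",
        lines=True,
    )

print(sorted(os.listdir(out_dir)))
```

A downstream consumer can then read the shards back and concatenate them, so no information is lost relative to a single JSONL file.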
LGTM. Could we add a version to test this on Dask too, for comparison?
I think so. Please open an issue. I'd compare to Dask after the Hugging Face data load is implemented.
Changes included in this PR
Adds an example demonstrating how to do data preprocessing for AI use cases using Bodo. It preprocesses "The Pile" dataset from Hugging Face.
Testing strategy
Tested manually with sample data locally. Still needs testing on larger AWS instances with the full data.
User facing changes
A new example.
Checklist
- Include [run CI] in your commit message.