
Add preprocess the pile AI example #82

Merged — 29 commits from ehsan/preprocess_pile_example into main, Jan 3, 2025
Conversation

ehsantn (Collaborator) commented Dec 19, 2024

Changes included in this PR

Adds an example demonstrating data preprocessing for AI use cases using Bodo. It preprocesses the Pile dataset from Hugging Face.
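For context, a minimal sketch of the shape of such a pipeline (column names, the hash-based dedup, and the `tokenize_data` stand-in are assumptions inferred from the review snippet further down, not necessarily the example's exact code):

```python
import bodo
import pandas as pd

def tokenize_data(row):
    # Stand-in tokenizer for this sketch; the example uses a Hugging Face
    # tokenizer (see the transformers dependency discussed below).
    return row["text"].split()

@bodo.jit
def preprocess(df, out_file):
    # Deduplicate rows via a hash of the text column, then tokenize each
    # row and write the result as JSON Lines.
    df["text_hash"] = df["text"].map(hash)
    df = df.drop_duplicates(subset=["text_hash"])
    df = df.drop("text_hash", axis=1)
    processed_data = df.apply(tokenize_data, axis=1)
    processed_data.to_json(out_file, orient="records", lines=True)
```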

Testing strategy

Tested manually with sample data locally. Still needs testing on larger AWS instances with the full dataset.

User facing changes

A new example.

Checklist

  • [N/A] Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed and run pre-commit hooks.

ehsantn (Collaborator, Author) commented Dec 19, 2024

Waiting for #80 and #81 to test at scale.

@ehsantn ehsantn changed the base branch from main to ehsan/jit_wrapper December 27, 2024 20:24
@ehsantn ehsantn marked this pull request as ready for review December 27, 2024 20:31
ehsantn (Collaborator, Author) commented Dec 27, 2024

@scott-routledge2 could you test with full data on the platform and collect performance data? You can install the Hugging Face dependencies using pixi add datasets=3.2.0 transformers=4.47.1.
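For reference:

```bash
pixi add datasets=3.2.0 transformers=4.47.1
```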

scott-routledge2 (Contributor) commented

> @scott-routledge2 could you test with full data on the platform and collect performance data? You can install the Hugging Face dependencies using pixi add datasets=3.2.0 transformers=4.47.1.

What is meant by "full data"? Is there a different dataset name or split?

ehsantn (Collaborator, Author) commented Dec 27, 2024

The "monology/pile-uncopyrighted" is the full data which is already in the code. Maybe remove the "revision" argument if it doesn't have all the data.

ehsantn (Collaborator, Author) commented Dec 27, 2024

Removed.

Base automatically changed from ehsan/jit_wrapper to main December 27, 2024 22:34
scott-routledge2 (Contributor) commented Jan 2, 2025

Results from running 1/30 files (5,899,215 examples) on a single-node r6i.16xlarge instance:

  • Total e2e script time: 1817.96 s (with the dataset already cached)
  • Total execution time (inside the top-level Bodo JIT): 1406.97 s (~23 minutes)

I gave up on running pandas on this file since it ran for 2+ hours, so I tried a smaller dataset (100,000 examples) first. There, Bodo is ~10x faster than pandas:

  Run                            Execution time (s)   E2E script time (s)
  Bodo (first run)               24.94                47.11
  Bodo (second run, cache=True)  24.58                36.10
  pandas                         317.73               319.27
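For reference, cache=True above refers to Bodo's on-disk compilation cache; a minimal sketch (the function body is illustrative):

```python
import bodo

# cache=True persists compiled binaries to disk, so repeat runs skip JIT
# compilation. This matches the second run above: e2e time drops
# (~47 s -> ~36 s) while execution time stays roughly the same.
@bodo.jit(cache=True)
def preprocess(df):
    return df.drop_duplicates(subset=["text_hash"])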

scott-routledge2 (Contributor) left a comment

LGTM!

```python
# Deduplicate rows by text hash, then drop the helper column.
df = df.drop_duplicates(subset=["text_hash"])
df = df.drop("text_hash", axis=1)
# Tokenize each row and write the result as JSON Lines.
processed_data = df.apply(tokenize_data, axis=1)
processed_data.to_json(out_file, orient="records", lines=True)
```
Contributor replied

For the full dataset it might make sense to write to Parquet files as opposed to a single JSONL file.
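For instance, a hypothetical sketch reusing the names from the snippet above (the column name and output path are assumptions):

```python
# Write partitioned Parquet instead of one JSONL file; with Bodo, a
# directory path lets each core write its own part file in parallel.
processed_data.to_frame(name="tokens").to_parquet("pile_processed.pq")
```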

ehsantn (Collaborator, Author) replied

This is from the original example (I assumed the downstream system expects JSONL). Do we write separate JSONL files per core?

Contributor replied

Makes sense, and no, it looks like there is a single JSONL file at the end.

strangeloopcanon (Contributor) commented

LGTM. Could we add a Dask version too, to compare?

ehsantn (Collaborator, Author) commented Jan 3, 2025

> LGTM. Could we add a Dask version too, to compare?

I think so. Please open an issue. I'd compare against Dask after the Hugging Face data load is implemented.
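For the record, a hypothetical sketch of what that comparison might look like with dask.dataframe (file paths, column names, and the stand-in tokenizer are all assumptions, not part of this PR):

```python
import dask.dataframe as dd

# Same dedup + tokenize pipeline expressed with Dask.
ddf = dd.read_json("pile_sample.jsonl", lines=True, blocksize="256MB")
ddf["text_hash"] = ddf["text"].map(hash, meta=("text_hash", "int64"))
ddf = ddf.drop_duplicates(subset=["text_hash"]).drop(columns=["text_hash"])
ddf["tokens"] = ddf["text"].map(lambda t: t.split(), meta=("tokens", object))
# Dask writes one JSON Lines file per partition into the output directory.
ddf.to_json("pile_processed", orient="records", lines=True)
```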

@ehsantn ehsantn merged commit 51003ac into main Jan 3, 2025
8 checks passed
@ehsantn ehsantn deleted the ehsan/preprocess_pile_example branch January 3, 2025 01:59