Add preprocess the pile AI example #82
Conversation
@scott-routledge2 could you test with the full data on the platform and collect performance data? You can install the Hugging Face dependencies using:
What is meant by "full data"? Is there a different dataset name or split?
"monology/pile-uncopyrighted" is the full dataset, which is already in the code. Maybe remove the "revision" argument if it doesn't include all the data.
Removed.
Results from running 1/30 files (5,899,215 examples) on a single node (r6i.16xlarge instance):
- Total e2e script time: 1817.96s (with the dataset already cached).
- Gave up on running pandas since it ran for 2+ hours, so I tried a smaller dataset first: on 100,000 examples, Bodo is ~10x faster than pandas.
- Second run with cache=True: Pandas:
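For context, the end-to-end timing reported above can be collected with a simple wall-clock wrapper around the pipeline. This is a minimal sketch, not the PR's actual benchmarking code; the `preprocess` function and the tiny DataFrame are placeholders:

```python
import time

import pandas as pd

def preprocess(df):
    # Placeholder for the real pipeline (dedup + tokenize + write).
    return df.drop_duplicates()

# Tiny stand-in dataset; the real run used millions of examples.
df = pd.DataFrame({"text": ["a"] * 1000 + ["b"] * 1000})

start = time.perf_counter()
out = preprocess(df)
elapsed = time.perf_counter() - start
print(f"Total e2e script time: {elapsed:.4f}s, rows kept: {len(out)}")
```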
LGTM!
```python
df = df.drop_duplicates(subset=["text_hash"])
df = df.drop("text_hash", axis=1)
processed_data = df.apply(tokenize_data, axis=1)
processed_data.to_json(out_file, orient="records", lines=True)
```
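The snippet above deduplicates on a precomputed hash column and then tokenizes each row. A self-contained sketch of that pattern is below; the `tokenize_data` stand-in and the hashing step are illustrative assumptions, not the PR's actual implementation (which presumably uses a Hugging Face tokenizer):

```python
import hashlib

import pandas as pd

def tokenize_data(row):
    # Hypothetical stand-in for the example's tokenizer.
    return {"tokens": row["text"].split()}

df = pd.DataFrame({"text": ["hello world", "foo bar", "hello world"]})

# Hash each document so duplicates can be dropped cheaply on a small key.
df["text_hash"] = df["text"].map(
    lambda t: hashlib.sha256(t.encode()).hexdigest()
)
df = df.drop_duplicates(subset=["text_hash"])
df = df.drop("text_hash", axis=1)

processed = df.apply(tokenize_data, axis=1)
print(len(processed))  # 2 (one duplicate row removed)
```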
For the full dataset it might make sense to write Parquet files instead of a single JSONL file.
This is from the original example (I assumed the downstream system expects JSONL). Do we write separate JSONL files per core?
Makes sense. And no, it looks like there is a single JSONL file at the end.
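If the output ever needed to be split (e.g. one file per core, as asked above), the write step could emit one JSONL file per partition instead of a single file. This is a sketch of that idea, not part of the PR; the file naming and chunking are illustrative:

```python
import math
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"text": [f"doc {i}" for i in range(10)]})

out_dir = tempfile.mkdtemp()
n_parts = 3  # hypothetical partition count (e.g. number of cores)
rows_per_part = math.ceil(len(df) / n_parts)

# Write one JSONL file per partition, mirroring a per-core writer.
for i in range(n_parts):
    part = df.iloc[i * rows_per_part:(i + 1) * rows_per_part]
    part.to_json(
        os.path.join(out_dir, f"part-{i:05d}.jsonl"),
        orient="records",
        lines=True,
    )

print(sorted(os.listdir(out_dir)))
```

A downstream consumer can then read the shards back and concatenate them, so no information is lost relative to a single JSONL file.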
LGTM. Could we add a version to test this on Dask too, for comparison?
I think so. Please open an issue. I'd compare to Dask after the Hugging Face data load is implemented.
Changes included in this PR
Adds an example demonstrating how to do data preprocessing for AI use cases using Bodo. It preprocesses "The Pile" dataset from Hugging Face.
Testing strategy
Tested manually with sample data locally. Still needs testing on larger AWS instances with the full data.
User facing changes
A new example.
Checklist
- Include [run CI] in your commit message.