Add benchmarking suite for all backends and formats #533

Merged
merged 2 commits into from
Dec 15, 2023

@knighton knighton commented Dec 15, 2023

This benchmark serves to drive backend development, which is coming in the next PRs. As each backend is implemented, we will plug it in here.

In Streaming parlance, backends are foreign systems that StreamingDataset calls out to or reads directly in order to serve samples.

$ time p3 -m benchmarks.backends.write
Generate: 717.963 sec.                                                                               
Found directory at data/backends/, wiping it for reuse

Write split: small
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─
    │ format   │     sec │      samples │  usec/sp │          bytes │  files │   bytes/file │ max bytes/file │
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─
    │ csv      │   1.519 │       32,768 │   46.344 │      2,926,965 │      3 │      975,655 │      2,795,297 │
    │ delta    │  14.511 │       32,768 │  442.834 │        965,066 │     66 │       14,622 │         29,836 │
    │ jsonl    │   1.186 │       32,768 │   36.192 │      3,549,499 │      3 │    1,183,166 │      3,417,881 │
    │ lance    │   0.077 │       32,768 │    2.350 │      2,983,593 │      4 │      745,898 │      2,983,041 │
    │ mds      │   0.166 │       32,768 │    5.073 │      2,982,097 │      2 │    1,491,048 │      2,981,772 │
    │ parquet  │   0.023 │       32,768 │    0.702 │      1,062,984 │      1 │    1,062,984 │      1,062,984 │
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─

Write split: medium
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─
    │ format   │     sec │      samples │  usec/sp │          bytes │  files │   bytes/file │ max bytes/file │
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─
    │ csv      │   3.610 │    1,048,576 │    3.443 │     93,686,651 │     23 │    4,073,332 │      8,388,602 │
    │ delta    │   4.098 │    1,048,576 │    3.908 │     29,391,561 │     66 │      445,326 │        912,199 │
    │ jsonl    │   6.550 │    1,048,576 │    6.247 │    113,610,514 │     29 │    3,917,603 │      8,388,579 │
    │ lance    │   0.552 │    1,048,576 │    0.526 │     95,495,604 │      7 │   13,642,229 │     23,878,871 │
    │ mds      │   4.497 │    1,048,576 │    4.289 │     95,456,073 │     13 │    7,342,774 │      8,388,608 │
    │ parquet  │   0.640 │    1,048,576 │    0.610 │     32,496,271 │      4 │    8,124,067 │      8,124,701 │
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─

Write split: large
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─
    │ format   │     sec │      samples │  usec/sp │          bytes │  files │   bytes/file │ max bytes/file │
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─
    │ csv      │  71.661 │   33,554,432 │    2.136 │  2,998,262,406 │    685 │    4,377,025 │      8,388,616 │
    │ jsonl    │ 174.832 │   33,554,432 │    5.210 │  3,635,816,298 │    837 │    4,343,866 │      8,388,608 │
    │ lance    │  14.238 │   33,554,432 │    0.424 │  3,056,134,893 │    131 │   23,329,273 │     23,891,240 │
    │ mds      │ 142.663 │   33,554,432 │    4.252 │  3,054,871,703 │    366 │    8,346,643 │      8,388,608 │
    │ parquet  │  20.720 │   33,554,432 │    0.617 │  1,039,974,715 │    128 │    8,124,802 │      8,129,982 │
    ─ ──────── ─ ─────── ─ ──────────── ─ ──────── ─ ────────────── ─ ────── ─ ──────────── ─ ────────────── ─

real    20m50.269s
user    20m1.404s
sys     0m28.315s
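
For anyone reading the tables above, here is a rough sketch of how the derived columns relate to the measured ones. It is not the code in `benchmarks.backends.write`; the names `dir_stats` and `table_row` are made up for illustration. It assumes `usec/sp` is wall-clock seconds divided by sample count (scaled to microseconds) and that `bytes/file` is an integer division of total bytes by file count, which matches rows such as csv small (2,926,965 / 3 = 975,655) and mds small (2,982,097 / 2 shown as 1,491,048).

```python
import os


def dir_stats(root):
    """Return (total bytes, file count, largest file size) for all files under root."""
    sizes = []
    for parent, _dirs, names in os.walk(root):
        for name in names:
            sizes.append(os.path.getsize(os.path.join(parent, name)))
    return sum(sizes), len(sizes), max(sizes, default=0)


def table_row(fmt, sec, samples, root):
    """Build one table row (hypothetical helper, column names copied from the output).

    sec is the measured write time; samples is the number of samples written.
    """
    total, files, biggest = dir_stats(root)
    return {
        'format': fmt,
        'sec': sec,
        'samples': samples,
        # Microseconds of write time per sample.
        'usec/sp': 1e6 * sec / samples,
        'bytes': total,
        'files': files,
        # Assumed to be floor division, inferred from the printed rows.
        'bytes/file': total // files if files else 0,
        'max bytes/file': biggest,
    }
```

Recomputing `usec/sp` from the printed `sec` values gives slightly different digits than the tables (for example about 46.36 instead of 46.344 for csv small), presumably because the printed seconds are already rounded.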

[Image: plot]

@knighton knighton merged commit d969cd6 into dev Dec 15, 2023
6 checks passed
@knighton knighton deleted the james/bench-backends branch December 15, 2023 04:03
karan6181 pushed a commit that referenced this pull request Jan 26, 2024
* Benchmarking all backends and formats.

* Fix (missing docstrings).