[BSE-4385] Add locally runnable benchmarks #65
Conversation
@marquisdepolis if you have the chance can you follow the instructions in the README to make sure the scripts run for you?
LGTM, thanks Scott
I'd add:
Didn't run for me yet, environment issue, but not blocking the release.
@scott-routledge2 let's sync on this tomorrow morning.
16d73d6 to 5c0df31
Move local version into README structure:
We used a smaller subset of the [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) to allow the workload to run locally on an Apple M2 Macbook Pro with 10 cores and 16 GB memory. Even at this smaller scale, Bodo shows a modest improvement over the next best system (Dask).
Not sure if we want to include hard numbers here because they could change based on things like network speed etc. I think the better comparison for the local benchmark is being able to handle more data (the script lets you play with different data sizes pretty easily). Later we could graph memory usage or maximum number of rows processed for different systems.
I'd still show the hard numbers, even with a caveat.
Maybe some hard numbers here would be useful too. We can position them properly.
benchmarks/nyc_taxi/run_local.py
Outdated
assert len(files) == 1, "Spark benchmark expects a single path argument."
return files[0].replace("s3://", "s3a://").replace("fhvhv/", "fhvhv-rewrite/")
PySpark has slightly different syntax for reading multiple parquet files, so we'd have to modify the script to allow this. I think it's fine to just cap it at one, since it can't handle more than one parquet file from this dataset anyway (it runs out of memory).
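Putting the two diff fragments above together, a runnable sketch of the capped Spark input helper (the function name and example bucket are hypothetical; the assertion and path rewrite are taken from the diff):

```python
def spark_input_path(files):
    """Return the single input path, rewritten for Spark.

    Spark's Hadoop-based S3 connector uses the s3a:// scheme rather than
    s3://, and the local benchmark caps Spark at one Parquet file since it
    runs out of memory on more.
    """
    # Cap at a single file (multi-file PySpark reads would need script changes).
    assert len(files) == 1, "Spark benchmark expects a single path argument."
    return files[0].replace("s3://", "s3a://").replace("fhvhv/", "fhvhv-rewrite/")

# Hypothetical example path, for illustration only:
print(spark_input_path(["s3://example-bucket/fhvhv/trips_2019-01.parquet"]))
# → s3a://example-bucket/fhvhv-rewrite/trips_2019-01.parquet
```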
Thanks @scott-routledge2. See comments. The cluster configuration is the main one.
benchmarks/README.md
Outdated
@@ -0,0 +1,63 @@
# Benchmarks
## Monthly High Volume for Hire Vehicle Trips with Precipitation
I think the name of the benchmark doesn't need to be tied to the dataset name. Same code runs on other types of the NYC Taxi data with minor changes.
Suggested change:
- ## Monthly High Volume for Hire Vehicle Trips with Precipitation
+ ## NYC Taxi Monthly Trips with Precipitation
benchmarks/README.md
Outdated
## Monthly High Volume for Hire Vehicle Trips with Precipitation

For this benchmark, we adapt a [SQL query](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1) into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips based on features like weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.
SQL is actually a bit negative in this context. I think the original code was actually in R and was rewritten into SQL later.
Suggested change:
- For this benchmark, we adapt a [SQL query](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1) into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips based on features like weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.
+ For this benchmark, we adapt an [example data science workload](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1) into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips based on features like weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.
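For readers skimming the thread, a toy sketch of the kind of pandas aggregation the README describes, run on synthetic data (the column names and the rain threshold here are illustrative, not the benchmark's actual schema):

```python
import pandas as pd

# Tiny synthetic stand-ins for the trip and weather tables.
trips = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-01-07 08:00", "2019-01-07 09:00", "2019-01-12 10:00"]),
    "dropoff": pd.to_datetime(["2019-01-07 08:30", "2019-01-07 09:20", "2019-01-12 10:45"]),
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-07", "2019-01-12"]),
    "precipitation": [0.0, 0.4],
})

# Derive the grouping features: trip date, duration, and weekday flag.
trips["date"] = trips["pickup"].dt.normalize()
trips["duration_min"] = (trips["dropoff"] - trips["pickup"]).dt.total_seconds() / 60
trips["weekday"] = trips["pickup"].dt.dayofweek < 5  # Mon-Fri

# Join trips to that day's weather, then aggregate duration and trip count.
merged = trips.merge(weather, on="date")
merged["rainy"] = merged["precipitation"] > 0.1
stats = merged.groupby(["weekday", "rainy"])["duration_min"].agg(["mean", "count"])
print(stats)
```

The real workload groups by more features (pickup and dropoff location, month) and reads Parquet/CSV from S3, but the merge-then-groupby shape is the same.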
benchmarks/README.md
Outdated
### Dataset

The New York City Taxi and Limousine Commission's [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)(FHVHV) consists of over one billion trips taken by "for hire vehicles" including Uber and Lyft. To get the weather on a given day, we use a separate dataset of [Central Park weather observations](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv). The For Hire Vehicle High Volume dataset consists of 1,036,465,968 rows and 24 columns. The Central Park Weather dataset consists of 5,538 rows and 9 columns.
Just a space needed:
Suggested change:
- The New York City Taxi and Limousine Commission's [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)(FHVHV) consists of over one billion trips taken by "for hire vehicles" including Uber and Lyft. To get the weather on a given day, we use a separate dataset of [Central Park weather observations](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv). The For Hire Vehicle High Volume dataset consists of 1,036,465,968 rows and 24 columns. The Central Park Weather dataset consists of 5,538 rows and 9 columns.
+ The New York City Taxi and Limousine Commission's [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) (FHVHV) consists of over one billion trips taken by "for hire vehicles" including Uber and Lyft. To get the weather on a given day, we use a separate dataset of [Central Park weather observations](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv). The For Hire Vehicle High Volume dataset consists of 1,036,465,968 rows and 24 columns. The Central Park Weather dataset consists of 5,538 rows and 9 columns.
benchmarks/README.md
Outdated
### Setting

For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.
Use active voice and other minor changes:
Suggested change:
- For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.
+ For this benchmark, we use the full FHVHV dataset stored in Parquet files on S3. The total size of this dataset is 24.7 GiB. The Central Park Weather data is stored in a single CSV file on S3 and its total size is 514 KiB.
benchmarks/README.md
Outdated
For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.

We compared Bodo's performance on this workload to other systems including [Dask](https://www.dask.org/), [Modin on Ray](https://docs.ray.io/en/latest/ray-more-libs/modin/index.html), and [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html) and observed a speedup of 20-240x. The implementations for all of these systems can be found in [`nyc_taxi`](./nyc_taxi/). Versions of the packages used are summarized below.
Suggested change:
- We compared Bodo's performance on this workload to other systems including [Dask](https://www.dask.org/), [Modin on Ray](https://docs.ray.io/en/latest/ray-more-libs/modin/index.html), and [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html) and observed a speedup of 20-240x. The implementations for all of these systems can be found in [`nyc_taxi`](./nyc_taxi/). Versions of the packages used are summarized below.
+ We compared Bodo's performance on this workload to other systems including [Dask](https://www.dask.org/), [Modin on Ray](https://docs.ray.io/en/latest/ray-more-libs/modin/index.html), and [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) and observed a speedup of 20-240x. The implementations for all of these systems can be found in [`nyc_taxi`](./nyc_taxi/). Versions of the packages used are summarized below.
benchmarks/README.md
Outdated
For cluster creation and configuration, we used the [Bodo SDK](https://docs.bodo.ai/2024.12/guides/using_bodo_platform/bodo_platform_sdk_guide/) for Bodo, Dask Cloudprovider for Dask, Ray for Modin, and AWS EMR for Spark. Scripts to configure and launch clusters for each system can be found in the same directory as the implementation.

Each benchmark was collected on a cluster containing 4 worker instances and 128 physical cores. Dask, Modin, and Spark used 4 `r6i.16xlarge` instances, each consisting of 32 physical cores and 256 GiB of memory. Dask Cloudprovider also allocated an additional `r6i.16xlarge` instance for the scheduler. Bodo was run on 4 `c6i.16xlarge` instances, each consisting of 32 physical cores and 64 GiB of memory.
The cluster configuration should be the same for all systems. It's ok if Dask uses a small extra instance for scheduler.
benchmarks/README.md
Outdated
The graph below summarizes the total execution time of each system (averaged over 3 runs). Results were last collected on December 12th, 2024.

<img src="./img/nyc_taxi.png" alt="Monthly High Volume for Hire Vehicle Trips with Precipitation Benchmark Execution Time" title="Monthly High Volume for Hire Vehicle Trips with Precipitation Average Execution Time" width="30%">
Let's use the same file path in top level README.md as well.
benchmarks/README.md
Outdated
```
python -m nyc_taxi.run_local
```

We used a smaller subset of the [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) to allow the workload to run locally on an Apple M2 Macbook Pro with 10 cores and 16 GB memory. Even at this smaller scale, Bodo shows a modest improvement over the next best system (Dask).
What is "modest improvement"? We don't want to be modest when presenting benchmarks :)
So Bodo is roughly 3x better than Dask or even regular pandas in this case. I wanted to highlight that for smaller data sizes these solutions can be roughly equivalent, and then in the next paragraph introduce the result that shows Dask OOM, which further supports what we are claiming to be good at.
I think this is a really good result and we should showcase it as such, including a chart that we can share/brag about.
Will open a follow-up to create the chart. Did we want to create a chart for OOM too? Also, should we do a comparison with pandas here as well, since it is pretty competitive on smaller datasets?
If we can, yes, let's not be shy about creating a chart to show this.
benchmarks/nyc_taxi/run_local.py
Outdated
@@ -0,0 +1,78 @@
"""Run local version of Bodo, Dask, Modin, and Pyspark benchmarks |
Suggested change:
- """Run local version of Bodo, Dask, Modin, and Pyspark benchmarks
+ """Run local version of Bodo, Dask, Modin, and PySpark benchmarks
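As context for the docstring above, a minimal sketch of how a local runner like this might dispatch between systems (the structure and names here are hypothetical, not the PR's actual `run_local.py`; only a pandas placeholder is registered so the sketch stays self-contained):

```python
import argparse
import time

def run_pandas(path):
    # Placeholder: the real benchmark reads Parquet from `path` and aggregates.
    return f"pandas ran on {path}"

# The real script would also register bodo, dask, modin, and pyspark entries.
BENCHMARKS = {"pandas": run_pandas}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Run the NYC taxi benchmark locally")
    parser.add_argument("--system", choices=sorted(BENCHMARKS), default="pandas")
    parser.add_argument("--path", default="data/fhvhv_sample.parquet")  # hypothetical default
    args = parser.parse_args(argv)
    start = time.perf_counter()
    result = BENCHMARKS[args.system](args.path)
    print(f"{args.system} finished in {time.perf_counter() - start:.2f}s")
    return result

if __name__ == "__main__":
    main()
```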
Thanks Scott
Changes included in this PR
Adds locally running benchmarks and README to benchmark folder.
Testing strategy
Ran all scripts locally.
User facing changes
None
Checklist
[run CI] in your commit message.