[BSE-4385] Add locally runnable benchmarks #65

scott-routledge2 · 2024-12-13T23:53:22Z

Changes included in this PR

Adds locally running benchmarks and README to benchmark folder.

Testing strategy

Ran all scripts locally.

User facing changes

None

Checklist

Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
I am familiar with the Contributing Guide
I have installed + ran pre-commit hooks.

scott-routledge2 · 2024-12-13T23:55:54Z

@marquisdepolis if you have the chance can you follow the instructions in the README to make sure the scripts run for you?

IsaacWarren

LGTM, thanks Scott

strangeloopcanon · 2024-12-14T00:39:28Z

I'd add:

say to Pixi install
pip install bodo
go to benchmark folder before running the python part

strangeloopcanon

Didn't run for me yet, environment issue, but not blocking the release.

ehsantn · 2024-12-15T17:46:29Z

@scott-routledge2 let's sync on this tomorrow morning.

scott-routledge2 · 2024-12-16T15:12:59Z

Move local version into nyc_taxi/local_versions
Remove pixi and add pip
Add script run_local:
call main scripts of the versions with argument file_name
changes to graph

README structure:

Name of workload
Dataset
Date of results, cluster configurations used
Input dataset size
Versions used

scott-routledge2 · 2024-12-16T22:05:19Z

benchmarks/README.md

+```
+
+We used a smaller subset of the [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) to allow the workload to run locally on an Apple M2 Macbook Pro with 10 cores and 16 GB memory. Even at this smaller scale, Bodo shows a modest improvement over the next best system (Dask). 
+


Not sure if we wanted to include hard number here because they could change based on things like network speed etc. I think the better comparison for the local benchmark is able to handle more data (which the script let's you play with different data sizes pretty easily). Later we could graph memory usage or maximum number of rows processed for different systems.

I'd still show the hard numbers, eevn with a caveat

Maybe some hard numbers here would be useful too. We can position them properly.

scott-routledge2 · 2024-12-16T22:07:51Z

benchmarks/nyc_taxi/run_local.py

+
+    assert len(files) != 1, "Spark benchmark expects a single path argument."
+
+    return files[0].replace("s3://", "s3a://").replace("fhvhv/", "fhvhv-rewrite/")


PySpark has slightly different syntax for reading multiple parquet files so we'd have to modify the script to allow this. I think it's fine to just cap it at one since it can't handle more than one parquet file from this dataset anyways (runs out of memory).

strangeloopcanon · 2024-12-16T22:12:43Z

benchmarks/README.md

+```
+
+We used a smaller subset of the [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) to allow the workload to run locally on an Apple M2 Macbook Pro with 10 cores and 16 GB memory. Even at this smaller scale, Bodo shows a modest improvement over the next best system (Dask). 
+


I'd still show the hard numbers, eevn with a caveat

ehsantn

Thanks @scott-routledge2 . See comments. The cluster configuration is the main one.

ehsantn · 2024-12-17T02:00:03Z

benchmarks/README.md

@@ -0,0 +1,63 @@
+# Benchmarks
+
+## Monthly High Volume for Hire Vehicle Trips with Precipitation


I think the name of the benchmark doesn't need to be tied to the dataset name. Same code runs on other types of the NYC Taxi data with minor changes.

Suggested change

## Monthly High Volume for Hire Vehicle Trips with Precipitation

## NYC Taxi Monthly Trips with Precipitation

ehsantn · 2024-12-17T02:02:53Z

benchmarks/README.md

+
+## Monthly High Volume for Hire Vehicle Trips with Precipitation
+
+For this benchmark, we adapt a [SQL query](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1) into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips based on features like weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.


SQL is actually a bit negative in this context. I think the original code was actually in R and was rewritten into SQL later.

Suggested change

For this benchmark, we adapt a [SQL query](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1) into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips based on features like weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.

For this benchmark, we adapt an [example data science workload](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1) into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips based on features like weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.

ehsantn · 2024-12-17T02:04:07Z

benchmarks/README.md

+
+### Dataset
+
+The New York City Taxi and Limousine Commission's [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)(FHVHV) consists of over one billion trips taken by "for hire vehicles" including Uber and Lyft. To get the weather on a given day, we use a separate dataset of [Central Park weather observations](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv). The For Hire Vehicle High Volume dataset consists of 1,036,465,968 rows and 24 columns. The Central Park Weather dataset consists of 5,538 rows and 9 columns. 


Just a space needed:

Suggested change

The New York City Taxi and Limousine Commission's [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)(FHVHV) consists of over one billion trips taken by "for hire vehicles" including Uber and Lyft. To get the weather on a given day, we use a separate dataset of [Central Park weather observations](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv). The For Hire Vehicle High Volume dataset consists of 1,036,465,968 rows and 24 columns. The Central Park Weather dataset consists of 5,538 rows and 9 columns.

The New York City Taxi and Limousine Commission's [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) (FHVHV) consists of over one billion trips taken by "for hire vehicles" including Uber and Lyft. To get the weather on a given day, we use a separate dataset of [Central Park weather observations](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv). The For Hire Vehicle High Volume dataset consists of 1,036,465,968 rows and 24 columns. The Central Park Weather dataset consists of 5,538 rows and 9 columns.

ehsantn · 2024-12-17T02:05:10Z

benchmarks/README.md

+
+### Setting
+
+For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.


Use active voice and other minor changes:

Suggested change

For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.

For this benchmark, we use the full FHVHV dataset stored in Parquet files on S3. The total size of this dataset is 24.7 GiB. The Central Park Weather data is stored in a single CSV file on S3 and its total size is 514 KiB.

ehsantn · 2024-12-17T02:05:53Z

benchmarks/README.md

+
+For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.
+
+We compared Bodo's performance on this workload to other systems including [Dask](https://www.dask.org/), [Modin on Ray](https://docs.ray.io/en/latest/ray-more-libs/modin/index.html), and [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html) and observed a speedup of 20-240x. The implementations for all of these systems can be found in [`nyc_taxi`](./nyc_taxi/). Versions of the packages used are summarized below. 


Suggested change

We compared Bodo's performance on this workload to other systems including [Dask](https://www.dask.org/), [Modin on Ray](https://docs.ray.io/en/latest/ray-more-libs/modin/index.html), and [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html) and observed a speedup of 20-240x. The implementations for all of these systems can be found in [`nyc_taxi`](./nyc_taxi/). Versions of the packages used are summarized below.

We compared Bodo's performance on this workload to other systems including [Dask](https://www.dask.org/), [Modin on Ray](https://docs.ray.io/en/latest/ray-more-libs/modin/index.html), and [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) and observed a speedup of 20-240x. The implementations for all of these systems can be found in [`nyc_taxi`](./nyc_taxi/). Versions of the packages used are summarized below.

ehsantn · 2024-12-17T02:07:30Z

benchmarks/README.md

+
+For cluster creation and configuration, we used the [Bodo SDK](https://docs.bodo.ai/2024.12/guides/using_bodo_platform/bodo_platform_sdk_guide/) for Bodo, Dask Cloudprovider for Dask, Ray for Modin, and AWS EMR for Spark. Scripts to configure and launch clusters for each system can be found in the same directory as the implementation.
+
+Each benchmark was collected on a cluster containing 4 worker instances and 128 physical cores. Dask, Modin, and Spark used 4 `r6i.16xlarge` instances, each consisting of 32 physical cores and 256 GiB of memory. Dask Cloudprovider also allocated an additional `r6i.16xlarge` instance for the scheduler. Bodo was run on 4 `c6i.16xlarge` instances, each consisting of 32 physical cores and 64 GiB of memory.


The cluster configuration should be the same for all systems. It's ok if Dask uses a small extra instance for scheduler.

ehsantn · 2024-12-17T02:08:00Z

benchmarks/README.md

+
+The graph below summarizes the total execution time of each system (averaged over 3 runs). Results were last collected on December 12th, 2024.
+
+<img src="./img/nyc_taxi.png" alt="Monthly High Volume for Hire Vehicle Trips with Precipitation Benchmark Execution Time" title="Monthly High Volume for Hire Vehicle Trips with Precipitation Average Execution Time" width="30%">


Let's use the same file path in top level README.md as well.

ehsantn · 2024-12-17T02:09:07Z

benchmarks/README.md

+python -m nyc_taxi.run_local
+```
+
+We used a smaller subset of the [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) to allow the workload to run locally on an Apple M2 Macbook Pro with 10 cores and 16 GB memory. Even at this smaller scale, Bodo shows a modest improvement over the next best system (Dask). 


What is "modest improvement"? We don't want to be modest when presenting benchmarks :)

So bodo is roughly 3x better than dask or even regular pandas in this case. I wanted to highlight that for smaller data sizes these solutions can be roughly equivalent and then in the next paragraph introduce the result that shows Dask OOM which further supports what we are claiming to be good at .

I think this is a really good result and we should showcase it as such, including a chart that we can share/ brag about

Will open up a follow up to create the chart. Did we want to create a chart for OOM too? Also should we do a comparison with pandas here as well since it is pretty competitive on smaller datasets.

If we can, yes, let's not be shy about creating a chart to show this.

ehsantn · 2024-12-17T02:09:44Z

benchmarks/README.md

+```
+
+We used a smaller subset of the [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) to allow the workload to run locally on an Apple M2 Macbook Pro with 10 cores and 16 GB memory. Even at this smaller scale, Bodo shows a modest improvement over the next best system (Dask). 
+


Maybe some hard numbers here would be useful too. We can position them properly.

ehsantn · 2024-12-17T14:35:10Z

benchmarks/nyc_taxi/run_local.py

@@ -0,0 +1,78 @@
+"""Run local version of Bodo, Dask, Modin, and Pyspark benchmarks


Suggested change

"""Run local version of Bodo, Dask, Modin, and Pyspark benchmarks

"""Run local version of Bodo, Dask, Modin, and PySpark benchmarks

IsaacWarren

Thanks Scott

scott-routledge2 requested review from strangeloopcanon, IsaacWarren and ehsantn December 13, 2024 23:54

IsaacWarren approved these changes Dec 14, 2024

View reviewed changes

strangeloopcanon approved these changes Dec 14, 2024

View reviewed changes

strangeloopcanon approved these changes Dec 15, 2024

View reviewed changes

IsaacWarren force-pushed the main branch from 31f2714 to e1c7895 Compare December 16, 2024 04:38

= added 2 commits December 16, 2024 09:21

add locally runnable benchmarks

5c25de9

fix path

5c0df31

scott-routledge2 force-pushed the scott/local_benchmark branch from 16d73d6 to 5c0df31 Compare December 16, 2024 14:25

= added 2 commits December 16, 2024 10:24

Merge branch 'main' into scott/local_benchmark

2175b4b

update benchmark structure, add scripts and change readme

e88df28

scott-routledge2 commented Dec 16, 2024

View reviewed changes

fix graph

fd7eef4

scott-routledge2 requested review from ehsantn, IsaacWarren and strangeloopcanon and removed request for ehsantn December 16, 2024 22:10

strangeloopcanon approved these changes Dec 16, 2024

View reviewed changes

ehsantn approved these changes Dec 17, 2024

View reviewed changes

IsaacWarren approved these changes Dec 17, 2024

View reviewed changes

= added 4 commits December 17, 2024 15:09

README changes

bf9dc80

Merge branch 'main' into scott/local_benchmark

903b312

minor comments

925e126

fix small typos

76cbabd

minor fixes

6ec4515

scott-routledge2 merged commit bbc0475 into main Dec 17, 2024
8 checks passed

scott-routledge2 deleted the scott/local_benchmark branch December 17, 2024 21:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BSE-4385] Add locally runnable benchmarks #65

[BSE-4385] Add locally runnable benchmarks #65

scott-routledge2 commented Dec 13, 2024 •

edited

Loading

scott-routledge2 commented Dec 13, 2024 •

edited

Loading

IsaacWarren left a comment

strangeloopcanon commented Dec 14, 2024

strangeloopcanon left a comment

ehsantn commented Dec 15, 2024

scott-routledge2 commented Dec 16, 2024

scott-routledge2 Dec 16, 2024

strangeloopcanon Dec 16, 2024

ehsantn Dec 17, 2024

scott-routledge2 Dec 16, 2024

strangeloopcanon Dec 16, 2024

ehsantn left a comment

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

scott-routledge2 Dec 17, 2024

strangeloopcanon Dec 17, 2024

scott-routledge2 Dec 17, 2024

strangeloopcanon Dec 17, 2024

ehsantn Dec 17, 2024

ehsantn Dec 17, 2024

IsaacWarren left a comment

		```

		We used a smaller subset of the [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) to allow the workload to run locally on an Apple M2 Macbook Pro with 10 cores and 16 GB memory. Even at this smaller scale, Bodo shows a modest improvement over the next best system (Dask).


		assert len(files) != 1, "Spark benchmark expects a single path argument."

		return files[0].replace("s3://", "s3a://").replace("fhvhv/", "fhvhv-rewrite/")

		@@ -0,0 +1,63 @@
		# Benchmarks

		## Monthly High Volume for Hire Vehicle Trips with Precipitation

	## Monthly High Volume for Hire Vehicle Trips with Precipitation
	## NYC Taxi Monthly Trips with Precipitation


		## Monthly High Volume for Hire Vehicle Trips with Precipitation

		For this benchmark, we adapt a [SQL query](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1) into a pandas workload that reads from a public S3 bucket and calculates the average trip duration and number of trips based on features like weather conditions, pickup and dropoff location, month, and whether the trip was on a weekday.


		### Dataset

		The New York City Taxi and Limousine Commission's [For Hire Vehicle High Volume dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)(FHVHV) consists of over one billion trips taken by "for hire vehicles" including Uber and Lyft. To get the weather on a given day, we use a separate dataset of [Central Park weather observations](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv). The For Hire Vehicle High Volume dataset consists of 1,036,465,968 rows and 24 columns. The Central Park Weather dataset consists of 5,538 rows and 9 columns.


		### Setting

		For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.

	For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.
	For this benchmark, we use the full FHVHV dataset stored in Parquet files on S3. The total size of this dataset is 24.7 GiB. The Central Park Weather data is stored in a single CSV file on S3 and its total size is 514 KiB.


		For this benchmark, we used the full FHVHV dataset stored in parquet files on S3. The total size of this dataset was 24.7 GiB. The Central Park Weather data was stored in a single csv file on S3 and it's total size was 514 KiB.

		We compared Bodo's performance on this workload to other systems including [Dask](https://www.dask.org/), [Modin on Ray](https://docs.ray.io/en/latest/ray-more-libs/modin/index.html), and [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html) and observed a speedup of 20-240x. The implementations for all of these systems can be found in [`nyc_taxi`](./nyc_taxi/). Versions of the packages used are summarized below.


		For cluster creation and configuration, we used the [Bodo SDK](https://docs.bodo.ai/2024.12/guides/using_bodo_platform/bodo_platform_sdk_guide/) for Bodo, Dask Cloudprovider for Dask, Ray for Modin, and AWS EMR for Spark. Scripts to configure and launch clusters for each system can be found in the same directory as the implementation.

		Each benchmark was collected on a cluster containing 4 worker instances and 128 physical cores. Dask, Modin, and Spark used 4 `r6i.16xlarge` instances, each consisting of 32 physical cores and 256 GiB of memory. Dask Cloudprovider also allocated an additional `r6i.16xlarge` instance for the scheduler. Bodo was run on 4 `c6i.16xlarge` instances, each consisting of 32 physical cores and 64 GiB of memory.


		The graph below summarizes the total execution time of each system (averaged over 3 runs). Results were last collected on December 12th, 2024.

		<img src="./img/nyc_taxi.png" alt="Monthly High Volume for Hire Vehicle Trips with Precipitation Benchmark Execution Time" title="Monthly High Volume for Hire Vehicle Trips with Precipitation Average Execution Time" width="30%">

		@@ -0,0 +1,78 @@
		"""Run local version of Bodo, Dask, Modin, and Pyspark benchmarks

[BSE-4385] Add locally runnable benchmarks #65

[BSE-4385] Add locally runnable benchmarks #65

Conversation

scott-routledge2 commented Dec 13, 2024 • edited Loading

Changes included in this PR

Testing strategy

User facing changes

Checklist

scott-routledge2 commented Dec 13, 2024 • edited Loading

IsaacWarren left a comment

Choose a reason for hiding this comment

strangeloopcanon commented Dec 14, 2024

strangeloopcanon left a comment

Choose a reason for hiding this comment

ehsantn commented Dec 15, 2024

scott-routledge2 commented Dec 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehsantn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

IsaacWarren left a comment

Choose a reason for hiding this comment

scott-routledge2 commented Dec 13, 2024 •

edited

Loading

scott-routledge2 commented Dec 13, 2024 •

edited

Loading