Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BSE-4271] Add NYC Taxi benchmark code #6

Merged
merged 5 commits into from
Dec 5, 2024
Merged

Conversation

ehsantn
Copy link
Collaborator

@ehsantn ehsantn commented Dec 3, 2024

Changes included in this PR

Adds the initial Bodo code for NYC Taxi Monthly Trips with Precipitation.
Uses the "High Volume For-Hire Vehicle Trip Records" data in here since it has more records: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Testing strategy

Testing it manually for now.

User facing changes

NA

Checklist

  • Pipelines passed before requesting review. To run CI you must include [run CI] in your commit message.
  • I am familiar with the Contributing Guide
  • I have installed + ran pre-commit hooks.

Comment on lines -196 to +199
" parse_dates=[\"date\"]\n",
" parse_dates=[\"DATE\"]\n",
" )\n",
" central_park_weather_observations = central_park_weather_observations.rename(\n",
" columns={\"DATE\": \"date\", \"PRCP\": \"precipitation\"}, copy=False\n",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to update column names here since I updated the s3://bodo-example-data/nyc-taxi/central_park_weather.csv file.

Copy link
Contributor

@scott-routledge2 scott-routledge2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I also reviewed the SQL code and they look equivalent to me. Maybe it would be useful to implement some kind of checksum to make sure all of the tools create the same result.

Copy link
Contributor

@IsaacWarren IsaacWarren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but I don't see any of the filtering in the where clause in the sql query in our Python implementation

@ehsantn
Copy link
Collaborator Author

ehsantn commented Dec 5, 2024

Avoided the filters to keep the input data as large as possible. But we can add them as necessary.

@ehsantn
Copy link
Collaborator Author

ehsantn commented Dec 5, 2024

Maybe removing nulls in keys would improve groupby performance.

@ehsantn ehsantn merged commit ca8c70c into main Dec 5, 2024
6 checks passed
@ehsantn ehsantn deleted the ehsan/nyc_taxi_benchmark branch December 5, 2024 03:25
scott-routledge2 pushed a commit that referenced this pull request Dec 16, 2024
* Initial commit

* [BP-454] IPyParallel Wrapper Kernel for Platform (#1)

* [BP-454] Improvements and fixes (#2)

* Fix README

* Add PyPi badge

* Allow users to create their own IPy clusters, formatting

* Fix typo

* Simplify a little

* Invert magic check logic

* Small change in example ipcluster config

* Generate IPYPARALLEL_MAGICS from line and cell magics

* Fix typo in comment

* Modify `example_ipcluster_config.py`  (#3)

* [BP-1032] Check `update_hostfile.sh` in multiple locations including `/home/bodo` (#4)

* Check update_hostfile in multiple locations including /home/bodo

* Add .DS_Store to .gitignore

* [BP-902] feat/custom-launcher (#5)

* feat/custom-launcher

* rfac: rename launcher

* rfac: default error message when exit code doesnt match

* chore: register launcher entrypoint

* chore: handle px autopx errors

* Update launcher.py

* Update launcher.py

* chore: update logging

* chore: minor fixes

* chore: minor fixes

* rfac: check for engine running status

* Update launcher.py

* fix: include bodo custom magics

* Update bodo_platform_ipyparallel_kernel/kernel.py

Co-authored-by: Sahil Gupta <sahil@bodo.ai>

* chore: minor fixes

* chore: minor fixes

* fix: formatting

* chore: minor fixes

* chore: allow users to run pxresult or disable autopx when engines not running

Co-authored-by: Sahil Gupta <sahil@bodo.ai>

* Fix issue with autopx saying it's enabled after cluster creation failed (#7)

* Return when cluster isn't created succesfully

* Fix issue with cell remaining in pending state after cluster creation fails

Co-authored-by: Isaac <Isaac@bodo.com>

* BP-1607: Shutdown view on error (#8)

* Shutdown view

* Add comment

Co-authored-by: Isaac <Isaac@bodo.com>

* BP 1607: IPyParallel Engines not killed (#9)

* Shutdown view

* Add comment

* Add extra shutdown to stop_ipyparallel_cluster

* Kill engines after shutdown

* Add async

* Change order of shutdown

* Stop engines

* Add debug messages

* Switch to sigterm

* Remove async

* Remove check

* If there are outstanding tasks abort and kill engines

* Add await

* Use client instead of cluster

* Fix member name

* Try sync

* Remove unecessary shutdown

* Readd

* Move if

* Run formatter

* Try forcekilling after stop_cluster

* Revert

* Add comment

* Stop engines and controller through cluster interface

* Add comments and manually remove cluster file

* Formatting

* Readd shutdown

* Replace stop_cluster_sync with shutdown

* Change to cluster api

* Add debug

* Add block=True

* Add wrap future

* Wait with timneout

* Wrap in wait

* Always SigKill

* Add stop_cluster_sync at end of shutdown

* Add catch for filenotfound

* Add comments

Co-authored-by: Isaac <Isaac@bodo.com>
Co-authored-by: Sahil <sahil@bodo.ai>

* [BP-1486][BP-1826] Add support for SQL language mode using the `%%sql` magic (#6)

This adds support for the new SQL language mode on our platform. This is done by converting SQL queries to a %%sql magic execution.
The language mode and catalog detection is done in execute_request where this information is readily available. This function is run before do_execute, which is why we get the required information from the execution metadata here, and convert the code that's then sent to do_execute. Since we want to run SQL queries on the IPyParallel cluster by default, we also add the %%px magic automatically (if %autopx isn't enabled). Just like any other IPP magics, this would therefore start an IPP cluster if one is not already initialized.

Co-authored-by: Isaac <Isaac@bodo.com>
Co-authored-by: meehawk <meehawk@yahoo.com>
Co-authored-by: Sahil <sahil@bodo.ai>

* [BP-2119] Replace MPI EngineSetLauncher with Slurm (#11)

* replace MPI EngineSetLauncher with Slurm

* check if Slurm installed

* workaround ipp signal bug

* add comment

* Remove dead ipcluster config file

* [BP-880] Add Parallel-Python Mode Support (#10)

This adds support for a "Parallel-Python" language mode, in addition to the existing "Python" and "SQL" language modes. This has the following implications:
- `%autopx` is no longer required, and in fact will be ignored by the kernel.
- Functionality of the "Python" mode has been limited to running things on a single core (note that it's not rank 0 of the IPyParallel cluster, rather a separate process). This mode is now limited to running sequential code (for debugging, etc.) and our custom line magics like `%pconda`, etc. IPyParallel magics are not supported, and the kernel will show an error message if they're used.
- "Parallel-Python" is really the new default. It will add `%%px` to all code (unless the user is already using an IPyParallel magic).
- `execute_request` acts like a pre-parser for all code based on the language mode, reducing the complexity in `do_execute`.
- `execute_request` does the following:
    - For SQL mode:
        - If IPyParallel, SQL or Bodo Platform magics (`%pconda`, etc.) are used, an error will be shown and no code will be run.
        - If the code is regular SQL (which we assume it is if it doesn't use the magics mentioned above), we will add the `%%sql` magic along with the catalog details to the code. If a catalog is not specified, an error message will be shown to the users. We will also add `%%px` so it's executed on the IPyParallel cluster.
        - If the code turns out not to be SQL, the behavior is undefined (most likely a BodoSQL parsing error, but could be something else).
    - For Python mode:
        - If any of the IPyParallel magics are used, an error will be shown to the user to use the Parallel-Python mode instead.
        - Everything else will be passed along as is, including our custom line magics (`%pconda`, etc.)
    - For Parallel Python mode:
        - If `%autopx` is used, a warning will be shown to the user and the code will be disregarded.
        - Any other IPyParallel magics will be passed through as is (this is safe since `autopx` won't be on, and the semantics should always be correct while in Parallel-Python mode).
        - Our custom magics (`%pconda`, etc.) will be passed through as is (this is safe since `autopx` cannot be on, so they will get run on the separate single core process, which is the intended use case).
        - All other code will be prepended with `%%px`.
- `do_execute` is now simpler, in that it just starts an IPyParallel cluster whenever the code has any IPyParallel magics (`autopx` can't get through `execute_request`, but even if it did, `do_execute` will detect it and show a warning). It also handles not allowing running IPyParallel magics (except `%pxresult`) when the engines are dead (same as before). 


Co-authored-by: Isaac <isaac@bodo.com>

* revert to using MPIEngineLauncher (#12)

* revert to using MPI

* switching to using environment vars

* switching to using environment vars

* Update bodo_platform_ipyparallel_kernel/launcher.py

Co-authored-by: Sahil Gupta <sahil@bodo.ai>

Co-authored-by: Aaditya Srivathsan <aadityasrivathsan@Aadityas-MacBook-Pro-2.local>
Co-authored-by: Sahil Gupta <sahil@bodo.ai>

* changes

* update conda-lock

* change location

* fix test

* suggested changes

* fix

* update buildspec

* [run ci]

---------

Co-authored-by: Sahil Gupta <sahil1105@hotmail.com>
Co-authored-by: Sahil Gupta <sahil@bodo.ai>
Co-authored-by: meehawk <80167324+meehawk@users.noreply.github.com>
Co-authored-by: Isaac Warren <42949629+IsaacWarren@users.noreply.github.com>
Co-authored-by: Isaac <Isaac@bodo.com>
Co-authored-by: meehawk <meehawk@yahoo.com>
Co-authored-by: Ehsan Totoni <ehsan.tn@gmail.com>
Co-authored-by: Aaditya Srivathsan <102938027+aadits1004@users.noreply.github.com>
Co-authored-by: Aaditya Srivathsan <aadityasrivathsan@Aadityas-MacBook-Pro-2.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants