-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BSE-4271] Add NYC Taxi benchmark code #6
Conversation
@IsaacWarren @scott-routledge2 could you verify this matches the SQL version here (excluding a few output columns): https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/citibike_comparison/analysis/analysis_queries.sql#L1 Here is our previous version which had some bugs: |
" parse_dates=[\"date\"]\n", | ||
" parse_dates=[\"DATE\"]\n", | ||
" )\n", | ||
" central_park_weather_observations = central_park_weather_observations.rename(\n", | ||
" columns={\"DATE\": \"date\", \"PRCP\": \"precipitation\"}, copy=False\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to update column names here since I updated the s3://bodo-example-data/nyc-taxi/central_park_weather.csv
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I also reviewed the SQL code and they look equivalent to me. Maybe it would be useful to implement some kind of checksum to make sure all of the tools create the same result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, but I don't see any of the filtering in the where clause in the sql query in our Python implementation
Avoided the filters to keep the input data as large as possible. But we can add them as necessary. |
Maybe removing nulls in keys would improve groupby performance. |
* Initial commit * [BP-454] IPyParallel Wrapper Kernel for Platform (#1) * [BP-454] Improvements and fixes (#2) * Fix README * Add PyPi badge * Allow users to create their own IPy clusters, formatting * Fix typo * Simplify a little * Invert magic check logic * Small change in example ipcluster config * Generate IPYPARALLEL_MAGICS from line and cell magics * Fix typo in comment * Modify `example_ipcluster_config.py` (#3) * [BP-1032] Check `update_hostfile.sh` in multiple locations including `/home/bodo` (#4) * Check update_hostfile in multiple locations including /home/bodo * Add .DS_Store to .gitignore * [BP-902] feat/custom-launcher (#5) * feat/custom-launcher * rfac: rename launcher * rfac: default error message when exit code doesnt match * chore: register launcher entrypoint * chore: handle px autopx errors * Update launcher.py * Update launcher.py * chore: update logging * chore: minor fixes * chore: minor fixes * rfac: check for engine running status * Update launcher.py * fix: include bodo custom magics * Update bodo_platform_ipyparallel_kernel/kernel.py Co-authored-by: Sahil Gupta <sahil@bodo.ai> * chore: minor fixes * chore: minor fixes * fix: formatting * chore: minor fixes * chore: allow users to run pxresult or disable autopx when engines not running Co-authored-by: Sahil Gupta <sahil@bodo.ai> * Fix issue with autopx saying it's enabled after cluster creation failed (#7) * Return when cluster isn't created succesfully * Fix issue with cell remaining in pending state after cluster creation fails Co-authored-by: Isaac <Isaac@bodo.com> * BP-1607: Shutdown view on error (#8) * Shutdown view * Add comment Co-authored-by: Isaac <Isaac@bodo.com> * BP 1607: IPyParallel Engines not killed (#9) * Shutdown view * Add comment * Add extra shutdown to stop_ipyparallel_cluster * Kill engines after shutdown * Add async * Change order of shutdown * Stop engines * Add debug messages * Switch to sigterm * Remove async * Remove check * If there are outstanding tasks abort and kill engines * Add await * Use client instead of cluster * Fix member name * Try sync * Remove unecessary shutdown * Readd * Move if * Run formatter * Try forcekilling after stop_cluster * Revert * Add comment * Stop engines and controller through cluster interface * Add comments and manually remove cluster file * Formatting * Readd shutdown * Replace stop_cluster_sync with shutdown * Change to cluster api * Add debug * Add block=True * Add wrap future * Wait with timneout * Wrap in wait * Always SigKill * Add stop_cluster_sync at end of shutdown * Add catch for filenotfound * Add comments Co-authored-by: Isaac <Isaac@bodo.com> Co-authored-by: Sahil <sahil@bodo.ai> * [BP-1486][BP-1826] Add support for SQL language mode using the `%%sql` magic (#6) This adds support for the new SQL language mode on our platform. This is done by converting SQL queries to a %%sql magic execution. The language mode and catalog detection is done in execute_request where this information is readily available. This function is run before do_execute, which is why we get the required information from the execution metadata here, and convert the code that's then sent to do_execute. Since we want to run SQL queries on the IPyParallel cluster by default, we also add the %%px magic automatically (if %autopx isn't enabled). Just like any other IPP magics, this would therefore start an IPP cluster if one is not already initialized. Co-authored-by: Isaac <Isaac@bodo.com> Co-authored-by: meehawk <meehawk@yahoo.com> Co-authored-by: Sahil <sahil@bodo.ai> * [BP-2119] Replace MPI EngineSetLauncher with Slurm (#11) * replace MPI EngineSetLauncher with Slurm * check if Slurm installed * workaround ipp signal bug * add comment * Remove dead ipcluster config file * [BP-880] Add Parallel-Python Mode Support (#10) This adds support for a "Parallel-Python" language mode, in addition to the existing "Python" and "SQL" language modes. This has the following implications: - `%autopx` is no longer required, and in fact will be ignored by the kernel. - Functionality of the "Python" mode has been limited to running things on a single core (note that it's not rank 0 of the IPyParallel cluster, rather a separate process). This mode is now limited to running sequential code (for debugging, etc.) and our custom line magics like `%pconda`, etc. IPyParallel magics are not supported, and the kernel will show an error message if they're used. - "Parallel-Python" is really the new default. It will add `%%px` to all code (unless the user is already using an IPyParallel magic). - `execute_request` acts like a pre-parser for all code based on the language mode, reducing the complexity in `do_execute`. - `execute_request` does the following: - For SQL mode: - If IPyParallel, SQL or Bodo Platform magics (`%pconda`, etc.) are used, an error will be shown and no code will be run. - If the code is regular SQL (which we assume it is if it doesn't use the magics mentioned above), we will add the `%%sql` magic along with the catalog details to the code. If a catalog is not specified, an error message will be shown to the users. We will also add `%%px` so it's executed on the IPyParallel cluster. - If the code turns out not to be SQL, the behavior is undefined (most likely a BodoSQL parsing error, but could be something else). - For Python mode: - If any of the IPyParallel magics are used, an error will be shown to the user to use the Parallel-Python mode instead. - Everything else will be passed along as is, including our custom line magics (`%pconda`, etc.) - For Parallel Python mode: - If `%autopx` is used, a warning will be shown to the user and the code will be disregarded. - Any other IPyParallel magics will be passed through as is (this is safe since `autopx` won't be on, and the semantics should always be correct while in Parallel-Python mode). - Our custom magics (`%pconda`, etc.) will be passed through as is (this is safe since `autopx` cannot be on, so they will get run on the separate single core process, which is the intended use case). - All other code will be prepended with `%%px`. - `do_execute` is now simpler, in that it just starts an IPyParallel cluster whenever the code has any IPyParallel magics (`autopx` can't get through `execute_request`, but even if it did, `do_execute` will detect it and show a warning). It also handles not allowing running IPyParallel magics (except `%pxresult`) when the engines are dead (same as before). Co-authored-by: Isaac <isaac@bodo.com> * revert to using MPIEngineLauncher (#12) * revert to using MPI * switching to using environment vars * switching to using environment vars * Update bodo_platform_ipyparallel_kernel/launcher.py Co-authored-by: Sahil Gupta <sahil@bodo.ai> Co-authored-by: Aaditya Srivathsan <aadityasrivathsan@Aadityas-MacBook-Pro-2.local> Co-authored-by: Sahil Gupta <sahil@bodo.ai> * changes * update conda-lock * change location * fix test * suggested changes * fix * update buildspec * [run ci] --------- Co-authored-by: Sahil Gupta <sahil1105@hotmail.com> Co-authored-by: Sahil Gupta <sahil@bodo.ai> Co-authored-by: meehawk <80167324+meehawk@users.noreply.github.com> Co-authored-by: Isaac Warren <42949629+IsaacWarren@users.noreply.github.com> Co-authored-by: Isaac <Isaac@bodo.com> Co-authored-by: meehawk <meehawk@yahoo.com> Co-authored-by: Ehsan Totoni <ehsan.tn@gmail.com> Co-authored-by: Aaditya Srivathsan <102938027+aadits1004@users.noreply.github.com> Co-authored-by: Aaditya Srivathsan <aadityasrivathsan@Aadityas-MacBook-Pro-2.local>
Changes included in this PR
Adds the initial Bodo code for NYC Taxi Monthly Trips with Precipitation.
Uses the "High Volume For-Hire Vehicle Trip Records" data in here since it has more records: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Testing strategy
Testing it manually for now.
User facing changes
NA
Checklist
[run CI]
in your commit message.