forked from hail-is/hail
Add query graceful shutdown #35
Merged
illusional force-pushed the add-query-graceful-shutdown branch from c256e9d to 2210349 on February 23, 2021 22:56
lgruen reviewed on Feb 25, 2021
lgruen reviewed on Feb 25, 2021
lgruen approved these changes on Feb 25, 2021
That's great, thanks!
Added some minor comments.
# We don't need to be strict about the order of the returned tasks,
# because we only care that they finish.
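The review comment above refers to waiting on outstanding tasks during shutdown. A minimal sketch of that idea, using only the standard library, might look like the following. The function name `wait_for_running_tasks` and its return value are assumptions for illustration and are not taken from the PR; the PR's actual handler may differ.

```python
import asyncio


async def wait_for_running_tasks(app=None) -> int:
    """Block until every other task on the current event loop has finished.

    `app` is accepted (and ignored) so the coroutine has the shape of an
    aiohttp signal handler. Returns the number of tasks that were awaited.
    """
    current = asyncio.current_task()
    # all_tasks() only returns tasks that are not yet done; skip ourselves
    # so the handler does not wait on its own completion.
    others = [t for t in asyncio.all_tasks() if t is not current]
    if others:
        # asyncio.wait returns unordered sets, which is fine here because
        # we only care that the tasks finish, not in which order.
        await asyncio.wait(others)
    return len(others)


# With aiohttp, a handler like this would be registered on the application:
#     app.on_shutdown.append(wait_for_running_tasks)
```

In a real service this waits on every task on the loop, including long-lived background tasks, so a timeout shorter than the pod's termination grace period is probably worth adding.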
illusional added a commit that referenced this pull request on Feb 25, 2021
…tdown Add query graceful shutdown
lgruen pushed a commit that referenced this pull request on Mar 23, 2021
* Merge pull request #35 from populationgenomics/add-query-graceful-shutdown
  Add query graceful shutdown
* Remove unused argument from query:on_shutdown
illusional added a commit that referenced this pull request on Mar 23, 2021
* [batch] Worker cleanup (hail-is#10155)
* [batch] Worker cleanup
* more changes
* wip
* delint
* additions?
* fix
* [query] Add `source_file_field` to `import_table` (hail-is#10164)
* [query] Add `source_file_field` to `import_table`
  CHANGELOG: Add `source_file_field` parameter to `hl.import_table` to allow lines to be associated with their original source file.
* ugh
* [ci] add authorize sha and action items table to user page (hail-is#10142)
* [ci] add authorize sha and action items table to user page
* [ci] track review requested in addition to assigned for PR reviews
* [ci] add CI dropdown with link to user page (hail-is#10163)
* [batch] add more logs and do not wait for asyncgens (hail-is#10136)
* [batch] add more logs and do not wait for asyncgens
  I think there is some unresolved issue with asyncgen shutdown that is keeping workers alive. This is not an issue in worker because worker calls sys.exit which forcibly stops execution. cc: @daniel-goldstein @jigold.
* fix lint
* [query-service] maybe fix event loop not initialized (hail-is#10153)
* [query-service] maybe fix event loop not initialized
  The event loop is supposed to be initialized in the main thread. Sometimes our tests get placed in the non-main thread (always a thread named Dummy-1). Hopefully the session-scoped fixture is run in the main thread.
* fix
* [prometheus] add prometheus to track SLIs (hail-is#10165)
* [prometheus] add prometheus to track SLIs
* add wraps
* [query] apply nest-asyncio as early as possible (hail-is#10158)
* [query] apply nest-asyncio as early as possible
* fix
* [grafana] set pod fsGroup to grafana user (hail-is#10162)
* fix linting errors (hail-is#10171)
* [query] Remove verbose print (hail-is#10167)
  Looks like this got added in some dndarray work
* [ci] update assignees and reviewers on PR github update (hail-is#10168)
* [query-service] fix receive logic (hail-is#10159)
* [query-service] fix receive logic
  Only one coro waits on receive now. We still error if a message is sent before we make our first response.
* fix
* fix
* CHANGELOG: Fixed incorrect error message when incorrect type specified with hl.loop (hail-is#10174)
* [linting] add curlylint check for any service that renders jinja2 (hail-is#10172)
* [linting] add curlylint check for any service that renders jinja2 templates
* [linting] spaces not tabs
* [website] fix website (hail-is#10173)
* [website] fix website
  I build old versions of the docs and use them in new websites. This does not work for versions of the docs before I introduced the new system. In particular versions 0.2.63 and before generate old-style docs.
* tutorials are templated
* [ci] change mention for deploy failure (hail-is#10178)
* [gateway] move ukbb routing into gateway (hail-is#10179)
* [query] Fix filter intervals (keep=False) memory leak (hail-is#10182)
* [query-service] remove service backend tests (hail-is#10180)
  They are too flaky currently due to the version issue.
* [website] pass response body as kwarg (hail-is#10176)
* Release 0.2.64 (hail-is#10183)
* Bump version number
* Updated changelog
* [nginx] ensure nginx configs dont overwrite each other in build.yaml (hail-is#10181)
* [query-service] teach query service to read MTs and Ts created by Spark (hail-is#10184)
* [query-service] teach query service to read MTs and Ts created by Spark
  Hail-on-Spark uses HadoopFS which emulates directories by creating size-zero files with the name `gs://bucket/dirname/`. Note: the object name literally ends in a slash. Such files should not be included in `listStatus` (they should always be empty anyway). Unfortunately, my fix in hail-is#9914 was wrong because `GoogleStorageFileStatus` removes the trailing slash. This prevented the path from matching `path`, which always ends in a `/`.
* fix
* [website] dont jinja render any of the batch docs (hail-is#10190)
* [googlestoragefs] ignore the directory check entirely (hail-is#10185)
* [googlestoragefs] ignore the directory check entirely
  If a file exists with the *same name as the directory we are listing*, then it must be a directory marker. It does not matter if that file is a directory or not.
* Update GoogleStorageFS.scala
* [ci] fix focus on slash and search job page for PRs (hail-is#10194)
* [query] Improve file compatibility error (hail-is#10191)
* Call init_service from init based on HAIL_QUERY_BACKEND value. (hail-is#10189)
* [query] NDArray Sum (hail-is#10187)
* Attempt implementing the sum rule in Emit
* Connected the python code, but not working yet
* NDArrayExpression.sum is working now
* Add default arg when no axis is provided
* More comprehensive test
* Unused imports
* Use sum appropriately in linear_regression_rows_nd
* Deleted extra blank line
* Don't use typeToTypeInfo, make NumericPrimitives the source of these decisions
* Better assertions, with tests
* Got the summation index correct
* Add documentation
* [website] fix resource path for non-html files in the docs (hail-is#10196)
* [query] Remove tcode from primitive orderings (hail-is#10193)
* [query] BlockMatrix map (hail-is#10195)
* Add map, but protect users of the spark backend from writing arbitrary maps
* If densify would have been a no-op, that should work
* Densify and Sparsify are no-ops for now
* Rename map to map_dense and map_sparse. Give better implementations for add, multiply, divide, subtract of a scalar
* Make the maps underscore methods
* [query] Remove all uses of .tcode[Boolean] (hail-is#10198)
* [ci] make test hello speak https (hail-is#10192)
* [tls] make hello use tls
* change pylint ignore message
* [query] blanczos_pca dont do extra loading work (hail-is#10201)
* Use the checkpointed table from mt_to_table_of_ndarray to avoid recomputing mt
* Keep extra row fields from being included
* Add query graceful shutdown for rolling updates (hail-is#10106)
* Merge pull request #35 from populationgenomics/add-query-graceful-shutdown
  Add query graceful shutdown
* Remove unused argument from query:on_shutdown
* [auth] add more options for obtaining session id for dev credentials (hail-is#10203)
* [auth] add more options for obtaining session id for dev credentials
* [auth] extract userinfo query for use in both userinfo and verify_dev_credentials
* remove unused import
* [query] Default to Spark 3 (hail-is#10054)
* Change hail to use spark3 and scala 2.12 by default, change build_hail_spar3 to instead test spark2 for backwards support
* Update Makefile
* Update dataproc image version
* Scale down the dataproc version, since latest dataproc is using Spark release candidate
* Update pyspark version in requirements.txt
* Bump scala/spark patch versions
* We want to use the newer py4j jar when using spark 3
* Upgrade json4s
* I now want Spark 3.1.1, since it's been released
* Upgrade to 3.1.1 in the Makefile, fix a deprecateed IOUtils method
* Update pyspark as well
* Don't update json4s
* Try upgrading version
* Fixed issue for constructing bufferspecs
* Should at least be using newest one
* Remove abstracts from type hints
* Revert "Remove abstracts from type hints"
  This reverts commit 1e0d194.
* Things don't go well if I don't use the same json4s version as Spark
* Mixed a typeHintFieldName
* See if this fixes my BlockMatrixSparsity issue
* json4s can't handle a curried apply method
* This works so long as the jar file is included in the libs directory
* Makefile changes to support pulling elasticsearch
* Use dataproc image for Spark 3.1.1
* Update patch version of dataproc image, no longer uses Spark RC
* Fixed up Makefile, now correctly depends on copying the jar
* Now we just check that the specified version is 7, as that's all we support
* Delete build_hail_spark2, we can't support spark2
* Version checks for Scala and Spark
* Updated installation docs
* Spark versions warning
* Update some old pysparks
* [batch] Add more info to UI pages (hail-is#10070)
* [batch] Add more info to UI pages
* fixes
* addr comment
* addr comments
* Bump jinja2 from 2.10.1 to 2.11.3 in /docker (hail-is#10209)
  Bumps [jinja2](https://github.com/pallets/jinja) from 2.10.1 to 2.11.3.
  - [Release notes](https://github.com/pallets/jinja/releases)
  - [Changelog](https://github.com/pallets/jinja/blob/master/CHANGES.rst)
  - [Commits](pallets/jinja@2.10.1...2.11.3)
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* [docker][hail] update to latest pytest (hail-is#10177)
* [docker][hail] update to latest pytest
  Issues like this https://ci.hail.is/batches/221291/jobs/112 do not appear locally for me, I suspect this is due to my using a much newer pytest.
* fix many tests incorrectly using pytest
* another one
* remove unnecessary pip installs in service test dockerfiles
* fix
* [gateway] Cut out router and router-resolver from gateway internal routing (hail-is#10207)
* [gateway] cut out router-resolver from internal auth flow
* [gateway] cut out router from internal
* [datasets] add pan-ukb datasets (hail-is#10186)
* add available pan-ukb datasets
* add rst files for schemas
* reference associated variant indices HT in the block matrix descriptions
* [query] Add json warn context to `parse_json` (hail-is#10160)
  We don't test the logs, but I did test this manually, it works as expected.
* [query] fix tmp_dir default in init(), which doesn't work for the service backend (hail-is#10199)
* Fix tmp_dir default, which doesn't work for the service backend.
* Fix type for tmp_dir.
* [gitignore]ignore website and doc files (hail-is#10214)
* Remove duplicate on_shutdown in query service

Co-authored-by: jigold <jigold@users.noreply.github.com>
Co-authored-by: Tim Poterba <tpoterba@broadinstitute.org>
Co-authored-by: Daniel Goldstein <danielgold95@gmail.com>
Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
Co-authored-by: John Compitello <johnc@broadinstitute.org>
Co-authored-by: Christopher Vittal <cvittal@broadinstitute.org>
Co-authored-by: Michael Franklin <michael@illusional.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Patrick Cummings <42842025+pwc2@users.noreply.github.com>
Co-authored-by: Carolin Diaz <63973811+CDiaz96@users.noreply.github.com>
vladsavelyev pushed a commit that referenced this pull request on Mar 26, 2021
- Add `terminationGracePeriodSeconds` to query service.
- Add `app.on_shutdown` signal handler to wait for all asyncio tasks to complete before returning.
- `aiohttp == 0.7.3`, to address tasks being cancelled before the `on_shutdown` method is called: "on_cleanup / on_shutdown are called after active tasks on the event loop are canceled" (aio-libs/aiohttp#3593).

Testing

- Added a "wait `n` seconds" method that slept for n seconds and returned the value of an environment variable. This environment variable meant I could track which version of the deployment my script ran against.
- Took `deploy.yaml` from the `deploy query` step of the dev deploy, adding the `TEST_VALUE` environment variable with some value and saving it as `new-deploy.yaml`.
- Issued a first request to the wait method (`https://internal.hail.populationgenomics.org.au/$NAMESPACE/query/api/v1alpha/wait?duration=50`).
- After checking the pods (`kubectl --namespace $NAMESPACE get pod`), issued the second request to the wait method.

Termination logs:

Duration endpoint
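A minimal sketch of what the wait method described under Testing could look like, assuming a plain asyncio coroutine that an aiohttp handler would call. The `TEST_VALUE` variable and the `duration` query parameter come from the description; the function name `wait_and_report` and the shape of the returned dictionary are assumptions for illustration and do not come from the PR.

```python
import asyncio
import os


async def wait_and_report(duration: float) -> dict:
    """Sleep for `duration` seconds, then report which deployment answered."""
    await asyncio.sleep(duration)
    # TEST_VALUE is set in the deployment spec, so the value returned here
    # identifies the version of the deployment that served the request.
    return {"duration": duration, "test_value": os.environ.get("TEST_VALUE")}


# Hypothetical aiohttp wiring (not shown in the PR text):
#     async def wait_handler(request):
#         duration = float(request.query["duration"])
#         return web.json_response(await wait_and_report(duration))
#     app.add_routes([web.get("/api/v1alpha/wait", wait_handler)])
```

If a request issued before the rollout returns the old `TEST_VALUE` after the full duration, the old pod was allowed to finish its in-flight work before terminating.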