fix(tee-prover): mitigate panic on redeployments #2764
We experienced a tee-prover panic, likely due to the automatic redeployment of the proof-data-handler in the staging environment. We've been getting 503 errors for an extended period when trying to reach http://server-v2-proof-data-handler-internal.stage.matterlabs.corp/tee/proof_input, which resulted in a panic after reaching the retry limit.
Relevant code causing the panic: https://github.com/matter-labs/zksync-era/blob/8ed086afecfcad30bfda44fc4d29a00beea71cca/core/bin/zksync_tee_prover/src/tee_prover.rs#L202
Relevant logs: https://grafana.matterlabs.dev/explore?schemaVersion=1&panes=%7B%223ss%22:%7B%22datasource%22:%22cduazndivuosga%22,%22queries%22:%5B%7B%22metrics%22:%5B%7B%22id%22:%221%22,%22type%22:%22logs%22%7D%5D,%22query%22:%22container_name:%5C%22zksync-tee-prover%5C%22%22,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22quickwit-quickwit-datasource%22,%22uid%22:%22cduazndivuosga%22%7D,%22alias%22:%22%22,%22bucketAggs%22:%5B%7B%22type%22:%22date_histogram%22,%22id%22:%222%22,%22settings%22:%7B%22interval%22:%22auto%22%7D,%22field%22:%22%22%7D%5D,%22timeField%22:%22%22%7D%5D,%22range%22:%7B%22from%22:%221724854712742%22,%22to%22:%221724855017388%22%7D%7D%7D&orgId=1
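To make the failure mode concrete, here is a minimal sketch of a bounded retry loop that hands the last error back to the caller once the retry limit is reached, instead of panicking. It is not the code from `tee_prover.rs`: `AttemptError`, `run_with_retries`, and the injected `sleep` callback are hypothetical names used only for illustration, and treating a 503 as "retriable" is an assumption.

```rust
use std::time::Duration;

/// Outcome of one failed attempt (hypothetical type, not from zksync-era).
#[derive(Debug, PartialEq)]
enum AttemptError {
    /// Transient failure, e.g. an HTTP 503 during a redeployment; worth retrying.
    Retriable(String),
    /// Any other failure; retrying is not expected to help.
    Fatal(String),
}

/// Calls `attempt` until it succeeds, fails fatally, or `max_retries`
/// retries have been used. `sleep` is called between retriable failures.
/// The last error is returned to the caller rather than causing a panic.
fn run_with_retries<T>(
    max_retries: usize,
    backoff: Duration,
    mut attempt: impl FnMut() -> Result<T, AttemptError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, AttemptError> {
    let mut retries = 0;
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            // Transient error and budget left: wait, then try again.
            Err(AttemptError::Retriable(_)) if retries < max_retries => {
                retries += 1;
                sleep(backoff);
            }
            // Retry budget exhausted, or the error is not retriable.
            Err(err) => return Err(err),
        }
    }
}
```

In a real service `sleep` would typically wrap an async timer; passing it in as a callback just keeps the sketch testable without waiting. What the caller does with the returned error (log and continue, or exit) is a separate design choice.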
With this code change, seconds, which is 8 minutes and 31 seconds if my math is correct. :P Not sure if just increasing the number of retries is good enough, though. I welcome any other suggestions. Perhaps we can allow panics during redeployments?
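The arithmetic in the comment above can be checked with a short sketch. 8 minutes and 31 seconds is 511 seconds. One schedule that sums to exactly that value is nine waits that double from 1 second (1 + 2 + ... + 256 = 511), but that schedule is an assumption made here for illustration; the comment does not state the actual retry parameters. The function names and arguments below are likewise hypothetical.

```rust
use std::time::Duration;

/// Builds the list of waits for `retries` retries: the first wait is
/// `initial`, each later wait is the previous one times `factor`,
/// capped at `max_delay`. All parameters are illustrative.
fn backoff_schedule(
    initial: Duration,
    factor: u32,
    max_delay: Duration,
    retries: u32,
) -> Vec<Duration> {
    let mut delays = Vec::with_capacity(retries as usize);
    let mut next = initial;
    for _ in 0..retries {
        delays.push(next);
        // Grow the wait, never exceeding the cap.
        next = next.saturating_mul(factor).min(max_delay);
    }
    delays
}

/// Total time spent waiting across the whole schedule.
fn total_wait(delays: &[Duration]) -> Duration {
    delays.iter().sum()
}

/// Renders a duration as whole minutes and seconds.
fn format_min_sec(total: Duration) -> String {
    let secs = total.as_secs();
    format!("{} min {} s", secs / 60, secs % 60)
}
```

With the assumed schedule, `format_min_sec(total_wait(&backoff_schedule(Duration::from_secs(1), 2, Duration::from_secs(3600), 9)))` yields `"8 min 31 s"`, which agrees with the figure quoted in the comment.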
@popzxc, PTAL. I’ve addressed all your comments. I think the failing tests are just being flaky.
With teepot PR 196 (matter-labs/teepot#196) merged, update the `flake.lock` for `teepot` to use the `--env-prefix` argument for `tee-key-preexec`. This aligns the environment variable names, which were changed in #2764.
Signed-off-by: Harald Hoyer <harald@matterlabs.dev>
…2789)

## What ❔
With matter-labs/teepot#196 merged, update the `flake.lock` for `teepot` to use the `--env-prefix` argument for `tee-key-preexec`.

## Why ❔
This aligns the environment variable names, which were changed in #2764

## Checklist
<!-- Check your PR fulfills the following items. -->
<!-- For draft PRs check the boxes as you complete them. -->
- [x] PR title corresponds to the body of PR (we generate changelog entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [ ] Code has been formatted via `zk fmt` and `zk lint`.

Signed-off-by: Harald Hoyer <harald@matterlabs.dev>
🤖 I have created a release *beep* *boop*

---

## [24.24.0](core-v24.23.0...core-v24.24.0) (2024-09-05)

### Features

* conditional cbt l1 updates ([#2748](#2748)) ([6d18061](6d18061))
* **eth-watch:** do not query events from earliest block ([#2810](#2810)) ([1da3f7e](1da3f7e))
* **genesis:** Validate genesis config against L1 ([#2786](#2786)) ([b2dd9a5](b2dd9a5))
* Integrate tracers and implement circuits tracer in vm2 ([#2653](#2653)) ([87b02e3](87b02e3))
* Move prover data to /home/popzxc/workspace/current/zksync-era/prover/data ([#2778](#2778)) ([62e4d46](62e4d46))
* Remove prover db from house keeper ([#2795](#2795)) ([85b7346](85b7346))
* **vm-runner:** Implement batch data prefetching ([#2724](#2724)) ([d01840d](d01840d))
* **vm:** Extract batch executor to separate crate ([#2702](#2702)) ([b82dfa4](b82dfa4))
* **vm:** Simplify VM interface ([#2760](#2760)) ([c3bde47](c3bde47))
* **zk_toolbox:** add multi-chain CI integration test ([#2594](#2594)) ([05c940e](05c940e))

### Bug Fixes

* **config:** Do not panic for observability config ([#2639](#2639)) ([1e768d4](1e768d4))
* **core:** Batched event processing support for Reth ([#2623](#2623)) ([958dfdc](958dfdc))
* return correct witness inputs ([#2770](#2770)) ([2516e2e](2516e2e))
* **tee-prover:** increase retries to reduce spurious alerts ([#2776](#2776)) ([4fdc806](4fdc806))
* **tee-prover:** mitigate panic on redeployments ([#2764](#2764)) ([178b386](178b386))
* **tee:** lowercase enum TEE types ([#2798](#2798)) ([0f2f9bd](0f2f9bd))
* **vm-runner:** Fix statement timeouts in VM playground ([#2772](#2772)) ([d3cd553](d3cd553))

### Performance Improvements

* **vm:** Fix VM performance regression on CI loadtest ([#2782](#2782)) ([bc0d7d5](bc0d7d5))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <zksync-era-bot@users.noreply.github.com>
What ❔
We experienced a `tee-prover` panic, likely due to the automatic redeployment of the `proof-data-handler` in the `staging` environment. We've been getting `503 Service Unavailable` errors for an extended period when trying to reach http://server-v2-proof-data-handler-internal.stage.matterlabs.corp/tee/proof_input, which resulted in a panic after reaching the retry limit.
Relevant code causing the panic: zksync-era/core/bin/zksync_tee_prover/src/tee_prover.rs, lines 201 to 203 in 8ed086a.
Relevant logs: see the Grafana link above.

Why ❔
To mitigate panics on `proof-data-handler` redeployments.

Checklist
- Code has been formatted via `zk fmt` and `zk lint`.