Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(proof_data_handler): Unlock jobs on transient errors #2486

Merged
merged 2 commits into from
Jul 26, 2024

Conversation

popzxc
Copy link
Member

@popzxc popzxc commented Jul 25, 2024

What ❔

Currently, proof data handler locks the job for proof generation, and starts fetching required data after that.
If any error happens during fetching, the method will err, and the job will remain locked.

This PR changes it, so that if any error occurs, we unlock the job before we return an error.

Additionally, it reduces the amount of non-necessary panics in the touched code, and adds some docs.

Why ❔

Correctness & efficiency.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zk fmt and zk lint.

@popzxc popzxc requested review from EmilLuta and Artemka374 July 25, 2024 09:27
@popzxc popzxc enabled auto-merge July 26, 2024 03:53
@popzxc popzxc added this pull request to the merge queue Jul 26, 2024
Merged via the queue into main with commit 7c336b1 Jul 26, 2024
47 checks passed
@popzxc popzxc deleted the popzxc-proof-data-handler-unpick-jobs branch July 26, 2024 04:37
github-merge-queue bot pushed a commit that referenced this pull request Aug 1, 2024
🤖 I have created a release *beep* *boop*
---


##
[24.13.0](core-v24.12.0...core-v24.13.0)
(2024-07-31)


### Features

* Add recovery tests to zk_supervisor
([#2444](#2444))
([0c0d10a](0c0d10a))
* Added a JSON RPC to simulating L1 for consensus attestation
([#2480](#2480))
([c6b3adf](c6b3adf))
* added dropping all attester certificates when doing hard fork
([#2529](#2529))
([5acd686](5acd686))
* **configs:** Do not panic if config is only partially filled
([#2545](#2545))
([db13fe3](db13fe3))
* **eth-sender:** Make eth-sender tests use blob txs + refactor of
eth-sender tests
([#2316](#2316))
([c8c8334](c8c8334))
* Introduce more tracing instrumentation
([#2523](#2523))
([79d407a](79d407a))
* Remove unused VKs, add docs for BWG
([#2468](#2468))
([2fa6bf0](2fa6bf0))
* Revisit base config values
([#2532](#2532))
([3fac8ac](3fac8ac))
* Server 10k gwei limit on gas price and 1M limit on pubdata price
([#2460](#2460))
([be238cc](be238cc))
* **vlog:** Implement otlp guard with force flush on drop
([#2536](#2536))
([c9f76e5](c9f76e5))
* **vlog:** New vlog interface + opentelemtry improvements
([#2472](#2472))
([c0815cd](c0815cd))
* **zk_toolbox:** add test upgrade subcommand to zk_toolbox
([#2515](#2515))
([1a12f5f](1a12f5f))
* **zk_toolbox:** use configs from the main repo
([#2470](#2470))
([4222d13](4222d13))


### Bug Fixes

* **contract verifier:** Fix config values
([#2510](#2510))
([3729468](3729468))
* fixed panic propagation
([#2525](#2525))
([e0fc58b](e0fc58b))
* **proof_data_handler:** Unlock jobs on transient errors
([#2486](#2486))
([7c336b1](7c336b1))
* **prover:** Parallelize circuit metadata uploading for BWG
([#2520](#2520))
([f49720f](f49720f))
* VM performance diff: don't show 0 as N/A
([#2276](#2276))
([2fa2249](2fa2249))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot <zksync-era-bot@users.noreply.github.com>
@pbeza
Copy link
Collaborator

pbeza commented Aug 21, 2024

@popzxc I think you forgot to update this diagram. I prepared a PR fixing it: #2722.

BTW, I might be repeating myself, but unlike @perekopskiy, I think we should use an ENUM for the status in the db schema itself (instead of TEXT). Right now, we just have an enum in the code here. Having it in the SQL schema too makes it easier to understand and work with the business logic.

pbeza added a commit that referenced this pull request Aug 21, 2024
Also, cherry-pick some of the ideas introduced in
#2486
pbeza added a commit that referenced this pull request Aug 22, 2024
Also, cherry-pick some of the ideas introduced in
#2486
pbeza added a commit that referenced this pull request Aug 23, 2024
It should have been updated as part of these 2 PRs:
- #2258
- #2486
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2024
It should have been updated as part of these 2 PRs:
- #2258
- #2486

## What ❔

Update ProofGenerationDal docs chart.

## Why ❔

We like up-to-date docs.

## Checklist

- [x] PR title corresponds to the body of PR (we generate changelog
entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [x] Documentation comments have been added / updated.
- [x] Code has been formatted via `zk fmt` and `zk lint`.
@popzxc
Copy link
Member Author

popzxc commented Aug 27, 2024

@pbeza thanks for looking into it!

Regarding enums and stronger typing: I don't think @perekopskiy meant that "enums bad, strong typing bad" -- he raised a valid question at the beginning -- what's the cost of this solution? For the prover DB it's even more important. A few immediate things that come to mind:

  • we will have to have a blocking migration to introduce it (and I'm not sure if it will be quick). Alternatively, we will have to have multi-step migration, which is always tricky.
  • each time we want to add a new state (and we might! especially with the external provers that we have in mind), we will have to do a migration.
  • the ergonomics of using enums in SQLx are questionable, and it forces queries to look more dynamic than they in fact are.
  • it's some work that has to be done, and we're always at capacity. Time doing it can be spent on something more urgent.

On the other hand, what are the drawbacks of having it in text? The main one is the possibility of typo/error/using non-existent status. It is a valid concern, but in that particular case it's easily discoverable (each typo will inevitably cause the prover to deadlock). Moreover, we have DAL as a layer of isolation, so the surface for bugs is pretty narrow.

Just in case, I as well don't want to say "enums bad, strong typing bad". What I want to say is that everything is a tradeoff and the answer is usually "it depends". If you have strong opinions -- you are more than welcome to write a design doc, compare pros/cons of both approaches, and present it -- you will be heard, and your proposal will be discussed. But it must come with a good motivation to change something that works and works well (even if not fully idiomatic).

@pbeza
Copy link
Collaborator

pbeza commented Aug 27, 2024

Thanks for the detailed explnation!

Just to clarify my PoV: I think you might be overlooking the human factor in coding. You, @popzxc, and @perekopskiy are likely very familiar and comfortable with the codebase, aware of many design decisions made over the last months or years. However, if someone is new to the code, having status as TEXT feels unnatural unless there's clear explanation in front of them.

One of the key reasons why using ENUMs would be a more natural choice here is that it makes the code easier to reason about. Using TEXT introduces the temptation to add new statuses in an ad hoc manner, rather than properly engineering a fully-fledged solution. In fact, we're about to introduce a new status responsible for permanently skipping batches (cc @thomasknauth). This status will only be represented in Rust's enum, not in the DB's ENUM, which may be sufficient.

That said, I get why you went with TEXT instead of ENUM. I worked on a project once where they made similar trade-offs with the database. The database setup wasn’t very idiomatic either, but it was a performance trade-off in that case.

Anyway, thanks for the thoughtful discussion! ⚔️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants