
feat: new proving broker implementation #9400

Merged: 7 commits merged into master from ag/scaling-provers on Nov 15, 2024

Conversation

@alexghr (Contributor) commented on Oct 24, 2024:

Reopening of #8609, which was closed/merged by mistake. This PR is stacked on top of #9391

This PR adds ProvingBroker, which implements a new interface for distributing proving jobs to workers, as specified in #8495.

@alexghr force-pushed the ag/refactor-avm-proof-fields branch from 6ec88c6 to ca65f95 on November 12, 2024 09:34
Base automatically changed from ag/refactor-avm-proof-fields to master November 12, 2024 10:55
@@ -237,3 +237,254 @@ export const ProvingRequestResultSchema = z.discriminatedUnion('type', [
result: schemaForRecursiveProofAndVerificationKey(TUBE_PROOF_LENGTH),
}),
]) satisfies ZodFor<ProvingRequestResult>;

export const V2ProvingJobIdSchema = z.custom<`${ProvingRequestType}:${string}`>().brand('ProvingJobId');
@alexghr (Contributor, Author) commented:
Will rename these once things are stable

A collaborator commented:
Ohh branding, so fancy!
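
For context on what branding buys here: zod's .brand() tags the schema's inferred type with a phantom marker, so a plain string cannot be passed where a branded ID is expected. A minimal illustrative sketch (simplified names, not the PR's exact types):

import { z } from 'zod';

// Branding adds a compile-time-only marker to the inferred type:
// ProvingJobId is `string & z.BRAND<'ProvingJobId'>`.
const ProvingJobIdSchema = z.string().brand('ProvingJobId');
type ProvingJobId = z.infer<typeof ProvingJobIdSchema>;

declare function getProvingJobStatus(id: ProvingJobId): void;

// Parsing is the only way to obtain a branded value:
const id = ProvingJobIdSchema.parse('BASE_ROLLUP:abc123');
getProvingJobStatus(id); // ok
// getProvingJobStatus('BASE_ROLLUP:abc123'); // compile error: plain string is unbranded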

* @param id - The ID of the proving job, which encodes the type of proof that was requested
* @param value - The result of the proof request
*/
setProvingJobResult<T extends ProvingRequestType>(id: V2ProvingJobId<T>, value: V2ProvingResult): Promise<void>;
@alexghr (Contributor, Author) commented:
In the next PR I will be creating a separate store for proof inputs and outputs so that they don't go over multiple hops between the orchestrator and the agent (for example, backed by an S3 bucket).

setProvingJobError<T extends ProvingRequestType>(id: V2ProvingJobId<T>, err: Error): Promise<void>;
}
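
For context on the separate proof I/O store mentioned above, a purely hypothetical sketch of what such an interface might look like (all names and the S3 URI scheme are illustrative, not from this PR):

// Broker and agents would exchange only URIs; proof payloads live in
// external storage (e.g. an S3 bucket), avoiding multi-hop transfers.
type ProofUri = string; // e.g. 's3://bucket/proofs/inputs/abc123'

interface ProofStore {
  saveProofInput(jobId: string, input: Uint8Array): Promise<ProofUri>;
  saveProofOutput(jobId: string, output: Uint8Array): Promise<ProofUri>;
  getProofInput(uri: ProofUri): Promise<Uint8Array>;
  getProofOutput(uri: ProofUri): Promise<Uint8Array>;
}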

export class InMemoryDatabase implements ProvingBrokerDatabase {
@alexghr (Contributor, Author) commented:
The next PR will add an LMDB implementation, now that we have zod schemas that aid in serialisation/deserialisation 🥳
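
For context, a minimal sketch of how a zod schema aids serialisation/deserialisation for a store like LMDB (hypothetical schema, not the PR's actual one):

import { z } from 'zod';

// Values are written as plain JSON; on the way back out the schema
// validates the shape before the value is used.
const JobSchema = z.object({ id: z.string(), type: z.number() });
type Job = z.infer<typeof JobSchema>;

function serialize(job: Job): Buffer {
  return Buffer.from(JSON.stringify(job));
}

function deserialize(raw: Buffer): Job {
  return JobSchema.parse(JSON.parse(raw.toString('utf8'))); // throws on corrupt data
}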

Comment on lines 4 to 7
export function makeProvingJobId<T extends ProvingRequestType>(proofType: T): V2ProvingJobId<T> {
const id = randomBytes(8).toString('hex');
return `${ProvingRequestType[proofType]}:${id}` as V2ProvingJobId<T>;
}
@alexghr (Contributor, Author) commented on Nov 13, 2024:
Leftover from the previous implementation, where the ID contained the proof type. This is no longer needed since the type is used in a zod discriminated union: #8609 (comment)

@spalladino (Collaborator) left a review:
LGTM!

*/
export class ProvingBroker implements ProvingJobProducer, ProvingJobConsumer {
// Each proof type gets its own queue so that agents can request specific types of proofs
// Manually create the queues to get type checking
A collaborator commented:
I think that if you use mapTuple you can get the type check you're after. But that's just if you want to spend some time playing against tsc, otherwise this looks good.
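
For context, a sketch of the tuple-preserving map being suggested; this assumes a mapTuple helper along these lines, which may differ from the repo's actual utility:

// A minimal mapTuple: maps over a tuple while preserving its length in the
// type system, so tsc can keep the per-element exhaustiveness check.
function mapTuple<T extends readonly unknown[], U>(
  tuple: T,
  fn: (item: T[number]) => U,
): { [K in keyof T]: U } {
  // The runtime is a plain Array.map; the cast restores the tuple shape.
  return tuple.map(fn) as unknown as { [K in keyof T]: U };
}

// e.g. one queue per proof type, with the tuple length checked statically:
const proofTypes = ['BASE_ROLLUP', 'TUBE_PROOF'] as const;
const queues = mapTuple(proofTypes, () => [] as object[]);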

@alexghr (Contributor, Author) replied:
Ah, the code comment above isn't relevant anymore. Before the merge with the zod schema, each queue was strongly typed (e.g. PriorityQueue<ProvingJob<ProofType.BaseRollup>>), and manually instantiating the queues made TS exhaustively type-check all the queues.
With the schema (and the type inferred from it) the job no longer has a template parameter and is instead a union. Not as type safe, but 10000x easier to use :)
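
For context, a simplified contrast of the two designs described above (illustrative types, not the PR's actual schema):

import { z } from 'zod';

// After the zod merge: one discriminated union covers every job variant,
// so a single queue type can hold all jobs.
const ProvingJobSchema = z.discriminatedUnion('type', [
  z.object({ type: z.literal('BASE_ROLLUP'), inputs: z.string() }),
  z.object({ type: z.literal('TUBE_PROOF'), inputs: z.string() }),
]);
type ProvingJob = z.infer<typeof ProvingJobSchema>; // a plain union

// Before, per-type queues like PriorityQueue<ProvingJob<ProofType.BaseRollup>>
// forced tsc to check each queue; the union trades that for simplicity.
const queue: ProvingJob[] = [];
queue.push({ type: 'BASE_ROLLUP', inputs: '...' });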

}
}

// eslint-disable-next-line require-await
A collaborator commented:
I kinda hate this rule, should we disable it altogether? (in another PR, for sure)

@alexghr (Contributor, Author) replied:
I'm in favour ➕

Comment on lines 198 to 203
if (retry && retries + 1 < this.maxRetries) {
this.logger.info(`Retrying proving job id=${id} retry=${retries + 1}`);
this.retries.set(id, retries + 1);
this.enqueueJobInternal(item);
return;
}
A collaborator commented:
We are not storing retries in the db, right? (I'm fine with either)

@alexghr (Contributor, Author) replied:
Correct, the retry count is ephemeral. If the broker dies, it starts fresh the next time (but it still populates the queues from the db).
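
A rough sketch of that behaviour (hypothetical names; the actual broker code may differ):

// Retry counts live only in memory; on boot the queues are rebuilt from
// the persisted jobs, so a crash resets retries but loses no jobs.
class BrokerSketch {
  private retries = new Map<string, number>(); // ephemeral

  constructor(private db: { allJobs(): Iterable<{ id: string }> }) {}

  start(): void {
    for (const job of this.db.allJobs()) {
      this.enqueueJobInternal(job); // retry count implicitly back to 0
    }
  }

  private enqueueJobInternal(job: { id: string }): void {
    // push onto the appropriate in-memory priority queue
  }
}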

Comment on lines 218 to 220
} else if (filter) {
this.logger.warn(`Proving job id=${id} not found in the in-progress set. Sending new one`);
return this.getProvingJob(filter);
A collaborator commented:
Not sure I understand this bit. Why do we return any job in the filter in this method?

A collaborator replied:
I think it's because this design can result in more than one agent proving the same job. No information is persisted about a job currently being proven; this is to drastically reduce DB writes. The result is that after a restart the broker doesn't know which jobs are in progress and could re-issue them.

I think the idea here is that if I was proving job 'x' and it is no longer in progress here, then it must have been completed by somebody else and I get given a new job. Is that right @alexghr?

However, I would have thought we would want some more logic here. If the metadata contained an ID of who has the job, then we could see if somebody else has the same job before it's finished, couldn't we?

@alexghr (Contributor, Author) replied:
> I think the idea here is that if I was proving job 'x' and it is no longer in progress here, then it must have been completed by somebody else and I get given a new job. Is that right @alexghr?

Correct 👍

> However, I would have thought we would want some more logic here. If the metadata contained an ID of who has the job, then we could see if somebody else has the same job before it's finished, couldn't we?

You're right, this is currently unhandled, so two agents can work on the same job in parallel (if the broker restarts). Will fix.

@alexghr (Contributor, Author) commented:
This is now correctly handled:

  • the broker crashes and restarts, losing its in-memory data
  • an old agent reports progress => the new broker instance marks the job as in progress
  • a new agent reports progress on the same job => it receives a new job

After a crash, the broker also now correctly avoids handing out jobs that are still in progress.
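
A rough sketch of that reconciliation (hypothetical method and field names; the assumption is that the earliest-started agent wins):

// After a restart the in-progress map is empty; agent progress reports rebuild it.
class BrokerRecoverySketch {
  private inProgress = new Map<string, { startedAt: number }>();

  reportProgress(id: string, startedAt: number): 'continue' | 'get-new-job' {
    const current = this.inProgress.get(id);
    if (!current) {
      // Broker restarted and lost track of this job: re-adopt it.
      this.inProgress.set(id, { startedAt });
      return 'continue';
    }
    if (current.startedAt < startedAt) {
      // An agent that started earlier already owns this job.
      return 'get-new-job';
    }
    // The reporting agent started earliest: prefer it and update the record.
    this.inProgress.set(id, { startedAt });
    return 'continue';
  }
}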

Comment on lines +55 to +59
// holds a copy of the database in memory in order to quickly fulfill requests
// this is fine because this broker is the only one that can modify the database
private jobsCache = new Map<V2ProvingJobId, V2ProvingJob>();
// as above, but for results
private resultsCache = new Map<V2ProvingJobId, V2ProvingJobResult>();
A collaborator commented:
Given lmdb is already in-memory, is there a significant improvement by not having to cross to the C-land boundary and keeping these copies here?
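
For context, the single-writer assumption in the snippet above is what keeps such a mirror safe; a minimal hypothetical sketch of the write-through pattern:

// Since the broker is the only writer, the in-memory map can never go
// stale relative to the database, and reads never cross into LMDB.
class CachedDatabaseSketch<V> {
  private cache = new Map<string, V>();

  constructor(private db: { put(key: string, value: V): Promise<void> }) {}

  async set(key: string, value: V): Promise<void> {
    await this.db.put(key, value); // persist first
    this.cache.set(key, value); // then mirror in memory
  }

  get(key: string): V | undefined {
    return this.cache.get(key);
  }
}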


@alexghr alexghr enabled auto-merge (squash) November 14, 2024 15:37
@PhilWindle (Collaborator) left a review:
Like it

@alexghr merged commit da711bf into master on Nov 15, 2024
66 checks passed
@alexghr deleted the ag/scaling-provers branch on November 15, 2024 13:27
just-mitch pushed a commit that referenced this pull request Nov 16, 2024
Reopening of #8609, which was closed/merged by mistake. This PR is
stacked on top of #9391

This PR adds ProvingBroker which implements a new interface for
distributing proving jobs to workers as specified in
#8495
TomAFrench added a commit that referenced this pull request Nov 18, 2024
* master: (281 commits)
  fix: don't take down runners with faulty runner check (#10019)
  feat(docs): add transaction profiler docs (#9932)
  chore: hotfix runner wait (#10018)
  refactor: remove EnqueuedCallSimulator (#10015)
  refactor: stop calling public kernels (#9971)
  git subrepo push --branch=master noir-projects/aztec-nr
  git_subrepo.sh: Fix parent in .gitrepo file. [skip ci]
  chore: replace relative paths to noir-protocol-circuits
  git subrepo push --branch=master barretenberg
  chore: drop info to verbose in sequencer hot loop (#9983)
  refactor: Trace structure is an object (#10003)
  refactor: enqueued calls processor -> public tx simulator (#9919)
  chore: World state tech debt cleanup 1 (#9561)
  chore(ci): run noir tests in parallel to building e2e tests (#9977)
  Revert "chore: lower throughput of ebs disks" (#9996)
  feat: new proving broker implementation (#9400)
  chore: replace `to_radix` directive with brillig (#9970)
  chore: disable failing 48validator kind test (#9920)
  test: prove one epoch in kind (#9886)
  fix: formatting (#9979)
  ...