
feat: new proving broker implementation #9400

Merged: 7 commits merged into master from ag/scaling-provers on Nov 15, 2024

Conversation

@alexghr (Contributor) commented on Oct 24, 2024:

Reopening of #8609, which was closed/merged by mistake. This PR is stacked on top of #9391

This PR adds ProvingBroker, which implements a new interface for distributing proving jobs to workers, as specified in #8495.

@alexghr force-pushed the ag/refactor-avm-proof-fields branch from 6ec88c6 to ca65f95 on November 12, 2024 09:34
Base automatically changed from ag/refactor-avm-proof-fields to master November 12, 2024 10:55
@@ -237,3 +237,254 @@ export const ProvingRequestResultSchema = z.discriminatedUnion('type', [
result: schemaForRecursiveProofAndVerificationKey(TUBE_PROOF_LENGTH),
}),
]) satisfies ZodFor<ProvingRequestResult>;

export const V2ProvingJobIdSchema = z.custom<`${ProvingRequestType}:${string}`>().brand('ProvingJobId');
@alexghr (Contributor, Author) commented:
Will rename these once things are stable

A collaborator commented:
Ohh branding, so fancy!
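
For context on what branding buys here: zod's .brand() tags the schema's inferred type with a phantom marker, so a plain string cannot be passed where a branded ID is expected. A minimal illustrative sketch (simplified names, not the PR's exact types):

import { z } from 'zod';

// Branding adds a compile-time-only marker to the inferred type:
// ProvingJobId is `string & z.BRAND<'ProvingJobId'>`.
const ProvingJobIdSchema = z.string().brand('ProvingJobId');
type ProvingJobId = z.infer<typeof ProvingJobIdSchema>;

declare function getProvingJobStatus(id: ProvingJobId): void;

// Parsing is the only way to obtain a branded value:
const id = ProvingJobIdSchema.parse('BASE_ROLLUP:abc123');
getProvingJobStatus(id); // ok
// getProvingJobStatus('BASE_ROLLUP:abc123'); // compile error: plain string is unbranded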

* @param id - The ID of the proving job, which encodes the type of proof that was requested
* @param value - The result of the proof request
*/
setProvingJobResult<T extends ProvingRequestType>(id: V2ProvingJobId<T>, value: V2ProvingResult): Promise<void>;
@alexghr (Contributor, Author) commented:
In the next PR I will be creating a separate store for proof inputs and outputs so that they don't go over multiple hops between the orchestrator and the agent (for example, backed by an S3 bucket).

setProvingJobError<T extends ProvingRequestType>(id: V2ProvingJobId<T>, err: Error): Promise<void>;
}
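
For context on the separate proof I/O store mentioned above, a purely hypothetical sketch of what such an interface might look like (all names and the S3 URI scheme are illustrative, not from this PR):

// Broker and agents would exchange only URIs; proof payloads live in
// external storage (e.g. an S3 bucket), avoiding multi-hop transfers.
type ProofUri = string; // e.g. 's3://bucket/proofs/inputs/abc123'

interface ProofStore {
  saveProofInput(jobId: string, input: Uint8Array): Promise<ProofUri>;
  saveProofOutput(jobId: string, output: Uint8Array): Promise<ProofUri>;
  getProofInput(uri: ProofUri): Promise<Uint8Array>;
  getProofOutput(uri: ProofUri): Promise<Uint8Array>;
}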

export class InMemoryDatabase implements ProvingBrokerDatabase {
@alexghr (Contributor, Author) commented:
The next PR will add an LMDB implementation, now that we have zod schemas that aid in serialisation/deserialisation 🥳
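
For context, a minimal sketch of how a zod schema aids serialisation/deserialisation for a store like LMDB (hypothetical schema, not the PR's actual one):

import { z } from 'zod';

// Values are written as plain JSON; on the way back out the schema
// validates the shape before the value is used.
const JobSchema = z.object({ id: z.string(), type: z.number() });
type Job = z.infer<typeof JobSchema>;

function serialize(job: Job): Buffer {
  return Buffer.from(JSON.stringify(job));
}

function deserialize(raw: Buffer): Job {
  return JobSchema.parse(JSON.parse(raw.toString('utf8'))); // throws on corrupt data
}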

Comment on lines 4 to 7
export function makeProvingJobId<T extends ProvingRequestType>(proofType: T): V2ProvingJobId<T> {
const id = randomBytes(8).toString('hex');
return `${ProvingRequestType[proofType]}:${id}` as V2ProvingJobId<T>;
}
@alexghr (Contributor, Author) commented on Nov 13, 2024:
Leftover from the previous implementation, where the ID contained the proof type. This is no longer needed since the type is used in a zod discriminated union: #8609 (comment)

@spalladino (Collaborator) left a review:
LGTM!

*/
export class ProvingBroker implements ProvingJobProducer, ProvingJobConsumer {
// Each proof type gets its own queue so that agents can request specific types of proofs
// Manually create the queues to get type checking
A collaborator commented:
I think that if you use mapTuple you can get the type check you're after. But that's just if you want to spend some time playing against tsc, otherwise this looks good.
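
For context, a sketch of the tuple-preserving map being suggested; this assumes a mapTuple helper along these lines, which may differ from the repo's actual utility:

// A minimal mapTuple: maps over a tuple while preserving its length in the
// type system, so tsc can keep the per-element exhaustiveness check.
function mapTuple<T extends readonly unknown[], U>(
  tuple: T,
  fn: (item: T[number]) => U,
): { [K in keyof T]: U } {
  // The runtime is a plain Array.map; the cast restores the tuple shape.
  return tuple.map(fn) as unknown as { [K in keyof T]: U };
}

// e.g. one queue per proof type, with the tuple length checked statically:
const proofTypes = ['BASE_ROLLUP', 'TUBE_PROOF'] as const;
const queues = mapTuple(proofTypes, () => [] as object[]);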

@alexghr (Contributor, Author) replied:
Ah, the code comment above isn't relevant anymore. Before the merge with the zod schema, each queue was strongly typed (e.g. PriorityQueue<ProvingJob<ProofType.BaseRollup>>), and manually instantiating the queues made TS exhaustively type-check all the queues.
With the schema (and the type inferred from it) the job no longer has a template parameter and is instead a union. Not as type safe, but 10000x easier to use :)
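
For context, a simplified contrast of the two designs described above (illustrative types, not the PR's actual schema):

import { z } from 'zod';

// After the zod merge: one discriminated union covers every job variant,
// so a single queue type can hold all jobs.
const ProvingJobSchema = z.discriminatedUnion('type', [
  z.object({ type: z.literal('BASE_ROLLUP'), inputs: z.string() }),
  z.object({ type: z.literal('TUBE_PROOF'), inputs: z.string() }),
]);
type ProvingJob = z.infer<typeof ProvingJobSchema>; // a plain union

// Before, per-type queues like PriorityQueue<ProvingJob<ProofType.BaseRollup>>
// forced tsc to check each queue; the union trades that for simplicity.
const queue: ProvingJob[] = [];
queue.push({ type: 'BASE_ROLLUP', inputs: '...' });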

}
}

// eslint-disable-next-line require-await
A collaborator commented:
I kinda hate this rule, should we disable it altogether? (in another PR, for sure)

@alexghr (Contributor, Author) replied:
I'm in favour ➕

Comment on lines 198 to 203
if (retry && retries + 1 < this.maxRetries) {
this.logger.info(`Retrying proving job id=${id} retry=${retries + 1}`);
this.retries.set(id, retries + 1);
this.enqueueJobInternal(item);
return;
}
A collaborator commented:
We are not storing retries in the db, right? (I'm fine with either)

@alexghr (Contributor, Author) replied:
Correct, the retry count is ephemeral. If the broker dies, it starts fresh the next time (but it still populates the queues from the db).
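
A rough sketch of that behaviour (hypothetical names; the actual broker code may differ):

// Retry counts live only in memory; on boot the queues are rebuilt from
// the persisted jobs, so a crash resets retries but loses no jobs.
class BrokerSketch {
  private retries = new Map<string, number>(); // ephemeral

  constructor(private db: { allJobs(): Iterable<{ id: string }> }) {}

  start(): void {
    for (const job of this.db.allJobs()) {
      this.enqueueJobInternal(job); // retry count implicitly back to 0
    }
  }

  private enqueueJobInternal(job: { id: string }): void {
    // push onto the appropriate in-memory priority queue
  }
}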

Comment on lines 218 to 220
} else if (filter) {
this.logger.warn(`Proving job id=${id} not found in the in-progress set. Sending new one`);
return this.getProvingJob(filter);
A collaborator commented:
Not sure I understand this bit. Why do we return any job in the filter in this method?

A collaborator replied:
I think it's because this design can result in more than one agent proving the same job. No information is persisted about a job currently being proven; this is to drastically reduce DB writes. The result is that after a restart the broker doesn't know which jobs are in progress and could re-issue them.

I think the idea here is that if I was proving job 'x' and it is no longer in progress here, then it must have been completed by somebody else and I get given a new job. Is that right @alexghr?

However, I would have thought we would want some more logic here. If the metadata contained an ID of who has the job, then we could see if somebody else has the same job before it's finished, couldn't we?

@alexghr (Contributor, Author) replied:
> I think the idea here is that if I was proving job 'x' and it is no longer in progress here, then it must have been completed by somebody else and I get given a new job. Is that right @alexghr?

Correct 👍

> However, I would have thought we would want some more logic here. If the metadata contained an ID of who has the job, then we could see if somebody else has the same job before it's finished, couldn't we?

You're right, this is currently unhandled, so two agents can work on the same job in parallel (if the broker restarts). Will fix.

@alexghr (Contributor, Author) commented:
This is now correctly handled:

  • the broker crashes and restarts, losing its in-memory data
  • an old agent reports progress => the new broker instance marks the job as in progress
  • a new agent reports progress on the same job => it receives a new job

After a crash, the broker also now correctly avoids handing out jobs that are still in progress.
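
A rough sketch of that reconciliation (hypothetical method and field names; the assumption is that the earliest-started agent wins):

// After a restart the in-progress map is empty; agent progress reports rebuild it.
class BrokerRecoverySketch {
  private inProgress = new Map<string, { startedAt: number }>();

  reportProgress(id: string, startedAt: number): 'continue' | 'get-new-job' {
    const current = this.inProgress.get(id);
    if (!current) {
      // Broker restarted and lost track of this job: re-adopt it.
      this.inProgress.set(id, { startedAt });
      return 'continue';
    }
    if (current.startedAt < startedAt) {
      // An agent that started earlier already owns this job.
      return 'get-new-job';
    }
    // The reporting agent started earliest: prefer it and update the record.
    this.inProgress.set(id, { startedAt });
    return 'continue';
  }
}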

Comment on lines +55 to +59
// holds a copy of the database in memory in order to quickly fulfill requests
// this is fine because this broker is the only one that can modify the database
private jobsCache = new Map<V2ProvingJobId, V2ProvingJob>();
// as above, but for results
private resultsCache = new Map<V2ProvingJobId, V2ProvingJobResult>();
A collaborator commented:
Given lmdb is already in-memory, is there a significant improvement by not having to cross to the C-land boundary and keeping these copies here?
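
For context, the single-writer assumption in the snippet above is what keeps such a mirror safe; a minimal hypothetical sketch of the write-through pattern:

// Since the broker is the only writer, the in-memory map can never go
// stale relative to the database, and reads never cross into LMDB.
class CachedDatabaseSketch<V> {
  private cache = new Map<string, V>();

  constructor(private db: { put(key: string, value: V): Promise<void> }) {}

  async set(key: string, value: V): Promise<void> {
    await this.db.put(key, value); // persist first
    this.cache.set(key, value); // then mirror in memory
  }

  get(key: string): V | undefined {
    return this.cache.get(key);
  }
}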


@alexghr alexghr enabled auto-merge (squash) November 14, 2024 15:37
@PhilWindle (Collaborator) left a review:
Like it

@alexghr merged commit da711bf into master on Nov 15, 2024
66 checks passed
@alexghr deleted the ag/scaling-provers branch on November 15, 2024 13:27
just-mitch pushed a commit that referenced this pull request Nov 16, 2024
Reopening of #8609, which was closed/merged by mistake. This PR is
stacked on top of #9391

This PR adds ProvingBroker which implements a new interface for
distributing proving jobs to workers as specified in
#8495
TomAFrench added a commit that referenced this pull request Nov 18, 2024
* master: (281 commits)
  fix: don't take down runners with faulty runner check (#10019)
  feat(docs): add transaction profiler docs (#9932)
  chore: hotfix runner wait (#10018)
  refactor: remove EnqueuedCallSimulator (#10015)
  refactor: stop calling public kernels (#9971)
  git subrepo push --branch=master noir-projects/aztec-nr
  git_subrepo.sh: Fix parent in .gitrepo file. [skip ci]
  chore: replace relative paths to noir-protocol-circuits
  git subrepo push --branch=master barretenberg
  chore: drop info to verbose in sequencer hot loop (#9983)
  refactor: Trace structure is an object (#10003)
  refactor: enqueued calls processor -> public tx simulator (#9919)
  chore: World state tech debt cleanup 1 (#9561)
  chore(ci): run noir tests in parallel to building e2e tests (#9977)
  Revert "chore: lower throughput of ebs disks" (#9996)
  feat: new proving broker implementation (#9400)
  chore: replace `to_radix` directive with brillig (#9970)
  chore: disable failing 48validator kind test (#9920)
  test: prove one epoch in kind (#9886)
  fix: formatting (#9979)
  ...