feat: new proving broker implementation #9400
Conversation
@@ -237,3 +237,254 @@ export const ProvingRequestResultSchema = z.discriminatedUnion('type', [
    result: schemaForRecursiveProofAndVerificationKey(TUBE_PROOF_LENGTH),
  }),
]) satisfies ZodFor<ProvingRequestResult>;

export const V2ProvingJobIdSchema = z.custom<`${ProvingRequestType}:${string}`>().brand('ProvingJobId');
Will rename these once things are stable
Ohh branding, so fancy!
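For anyone unfamiliar, here is a minimal sketch of what the branding gives us (the names below are illustrative, not the PR's): the runtime value is still a plain string, but TypeScript treats it as a distinct nominal type, so unbranded strings can't be passed where a job ID is expected.

```ts
import { z } from 'zod';

// A string schema branded as a nominal type.
const JobIdSchema = z.string().brand('ProvingJobId');
type JobId = z.infer<typeof JobIdSchema>;

declare function getJobStatus(id: JobId): void;

const id = JobIdSchema.parse('BASE_ROLLUP:1a2b3c4d'); // branded at the parse boundary
getJobStatus(id); // ok: `id` carries the brand
// getJobStatus('BASE_ROLLUP:1a2b3c4d'); // compile error: plain string lacks the brand
```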
 * @param id - The ID of the proving job
 * @param value - The result of the proof request
 */
setProvingJobResult<T extends ProvingRequestType>(id: V2ProvingJobId<T>, value: V2ProvingResult): Promise<void>;
In the next PR I will be creating a separate store for proof inputs and outputs, so that they don't travel over multiple hops between the orchestrator and the agent (for example, using an S3 bucket).
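Something along the lines of this hypothetical interface (the names and the S3 layout are assumptions for illustration, not part of this PR): the broker only ever passes around small URIs, while the large proof payloads move directly between the store and whoever needs them.

```ts
// Hypothetical sketch of a separate proof input/output store.
interface ProofStore {
  // Upload the inputs and return a URI (e.g. s3://bucket/inputs/<jobId>)
  // that the agent fetches directly, bypassing the broker.
  saveProofInput(jobId: string, input: Uint8Array): Promise<string>;
  getProofInput(uri: string): Promise<Uint8Array>;

  // Same pattern for outputs: the agent uploads, the orchestrator downloads.
  saveProofOutput(jobId: string, output: Uint8Array): Promise<string>;
  getProofOutput(uri: string): Promise<Uint8Array>;
}
```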
setProvingJobError<T extends ProvingRequestType>(id: V2ProvingJobId<T>, err: Error): Promise<void>;
}

export class InMemoryDatabase implements ProvingBrokerDatabase {
The next PR will add an LMDB implementation, now that we have zod schemas that aid in serialisation/deserialisation 🥳
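Roughly how the schemas help there (a self-contained sketch with stand-in names; the real schema and LMDB wiring live in the PR): values are stringified on the way in and re-validated with the schema on the way out, so anything read from disk is guaranteed to have the expected shape.

```ts
import { z } from 'zod';

// Stand-in schema; the real V2ProvingJob schema lives in the PR.
const JobSchema = z.object({ id: z.string(), type: z.number() });
type Job = z.infer<typeof JobSchema>;

class JobStore {
  constructor(private db = new Map<string, string>()) {} // stand-in for an LMDB handle

  put(job: Job): void {
    this.db.set(job.id, JSON.stringify(job)); // serialise on write
  }

  get(id: string): Job | undefined {
    const raw = this.db.get(id);
    // parse() both deserialises and validates the stored value
    return raw === undefined ? undefined : JobSchema.parse(JSON.parse(raw));
  }
}
```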
export function makeProvingJobId<T extends ProvingRequestType>(proofType: T): V2ProvingJobId<T> {
  const id = randomBytes(8).toString('hex');
  return `${ProvingRequestType[proofType]}:${id}` as V2ProvingJobId<T>;
}
Leftover from the previous implementation, where the ID contained the proof type. This is no longer needed since the proof type is the discriminator of the zod discriminated union #8609 (comment)
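In other words (a self-contained illustration with stand-in names rather than the PR's actual schemas): once jobs are parsed through a union discriminated on `type`, the proof type travels with the job object itself and TypeScript narrows on it, so encoding it into the ID string is redundant.

```ts
import { z } from 'zod';

// Stand-ins for the real enum and schemas:
enum ProvingRequestType { BaseRollup = 0, MergeRollup = 1 }

const JobSchema = z.discriminatedUnion('type', [
  z.object({ type: z.literal(ProvingRequestType.BaseRollup), id: z.string() }),
  z.object({ type: z.literal(ProvingRequestType.MergeRollup), id: z.string() }),
]);

const job = JobSchema.parse({ type: ProvingRequestType.BaseRollup, id: '1a2b3c4d' });
if (job.type === ProvingRequestType.BaseRollup) {
  // `job` is narrowed to the base-rollup variant here; no type prefix in the id needed
}
```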
LGTM!
 */
export class ProvingBroker implements ProvingJobProducer, ProvingJobConsumer {
  // Each proof type gets its own queue so that agents can request specific types of proofs
  // Manually create the queues to get type checking
I think that if you use mapTuple you can get the type check you're after. But that's just if you want to spend some time playing against tsc; otherwise this looks good.
Ah, the comment above isn't relevant anymore. Before the merge with the zod schema each queue was strongly typed (e.g. PriorityQueue<ProvingJob<ProofType.BaseRollup>>), and manually instantiating the queues made TS exhaustively type check all of them.
With the schema (and the type inferred from it) the job type no longer has a template parameter and is a union instead. Not as type safe, but 10000x easier to use :)
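Roughly the trade-off being described (a sketch with invented stand-in types; the real PriorityQueue and job types live in the codebase):

```ts
// Stand-ins for the real types:
enum ProvingRequestType { BaseRollup = 0, MergeRollup = 1 }
type ProvingJob<T extends ProvingRequestType = ProvingRequestType> = { type: T; id: string };
class PriorityQueue<T> {
  private items: T[] = [];
  put(item: T): void { this.items.push(item); }
}

// Before: a mapped type over the enum forces one strongly-typed queue per
// proof type; forgetting a queue is a compile error.
type QueuesByType = { [K in ProvingRequestType]: PriorityQueue<ProvingJob<K>> };
const typedQueues: QueuesByType = {
  [ProvingRequestType.BaseRollup]: new PriorityQueue<ProvingJob<ProvingRequestType.BaseRollup>>(),
  [ProvingRequestType.MergeRollup]: new PriorityQueue<ProvingJob<ProvingRequestType.MergeRollup>>(),
};

// After: the schema-inferred job type is a plain union, so one queue shape
// fits all. Simpler to use, but nothing forces exhaustive per-type handling.
const unionQueues = new Map<ProvingRequestType, PriorityQueue<ProvingJob>>();
```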
}
}

// eslint-disable-next-line require-await
I kinda hate this rule; should we disable it altogether? (in another PR, for sure)
I'm in favour ➕
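For reference, disabling it repo-wide would be a one-line change in the ESLint config (file name/location assumed here):

```js
// .eslintrc.cjs (location assumed) — switch the core rule off everywhere
module.exports = {
  rules: {
    'require-await': 'off',
  },
};
```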
if (retry && retries + 1 < this.maxRetries) {
  this.logger.info(`Retrying proving job id=${id} retry=${retries + 1}`);
  this.retries.set(id, retries + 1);
  this.enqueueJobInternal(item);
  return;
}
We are not storing retries in the db, right? (I'm fine with either)
Correct, the retry count is ephemeral. If the broker dies, it starts fresh the next time (but still populates the queues from the db).
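That is, something along these lines on startup (a sketch; the function and method names are my assumptions):

```ts
// Stand-in types for illustration:
type JobId = string;
interface Job { id: JobId }
interface BrokerDb { allUnfinishedJobs(): Promise<Job[]> }

// On restart the broker rebuilds its queues from the db, but retry counts
// start from zero again: they only ever lived in this Map.
async function startBroker(db: BrokerDb, enqueue: (j: Job) => void): Promise<Map<JobId, number>> {
  const retries = new Map<JobId, number>(); // fresh, ephemeral
  for (const job of await db.allUnfinishedJobs()) {
    enqueue(job); // queues are rebuilt from durable state
  }
  return retries;
}
```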
} else if (filter) {
  this.logger.warn(`Proving job id=${id} not found in the in-progress set. Sending new one`);
  return this.getProvingJob(filter);
Not sure I understand this bit. Why do we return a new job matching the filter from this method?
I think it's because this design can result in more than 1 agent proving a job. No information is persisted about a job currently being proven; this is to drastically reduce DB writes. The result is that after a restart, the broker doesn't know which jobs are in progress and could re-issue them.
I think the idea here is that if I was proving job 'x' and it's no longer in progress here, then it must have been completed by somebody else and I'm given a new job. Is that right @alexghr?
However, I would have thought we'd want some more logic here. If metadata contained an id of who has the job, then we could see if somebody else has the same job before it's finished, couldn't we?
> I think the idea here is that if I was proving job 'x' and it's no longer in progress here, then it must have been completed by somebody else and I'm given a new job. Is that right @alexghr?

Correct 👍

> However, I would have thought we'd want some more logic here. If metadata contained an id of who has the job, then we could see if somebody else has the same job before it's finished, couldn't we?

You're right, this is currently unhandled, so two agents can work the same job in parallel (if the broker restarts). Will fix.
This is now correctly handled:
- broker crashes and restarts losing in-memory data
- old agent reports progress => new broker instance updates job as in progress
- new agent reports progress => it receives a new job
The broker also now correctly avoids handing out in-progress jobs as new work after a crash.
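A sketch of the reporting flow described above (class, method, and field names are my assumptions, not necessarily the PR's actual signatures):

```ts
// Stand-in types:
type JobId = string;
type AgentId = string;

// First agent to report after a restart re-registers the job as in progress;
// any other agent reporting on the same job is told to pick up new work.
class ProgressTracker {
  private inProgress = new Map<JobId, AgentId>();

  report(id: JobId, agent: AgentId): 'continue' | 'get-new-job' {
    const owner = this.inProgress.get(id);
    if (owner === undefined) {
      this.inProgress.set(id, agent); // broker restarted: adopt the reported job
      return 'continue';
    }
    return owner === agent ? 'continue' : 'get-new-job';
  }
}
```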
// holds a copy of the database in memory in order to quickly fulfill requests
// this is fine because this broker is the only one that can modify the database
private jobsCache = new Map<V2ProvingJobId, V2ProvingJob>();
// as above, but for results
private resultsCache = new Map<V2ProvingJobId, V2ProvingJobResult>();
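The single-writer assumption is what makes this safe; roughly the following write-through pattern (a sketch with assumed names):

```ts
// Stand-in types:
type V2ProvingJobId = string;
interface V2ProvingJob { id: V2ProvingJobId }
interface Db { addProvingJob(job: V2ProvingJob): Promise<void> }

// Write-through: the broker is the only writer, so the Map can never go stale
// and reads never need to touch the database.
class CachedJobs {
  private jobsCache = new Map<V2ProvingJobId, V2ProvingJob>();
  constructor(private db: Db) {}

  async add(job: V2ProvingJob): Promise<void> {
    await this.db.addProvingJob(job); // durable write first
    this.jobsCache.set(job.id, job); // then the in-memory copy
  }

  get(id: V2ProvingJobId): V2ProvingJob | undefined {
    return this.jobsCache.get(id); // served entirely from memory
  }
}
```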
Given that lmdb is already in-memory, is there a significant improvement from keeping these copies here rather than crossing the boundary into C land?
Like it
* master: (281 commits)
  fix: don't take down runners with faulty runner check (#10019)
  feat(docs): add transaction profiler docs (#9932)
  chore: hotfix runner wait (#10018)
  refactor: remove EnqueuedCallSimulator (#10015)
  refactor: stop calling public kernels (#9971)
  git subrepo push --branch=master noir-projects/aztec-nr
  git_subrepo.sh: Fix parent in .gitrepo file. [skip ci]
  chore: replace relative paths to noir-protocol-circuits
  git subrepo push --branch=master barretenberg
  chore: drop info to verbose in sequencer hot loop (#9983)
  refactor: Trace structure is an object (#10003)
  refactor: enqueued calls processor -> public tx simulator (#9919)
  chore: World state tech debt cleanup 1 (#9561)
  chore(ci): run noir tests in parallel to building e2e tests (#9977)
  Revert "chore: lower throughput of ebs disks" (#9996)
  feat: new proving broker implementation (#9400)
  chore: replace `to_radix` directive with brillig (#9970)
  chore: disable failing 48validator kind test (#9920)
  test: prove one epoch in kind (#9886)
  fix: formatting (#9979)
  ...
Reopening of #8609, which was closed/merged by mistake. This PR is stacked on top of #9391.
This PR adds the ProvingBroker, which implements a new interface for distributing proving jobs to workers, as specified in #8495.