
JDC does not receive shares anymore from Translator if two new blocks are found in few seconds #920

Open
Tracked by #1138
GitGab19 opened this issue May 21, 2024 · 9 comments
Assignees
Labels
bug Something isn't working job-declarator-client

Comments

@GitGab19
Collaborator

GitGab19 commented May 21, 2024

While working on #901, I noticed that there is a particular case in which JDC stops receiving valid shares from Translator, so it no longer sends them to the pool.
This is the flow that causes the issue:

  • NewTemplate1
  • SetNewPrevHash1
  • DeclareMiningJob1 (future job)
  • a valid share1 is found and submitted by Translator to JDC
  • NewTemplate2
  • SetNewPrevHash2
  • ProvideMissingTransactions1
  • ProvideMissingTransactionsSuccess1
  • DeclareMiningJobSuccess1
  • DeclareMiningJob2 (a new future job)
  • ...
  • DeclareMiningJobSuccess2
  • SetCustomMiningJob2

After this, no more valid shares are received by JDC, even if they are found by Translator.
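The flow above can be encoded as a small trace checker that makes the trigger explicit (a sketch; the enum and helper below are illustrative stand-ins, not SRI protocol code):

```rust
// Encode the reproduction flow as a trace of SV2-style messages and check
// the property that triggers the bug: SetCustomMiningJob(1) never appears,
// so the pool never learns about the job that share1 belongs to.
#[derive(Debug, PartialEq)]
enum Msg {
    NewTemplate(u8),
    SetNewPrevHash(u8),
    DeclareMiningJob(u8),
    Share(u8),
    ProvideMissingTransactions(u8),
    ProvideMissingTransactionsSuccess(u8),
    DeclareMiningJobSuccess(u8),
    SetCustomMiningJob(u8),
}

fn reproduction_trace() -> Vec<Msg> {
    use Msg::*;
    vec![
        NewTemplate(1),
        SetNewPrevHash(1),
        DeclareMiningJob(1),
        Share(1), // found by Translator while job 1 is still being declared
        NewTemplate(2),
        SetNewPrevHash(2),
        ProvideMissingTransactions(1),
        ProvideMissingTransactionsSuccess(1),
        DeclareMiningJobSuccess(1),
        DeclareMiningJob(2),
        DeclareMiningJobSuccess(2),
        SetCustomMiningJob(2),
    ]
}

fn main() {
    let trace = reproduction_trace();
    // The trigger: job 1 was never communicated to the pool.
    assert!(!trace.contains(&Msg::SetCustomMiningJob(1)));
    assert!(trace.contains(&Msg::SetCustomMiningJob(2)));
    println!("trace reproduces the bug: SetCustomMiningJob1 is missing");
}
```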

I started to dig into it, and I'm pretty sure the point where JDC gets stuck is in downstream.rs.
UpstreamMiningNode::get_job_id never returns in this case (there's a loop in there), because the job_id is never inserted through this line. That happens because, for that specific job, a SetCustomMiningJob1 was never sent (in the meantime a new prev_hash and a new template arrived; look at the message flow above).
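The stuck lookup can be sketched as follows (names and types are simplified stand-ins, not the actual SRI code): the job_id map is only filled when SetCustomMiningJob.Success arrives from the pool, so a job whose SetCustomMiningJob was never sent makes the real looping lookup spin forever. The sketch uses a non-blocking variant to show the failing case:

```rust
use std::collections::HashMap;

// Hypothetical simplified state: maps template id -> upstream job_id.
// In the real JDC this map is only filled when SetCustomMiningJob.Success
// arrives from the pool; the names here are illustrative.
struct UpstreamMiningNode {
    job_ids: HashMap<u64, u32>,
}

impl UpstreamMiningNode {
    // The real get_job_id loops until the id appears; a missing
    // SetCustomMiningJob.Success therefore hangs it forever. This
    // non-blocking variant just exposes the failing lookup.
    fn try_get_job_id(&self, template_id: u64) -> Option<u32> {
        self.job_ids.get(&template_id).copied()
    }
}

fn main() {
    let mut node = UpstreamMiningNode { job_ids: HashMap::new() };
    // Template 1 never got a SetCustomMiningJob, so no Success ever inserts its id:
    assert_eq!(node.try_get_job_id(1), None); // the real code would loop here forever
    // Template 2 completed the full declaration flow:
    node.job_ids.insert(2, 42);
    assert_eq!(node.try_get_job_id(2), Some(42));
    println!("shares for an undeclared job can never be sent upstream");
}
```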

@GitGab19
Collaborator Author

I pushed a timeout as a temporary fix for this, but a more robust solution has to be found.
@Fi3 gave me some suggestions privately. Maybe you can explain them better here? 🙏
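The temporary timeout fix can be sketched like this (a minimal stand-in using std channels, not the actual patch): instead of looping forever waiting for the job_id, bound the wait, so a share for a job whose SetCustomMiningJob was never sent is dropped instead of wedging JDC.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Illustrative sketch of the idea behind the temporary fix: bound the wait
// for the upstream job_id so a missing SetCustomMiningJob.Success no longer
// blocks share forwarding forever.
fn wait_for_job_id(rx: &mpsc::Receiver<u32>, timeout: Duration) -> Option<u32> {
    rx.recv_timeout(timeout).ok()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Case 1: SetCustomMiningJob.Success never arrives -> give up after timeout.
    assert_eq!(wait_for_job_id(&rx, Duration::from_millis(50)), None);
    // Case 2: Success arrives in time -> the job_id is returned and the share flows.
    tx.send(7).unwrap();
    assert_eq!(wait_for_job_id(&rx, Duration::from_millis(50)), Some(7));
    println!("timeout fallback prevents the infinite wait");
}
```

This only avoids the hang; shares for the undeclared job are still lost, which is why a more robust solution is needed.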

@pavlenex pavlenex added this to the 1.0.2 milestone May 21, 2024
@xyephy
Contributor

xyephy commented May 30, 2024

Hi @GitGab19, how would I reproduce the issue?
Also, do you have any new findings that might help me solve this?

@GitGab19
Collaborator Author

GitGab19 commented Jun 3, 2024

Hi @GitGab19, how would I reproduce the issue? Also, do you have any new findings that might help me solve this?

We are working on adding an MG test which will be able to reproduce this case.
However, I already described the set of messages which leads to this issue here: #920 (comment)

@plebhash plebhash self-assigned this Jul 2, 2024
@plebhash
Collaborator

plebhash commented Jul 2, 2024

hey @xyephy what's your status on this?

we have this marked as high priority on our Project Board (milestone 1.0.2), so in case you're not able to keep going, I'll take this over

please let me know if you have already made some progress that could accelerate my efforts here

@plebhash
Collaborator

plebhash commented Jul 8, 2024

  • NewTemplate1
  • SetNewPrevHash1
  • DeclareMiningJob1 (future job)
  • a valid share1 is found and submitted by Translator to JDC
  • NewTemplate2
  • SetNewPrevHash2
  • ProvideMissingTransactions1
  • ProvideMissingTransactionsSuccess1
  • DeclareMiningJobSuccess1
  • DeclareMiningJob2 (a new future job)
  • ...
  • DeclareMiningJobSuccess2
  • SetCustomMiningJob2

@GitGab19 how is share1 found before DeclareMiningJobSuccess1?

is that share from a previous job (e.g.: 0, not 1)? or is there some mistake there?

I have the impression that you just wanted to highlight that everything is working as expected up until this point in the message flow, but I want to be sure.

@plebhash
Collaborator

plebhash commented Jul 9, 2024

if we realize this will take too much time/effort, we should consider moving it to milestone 1.0.3

@plebhash
Collaborator

Today I had a call with @rrybarczyk + @Shourya742 where we deep-dived into this issue and started drafting a visual diagram of its message flow (the diagram should be seen as a WIP and might contain inaccuracies).

We left the call feeling quite confident about our progress, but I'm revisiting our diagram a few hours later with fresh eyes, and now I realize we need to do a sanity check on some of the assumptions we made during the call.

What really stands out to me right now is that we assumed that a SetCustomMiningJob.Success is a prerequisite for JDC to send a NewExtendedMiningJob downstream. We used this assumption as the foundation for how we ordered these messages in the diagram.

However, if we read the comments in JDC's main.rs attentively, they tell a different story:

/// Main loop:
/// 1. TemplateRx: <-NewTemplate, SetNewPrevHash
/// 2. JobDeclarator: -> CommitMiningJob (JobDeclarator::on_new_template), <-CommitMiningJobSuccess
/// 3. Upstream: ->SetCustomMiningJob, Downstream: ->NewExtendedMiningJob, ->SetNewPrevHash
/// 4. Downstream: <-Share
/// 5. Upstream: ->Share
///
/// When we have a NewTemplate we send the NewExtendedMiningJob downstream and the CommitMiningJob
/// to the JDS altogether.
/// Then we receive CommitMiningJobSuccess and we use the new token to send SetCustomMiningJob to
/// the pool.
/// When we receive SetCustomMiningJobSuccess we set in Upstream job_id equal to the one received
/// in SetCustomMiningJobSuccess so that we still send shares upstream with the right job_id.
///
/// The above procedure lets us send NewExtendedMiningJob downstream right after a NewTemplate has
/// been received; this reduces the time that passes between a NewTemplate and the mining device
/// starting to mine on the new job.
///
/// In the case of a future NewTemplate, the SetCustomMiningJob is sent only if the candidate
/// becomes the actual NewTemplate, so that we do not send a lot of useless future jobs to the
/// pool. That means that SetCustomMiningJob is sent only when a NewTemplate becomes "active".
///
/// The JobDeclarator always has 2 available tokens, which means that whenever a token is used to
/// commit a job with upstream we request a new one. Always having a token when needed means that
/// whenever we want to commit a mining job we can do that without waiting for upstream to
/// provide a new token.
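The two-token invariant described in that last paragraph can be sketched like this (illustrative only, not the actual JobDeclarator implementation): the pool of tokens is pre-filled with two entries, and every time one is consumed a replacement is requested immediately, so declaring a job never waits on the JDS round-trip.

```rust
use std::collections::VecDeque;

// Sketch of the "always 2 available tokens" idea (illustrative; the real
// JobDeclarator fetches replacement tokens from the JDS asynchronously,
// while here they are synthesized locally).
struct TokenPool {
    tokens: VecDeque<u32>,
    next: u32,
}

impl TokenPool {
    fn new() -> Self {
        // Start with two pre-fetched tokens so a job can always be declared
        // without waiting for a JDS round-trip.
        Self { tokens: VecDeque::from([0, 1]), next: 2 }
    }

    fn take(&mut self) -> u32 {
        let t = self.tokens.pop_front().expect("pool is kept non-empty");
        // Immediately request a replacement so the invariant is restored.
        self.tokens.push_back(self.next);
        self.next += 1;
        t
    }
}

fn main() {
    let mut pool = TokenPool::new();
    for _ in 0..5 {
        pool.take();
        // Invariant: two tokens remain available after every declaration.
        assert_eq!(pool.tokens.len(), 2);
    }
    println!("a token is always available when a job must be declared");
}
```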

This is also connected to the original issue description by @GitGab19 when he said:

The reason is that for that specific job, a SetCustomMiningJob was not sent (because in the meantime a new prev_hash and a new template has arrived)

This suggests that the root cause is related to SetCustomMiningJob never being sent, while during our call we believed that the root cause was a delayed SetCustomMiningJob.Success response from the pool.

So our assumption that a SetCustomMiningJob.Success is a prerequisite for JDC to send a NewExtendedMiningJob downstream is not true.

And I don't know if this is a feature or a bug.

We will definitely need more calls to deep dive into this issue and refine the visual diagram before we even start writing MG tests or thinking about a long-term bugfix.

And we should definitely move this issue away from 1.0.2, because we will not be able to finish this in the immediate future (cc @pavlenex ).


Note:

SV2 message ordering is a very dense topic. We are dealing with scenarios of distributed computing happening across 5 different entities, each doing fundamentally different things.

It is strategically important to have frequent collective exercises where we go through the message flow while regularly doing sanity checks on each other.

These exercises should allow SRI contributors to get a solid foundation on SV2 message flows and also spread/decentralize this knowledge across the community.

@Fi3
Collaborator

Fi3 commented Jul 12, 2024

This is definitely a feature: you want to start mining on a job as soon as you know the job. You assume that the pool is going to accept it (which is what should always happen if everything works). See also my comment here: stratum-mining/sv2-spec#80 (reply in thread)

@plebhash plebhash removed this from the 1.0.2 milestone Jul 15, 2024
@GitGab19
Collaborator Author

  • NewTemplate1
  • SetNewPrevHash1
  • DeclareMiningJob1 (future job)
  • a valid share1 is found and submitted by Translator to JDC
  • NewTemplate2
  • SetNewPrevHash2
  • ProvideMissingTransactions1
  • ProvideMissingTransactionsSuccess1
  • DeclareMiningJobSuccess1
  • DeclareMiningJob2 (a new future job)
  • ...
  • DeclareMiningJobSuccess2
  • SetCustomMiningJob2

@GitGab19 how is share1 found before DeclareMiningJobSuccess1?

is that share from a previous job (e.g.: 0, not 1)? or is there some mistake there?

I have the impression that you just wanted to highlight that everything is working as expected up until this point in the message flow, but I want to be sure.

We already discussed this during the last "diagramming" call, but I'm writing it here just for the record.
Share1 is found at that point in the flow because, as @Fi3 said, the miner starts mining on a new job as soon as it is created by JDC, and in the meantime the whole JD process is executed.

I wanted to highlight that because, since the job that share1 belongs to was never communicated to the pool (through SetCustomMiningJob1), everything starts to break, as described in the issue description.

@pavlenex pavlenex added this to the 1.0.3 milestone Jul 16, 2024
@plebhash plebhash removed their assignment Aug 19, 2024
@plebhash plebhash removed this from the 1.0.3 milestone Aug 20, 2024
xyephy added a commit to xyephy/stratum that referenced this issue Aug 26, 2024
This PR addresses issue stratum-mining#920, where JDC stops receiving valid shares from the Translator when two new blocks are found within a few seconds. The problem occurs due to a race condition in the handling of SetCustomMiningJob messages, which leads to the JDC not registering job IDs properly.