
JDC does not receive shares anymore from Translator if two new blocks are found in few seconds #920

Open
Tracked by #1138
GitGab19 opened this issue May 21, 2024 · 9 comments
Assignees
Labels
bug Something isn't working job-declarator-client

Comments

@GitGab19
Collaborator

GitGab19 commented May 21, 2024

While working on #901, I noticed that there is a particular case in which JDC stops receiving valid shares from Translator, so it no longer sends them to the pool.
This is the flow that causes the issue:

  • NewTemplate1
  • SetNewPrevHash1
  • DeclareMiningJob1 (future job)
  • a valid share1 is found and submitted by Translator to JDC
  • NewTemplate2
  • SetNewPrevHash2
  • ProvideMissingTransactions1
  • ProvideMissingTransactionsSuccess1
  • DeclareMiningJobSuccess1
  • DeclareMiningJob2 (a new future job)
  • ...
  • DeclareMiningJobSuccess2
  • SetCustomMiningJob2

After this, no more valid shares are received by JDC, even if they are found by Translator.
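The flow above can be encoded as a small trace checker that makes the trigger explicit (a sketch; the enum and helper below are illustrative stand-ins, not SRI protocol code):

```rust
// Encode the reproduction flow as a trace of SV2-style messages and check
// the property that triggers the bug: SetCustomMiningJob(1) never appears,
// so the pool never learns about the job that share1 belongs to.
#[derive(Debug, PartialEq)]
enum Msg {
    NewTemplate(u8),
    SetNewPrevHash(u8),
    DeclareMiningJob(u8),
    Share(u8),
    ProvideMissingTransactions(u8),
    ProvideMissingTransactionsSuccess(u8),
    DeclareMiningJobSuccess(u8),
    SetCustomMiningJob(u8),
}

fn reproduction_trace() -> Vec<Msg> {
    use Msg::*;
    vec![
        NewTemplate(1),
        SetNewPrevHash(1),
        DeclareMiningJob(1),
        Share(1), // found by Translator while job 1 is still being declared
        NewTemplate(2),
        SetNewPrevHash(2),
        ProvideMissingTransactions(1),
        ProvideMissingTransactionsSuccess(1),
        DeclareMiningJobSuccess(1),
        DeclareMiningJob(2),
        DeclareMiningJobSuccess(2),
        SetCustomMiningJob(2),
    ]
}

fn main() {
    let trace = reproduction_trace();
    // The trigger: job 1 was never communicated to the pool.
    assert!(!trace.contains(&Msg::SetCustomMiningJob(1)));
    assert!(trace.contains(&Msg::SetCustomMiningJob(2)));
    println!("trace reproduces the bug: SetCustomMiningJob1 is missing");
}
```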

I started to dig into it, and I'm pretty sure the point where JDC gets stuck is in downstream.rs.
UpstreamMiningNode::get_job_id never returns in this case (there's a loop in there), because the job_id is never inserted through this line. That happens because, for that specific job, a SetCustomMiningJob1 was never sent (in the meantime a new prev_hash and a new template arrived; look at the message flow above).
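The stuck lookup can be sketched as follows (names and types are simplified stand-ins, not the actual SRI code): the job_id map is only filled when SetCustomMiningJob.Success arrives from the pool, so a job whose SetCustomMiningJob was never sent makes the real looping lookup spin forever. The sketch uses a non-blocking variant to show the failing case:

```rust
use std::collections::HashMap;

// Hypothetical simplified state: maps template id -> upstream job_id.
// In the real JDC this map is only filled when SetCustomMiningJob.Success
// arrives from the pool; the names here are illustrative.
struct UpstreamMiningNode {
    job_ids: HashMap<u64, u32>,
}

impl UpstreamMiningNode {
    // The real get_job_id loops until the id appears; a missing
    // SetCustomMiningJob.Success therefore hangs it forever. This
    // non-blocking variant just exposes the failing lookup.
    fn try_get_job_id(&self, template_id: u64) -> Option<u32> {
        self.job_ids.get(&template_id).copied()
    }
}

fn main() {
    let mut node = UpstreamMiningNode { job_ids: HashMap::new() };
    // Template 1 never got a SetCustomMiningJob, so no Success ever inserts its id:
    assert_eq!(node.try_get_job_id(1), None); // the real code would loop here forever
    // Template 2 completed the full declaration flow:
    node.job_ids.insert(2, 42);
    assert_eq!(node.try_get_job_id(2), Some(42));
    println!("shares for an undeclared job can never be sent upstream");
}
```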

@GitGab19
Collaborator Author

I pushed a timeout as a temporary fix for this, but a more robust solution has to be found.
@Fi3 gave me some suggestions privately. Maybe you can explain them better here? 🙏
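The temporary timeout fix can be sketched like this (a minimal stand-in using std channels, not the actual patch): instead of looping forever waiting for the job_id, bound the wait, so a share for a job whose SetCustomMiningJob was never sent is dropped instead of wedging JDC.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Illustrative sketch of the idea behind the temporary fix: bound the wait
// for the upstream job_id so a missing SetCustomMiningJob.Success no longer
// blocks share forwarding forever.
fn wait_for_job_id(rx: &mpsc::Receiver<u32>, timeout: Duration) -> Option<u32> {
    rx.recv_timeout(timeout).ok()
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Case 1: SetCustomMiningJob.Success never arrives -> give up after timeout.
    assert_eq!(wait_for_job_id(&rx, Duration::from_millis(50)), None);
    // Case 2: Success arrives in time -> the job_id is returned and the share flows.
    tx.send(7).unwrap();
    assert_eq!(wait_for_job_id(&rx, Duration::from_millis(50)), Some(7));
    println!("timeout fallback prevents the infinite wait");
}
```

This only avoids the hang; shares for the undeclared job are still lost, which is why a more robust solution is needed.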

@pavlenex pavlenex added this to the 1.0.2 milestone May 21, 2024
@xyephy
Contributor

xyephy commented May 30, 2024

Hi @GitGab19, how would I reproduce the issue?
Also, do you have any new findings that might help me solve this?

@GitGab19
Collaborator Author

GitGab19 commented Jun 3, 2024

Hi @GitGab19, how would I reproduce the issue? Also, do you have any new findings that might help me solve this?

We are working on adding an MG test which will be able to reproduce this case.
However, I already described the set of messages which leads to this issue here: #920 (comment)

@plebhash plebhash self-assigned this Jul 2, 2024
@plebhash
Collaborator

plebhash commented Jul 2, 2024

hey @xyephy what's your status on this?

we have this marked as high priority on our Project Board (milestone 1.0.2), so in case you're not able to keep going, I'll take this over

please let me know if you have already made some progress that could accelerate my efforts here

@plebhash
Collaborator

plebhash commented Jul 8, 2024

  • NewTemplate1
  • SetNewPrevHash1
  • DeclareMiningJob1 (future job)
  • a valid share1 is found and submitted by Translator to JDC
  • NewTemplate2
  • SetNewPrevHash2
  • ProvideMissingTransactions1
  • ProvideMissingTransactionsSuccess1
  • DeclareMiningJobSuccess1
  • DeclareMiningJob2 (a new future job)
  • ...
  • DeclareMiningJobSuccess2
  • SetCustomMiningJob2

@GitGab19 how is share1 found before DeclareMiningJobSuccess1?

is that share from a previous job (e.g.: 0, not 1)? or is there some mistake there?

I have the impression that you just wanted to highlight that everything is working as expected up until this point in the message flow, but I want to be sure.

@plebhash
Collaborator

plebhash commented Jul 9, 2024

if we realize this will take too much time/effort, we should consider moving it to milestone 1.0.3

@plebhash
Collaborator

Today I had a call with @rrybarczyk + @Shourya742 where we deep-dived into this issue and started drafting a visual diagram of its message flow (the diagram should be seen as a WIP and might contain inaccuracies).

We left the call feeling quite confident about our progress, but I'm revisiting our diagram a few hours later with fresh eyes, and now I realize we need to do a sanity check on some of the assumptions we made during the call.

What really stands out to me right now is that we assumed that a SetCustomMiningJob.Success is a prerequisite for JDC to send a NewExtendedMiningJob downstream. We used this assumption as the foundation for how we ordered these messages in the diagram.

However, if we read the comments in JDC's main.rs attentively, they tell a different story:

/// Main loop:
/// 1. TemplateRx: <-NewTemplate, SetNewPrevHash
/// 2. JobDeclarator: -> CommitMiningJob (JobDeclarator::on_new_template), <-CommitMiningJobSuccess
/// 3. Upstream: ->SetCustomMiningJob, Downstream: ->NewExtendedMiningJob, ->SetNewPrevHash
/// 4. Downstream: <-Share
/// 5. Upstream: ->Share
///
/// When we have a NewTemplate we send the NewExtendedMiningJob downstream and the CommitMiningJob
/// to the JDS altogether.
/// Then we receive CommitMiningJobSuccess and we use the new token to send SetCustomMiningJob to
/// the pool.
/// When we receive SetCustomMiningJobSuccess we set in Upstream job_id equal to the one received
/// in SetCustomMiningJobSuccess so that we still send shares upstream with the right job_id.
///
/// The above procedure lets us send NewExtendedMiningJob downstream right after a NewTemplate has
/// been received; this reduces the time that passes between a NewTemplate and the mining device
/// starting to mine on the new job.
///
/// In the case of a future NewTemplate, the SetCustomMiningJob is sent only if the candidate
/// becomes the actual NewTemplate, so that we do not send a lot of useless future jobs to the
/// pool. That means that SetCustomMiningJob is sent only when a NewTemplate becomes "active".
///
/// The JobDeclarator always has 2 available tokens, which means that whenever a token is used to
/// commit a job with upstream we request a new one. Always having a token when needed means that
/// whenever we want to commit a mining job we can do that without waiting for upstream to
/// provide a new token.
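The two-token invariant described in that last paragraph can be sketched like this (illustrative only, not the actual JobDeclarator implementation): the pool of tokens is pre-filled with two entries, and every time one is consumed a replacement is requested immediately, so declaring a job never waits on the JDS round-trip.

```rust
use std::collections::VecDeque;

// Sketch of the "always 2 available tokens" idea (illustrative; the real
// JobDeclarator fetches replacement tokens from the JDS asynchronously,
// while here they are synthesized locally).
struct TokenPool {
    tokens: VecDeque<u32>,
    next: u32,
}

impl TokenPool {
    fn new() -> Self {
        // Start with two pre-fetched tokens so a job can always be declared
        // without waiting for a JDS round-trip.
        Self { tokens: VecDeque::from([0, 1]), next: 2 }
    }

    fn take(&mut self) -> u32 {
        let t = self.tokens.pop_front().expect("pool is kept non-empty");
        // Immediately request a replacement so the invariant is restored.
        self.tokens.push_back(self.next);
        self.next += 1;
        t
    }
}

fn main() {
    let mut pool = TokenPool::new();
    for _ in 0..5 {
        pool.take();
        // Invariant: two tokens remain available after every declaration.
        assert_eq!(pool.tokens.len(), 2);
    }
    println!("a token is always available when a job must be declared");
}
```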

This is also connected to the original issue description by @GitGab19 when he said:

The reason is that for that specific job, a SetCustomMiningJob was not sent (because in the meantime a new prev_hash and a new template has arrived)

This suggests that the root cause is related to SetCustomMiningJob never being sent, while during our call we believed that the root cause was a delayed SetCustomMiningJob.Success response from the pool.

So our assumption that a SetCustomMiningJob.Success is a prerequisite for JDC to send a NewExtendedMiningJob downstream is not true.

And I don't know if this is a feature or a bug.

We will definitely need more calls to deep dive into this issue and refine the visual diagram before we even start writing MG tests or thinking about a long-term bugfix.

And we should definitely move this issue away from 1.0.2, because we will not be able to finish this in the immediate future (cc @pavlenex ).


Note:

SV2 message ordering is a very dense topic. We are dealing with scenarios of distributed computing happening across 5 different entities, each doing fundamentally different things.

It is strategically important to have frequent collective exercises where we go through the message flow while regularly doing sanity checks on each other.

These exercises should allow SRI contributors to get a solid foundation on SV2 message flows and also spread/decentralize this knowledge across the community.

@Fi3
Collaborator

Fi3 commented Jul 12, 2024

This is definitely a feature: you want to start mining on a job as soon as you know the job. You assume that the pool is going to accept it (which is what should always happen if everything works). See also my comment here: stratum-mining/sv2-spec#80 (reply in thread)

@plebhash plebhash removed this from the 1.0.2 milestone Jul 15, 2024
@GitGab19
Collaborator Author

  • NewTemplate1
  • SetNewPrevHash1
  • DeclareMiningJob1 (future job)
  • a valid share1 is found and submitted by Translator to JDC
  • NewTemplate2
  • SetNewPrevHash2
  • ProvideMissingTransactions1
  • ProvideMissingTransactionsSuccess1
  • DeclareMiningJobSuccess1
  • DeclareMiningJob2 (a new future job)
  • ...
  • DeclareMiningJobSuccess2
  • SetCustomMiningJob2

@GitGab19 how is share1 found before DeclareMiningJobSuccess1?

is that share from a previous job (e.g.: 0, not 1)? or is there some mistake there?

I have the impression that you just wanted to highlight that everything is working as expected up until this point in the message flow, but I want to be sure.

We already discussed this during the last "diagramming" call, but I'm writing it here just for the record.
Share1 is found at that point in the flow because, as @Fi3 said, the miner starts mining on a new job as soon as it is created by JDC, and in the meantime the whole JD process is executed.

I wanted to highlight that because, since the job that share1 belongs to was never communicated to the pool (through SetCustomMiningJob1), everything starts to break, as described in the issue description.

@pavlenex pavlenex added this to the 1.0.3 milestone Jul 16, 2024
@plebhash plebhash removed their assignment Aug 19, 2024
@plebhash plebhash removed this from the 1.0.3 milestone Aug 20, 2024
xyephy added a commit to xyephy/stratum that referenced this issue Aug 26, 2024
This PR addresses issue stratum-mining#920, where JDC stops receiving valid shares from the Translator when two new blocks are found within a few seconds. The problem occurs due to a race condition in the handling of SetCustomMiningJob messages, which leads to the JDC not registering job IDs properly.