Adapt ctclone for 'coerced get-entries' in Let's Encrypt and other CT Logs #1055

Merged 4 commits into google:master on Mar 20, 2024

Conversation

tired-engineer (Contributor):

Adapt ctclone to handle the 'coerced get-entries' behaviour by first issuing a single request and checking the number of returned results, then either continuing with the remaining workers, or stopping fetching, storing the incomplete tile, and exiting, delegating to the orchestration system to start the process again.

Fixes #1044
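
A rough sketch of that flow, assuming the updated Batch signature from this PR; probeOnce, storeTile and the surrounding orchestration are illustrative names, not the actual ctclone code:

func probeOnce(cf ctFetcher, start uint64, batchSize uint, storeTile func([][]byte) error) (bool, error) {
  // Fetch a single batch before unleashing the parallel workers.
  leaves := make([][]byte, batchSize)
  n, err := cf.Batch(start, leaves)
  if err != nil {
    return false, err
  }
  if n == uint64(batchSize) {
    // The log honoured the request: carry on with the remaining workers.
    return true, nil
  }
  // Short response: store the incomplete tile and stop, leaving the
  // orchestration system to restart the process from an aligned position.
  if err := storeTile(leaves[:n]); err != nil {
    return false, err
  }
  return false, nil
}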

@@ -108,24 +108,24 @@ type ctFetcher struct {
// Batch provides a mechanism to fetch a range of leaves.
// Enough leaves are fetched to fully fill `leaves`, or an error is returned.
// This implements batch.BatchFetch.
-func (cf ctFetcher) Batch(start uint64, leaves [][]byte) error {
+func (cf ctFetcher) Batch(start uint64, leaves [][]byte) (uint64, error) {

Member:
@mhutchinson what's the reason the Batch function signature takes an "OUT" parameter rather than being something like f(start, N uint64) ([][]byte, error) and just returning what was available?

It feels like the leaves param is kinda doing a slightly unusual "double duty" here: informing the implementation of the max to fetch as well as being the container to put them in, which then leads to having to special case "short" reads.
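
For comparison, a hedged sketch of the two shapes under discussion; the current signature is taken from the diff above, while the alternative (here called leafFetcher.Fetch) is hypothetical:

// Current contract: `leaves` both caps the fetch size and receives the data,
// so a short read has to be special-cased by the caller.
//   func (cf ctFetcher) Batch(start uint64, leaves [][]byte) (uint64, error)

// Suggested alternative (hypothetical): ask for up to n leaves and return
// whatever was available, letting the caller advance by len(result).
type leafFetcher interface {
  Fetch(start, n uint64) ([][]byte, error)
}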

Contributor:
I can't remember if this was a performance thing, or an attempt to make this look something like io.Reader. I don't think the original contract was necessarily bad, nor is this change (which actually brings it closer to io.Reader), nor your suggestion.

Leaf: nil,
Err: r.err,

if i == 0 && !alignmentDone && r.err == nil {

Member:
I think this may end up being a bit brittle, too - if the source log changes the number of entries it returns (e.g. due to administrative config changes, or some future log implementation does something like returning as many leaves as it has managed to read within some [shortish] deadline) then it'll break again.

If workers were given larger, non-overlapping ranges to chew on, they could each do something like:

end := start + N
for start < end {
  r, err := Batch(start, end)
  // handle err, send r... to channel
  start += uint64(len(r))
}

which should adapt to whatever number of entries the log returns.
The layer above can consume from the worker channels in order (i.e. drain the channel from the lowest-range worker before moving on to the next); when a worker has finished, a new one could be kicked off with the next unassigned range.
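
A sketch of that in-order draining; drainInOrder, forward and the shape of workerResult are illustrative stand-ins, loosely based on names that appear elsewhere in this PR:

func drainInOrder(rangeChans []chan workerResult, forward func(workerResult)) error {
  for _, ch := range rangeChans {
    // Drain the lowest-range worker completely before moving on, so leaves
    // are forwarded to the cloner in order.
    for r := range ch {
      if r.err != nil {
        return r.err
      }
      forward(r)
    }
    // Once a worker's channel closes, a new worker could be kicked off here
    // with the next unassigned range.
  }
  return nil
}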

Contributor:
Yeah, this is a known brittleness, but a pragmatic step forward to fix the immediate issue. See the discussion in #1044 about how running this in something like Docker would restart the job and hopefully lead to it snapping into alignment. The current code bakes in the batch size quite early on, and this change is only intended to fix a specific observed issue, not all conceivable alignment issues.

If we want this to be fully general in the face of logs that regularly don't respect request parameters then we need to take a more comprehensive step back and rewrite.

@@ -48,25 +47,34 @@ func Bulk(ctx context.Context, first, treeSize uint64, batchFetch BatchFetch, wo
rangeChans := make([]chan workerResult, workers)

increment := workers * batchSize

waitCh := make(chan struct{})

Contributor:
Instead of creating this channel and then doing the alignment inside the concurrency below, I imagined that this change would do the alignment in the main thread here and, once it had achieved alignment, kick off the concurrency by continuing on with an updated start. Is this possible?

I have to jump to meetings now, but LMK if this doesn't make sense.

tired-engineer (Author):
No worries, there is definitely no rush.
What you describe does make sense, and that's what I tried to do initially, but it requires more changes: either make fetchWorker.run() return results in a blocking way, or duplicate the main loop internals to handle backoffs and send results to the cloner while properly maintaining the order.
The current implementation is rather "minimalistic", although it might be a bit less obvious.

@mhutchinson (Contributor):
Hi @tired-engineer, I've finally emerged from the deep rabbit hole of "other stuff" that I was context loaded on, and have given this a bit of time. Apologies for the delay. What do you think about 6f6ea60 ?

This unwinds the logic for making the first goroutine special (and the other goroutines knowing about it). Instead, the main thread does the work. After making this change, it was only a small amount more code to make it self-heal, so it now tries the alignment twice if the first attempt doesn't get a full batch.
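
A minimal sketch of that main-thread alignment, assuming the updated Batch contract from this PR; alignStart and its exact signature are illustrative, not the code in 6f6ea60:

func alignStart(cf ctFetcher, start uint64, batchSize uint) (uint64, bool, error) {
  // Try the alignment fetch at most twice, so a single coerced response
  // self-heals without crashing the process.
  for attempt := 0; attempt < 2; attempt++ {
    leaves := make([][]byte, batchSize)
    n, err := cf.Batch(start, leaves)
    if err != nil {
      return start, false, err
    }
    // (the partial batch would be persisted here before moving on)
    start += n
    if n == uint64(batchSize) {
      // Full batch: we are aligned and the parallel workers can be
      // kicked off from the updated start.
      return start, true, nil
    }
  }
  // Still short after two attempts: report that alignment failed and let
  // the caller decide whether to bail out.
  return start, false, nil
}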

@tired-engineer (Author):
Sorry @mhutchinson, for such a long delay. I'd say it's good (at least for me).

@mhutchinson (Contributor):
> Sorry @mhutchinson, for such a long delay. I'd say it's good (at least for me).

No worries, thanks for coming back! If you want to take credit for your work on this then please:

  1. Cherry-pick my changes into this PR
  2. Rebase
  3. Mark the PR as ready for review

We'll get this merged :-)

If I've not heard from you by the end of the month then I'll do this myself and we'll give you credit via the commit message and changelog.

tired-engineer and others added 4 commits March 19, 2024 14:19
… implementation of various behaviours on incomplete batches.
This keeps the logic cleaner inside the goroutines doing the parallel downloads, and means that a one-off alignment fix can be realized without crashing the process. The clone tool will still crash if alignment is needed after the initial batch or two, but that's outside the scope of this change.
@tired-engineer tired-engineer marked this pull request as ready for review March 19, 2024 13:20
@tired-engineer tired-engineer requested a review from a team as a code owner March 19, 2024 13:20
@tired-engineer tired-engineer requested review from AlCutter and removed request for a team March 19, 2024 13:20
@mhutchinson mhutchinson merged commit 8008fd7 into google:master Mar 20, 2024
4 checks passed

@mhutchinson (Contributor):
@tired-engineer thanks for your work on this!


Successfully merging this pull request may close these issues: ctclone Fails with Incomplete CT Log Responses