Adapt ctclone for 'coerced get-entries' in Let's Encrypt and other CT Logs #1055
Conversation
```diff
@@ -108,24 +108,24 @@ type ctFetcher struct {
 // Batch provides a mechanism to fetch a range of leaves.
 // Enough leaves are fetched to fully fill `leaves`, or an error is returned.
 // This implements batch.BatchFetch.
-func (cf ctFetcher) Batch(start uint64, leaves [][]byte) error {
+func (cf ctFetcher) Batch(start uint64, leaves [][]byte) (uint64, error) {
```
@mhutchinson what's the reason the `Batch` function signature takes an "OUT" parameter rather than being something like `f(start, N uint64) ([][]byte, error)` and just returning what was available?
It feels like the `leaves` param is kinda doing a slightly unusual "double duty" here: informing the implementation of the max to fetch as well as being the container to put them in, which then leads to having to special case "short" reads.
I can't remember if this was a performance thing, or trying to make this look something like `io.Reader`. I don't think the original contract was necessarily bad, nor is this change (which actually brings it closer to `io.Reader`), nor your suggestion.
clone/internal/download/batch.go (Outdated)

```go
Leaf: nil,
Err:  r.err,

if i == 0 && !alignmentDone && r.err == nil {
```
I think this may end up being a bit brittle, too - if the source log changes the number of entries it returns (e.g. due to administrative config changes, or some future log implementation does something like returning as many leaves as it has managed to read within some [shortish] deadline) then it'll break again.
If workers were given larger, non-overlapping, ranges to chew on, they could each do something like:
```go
end := start + N // this worker's assigned, non-overlapping range
for start < end {
	r, err := Batch(start, end)
	// handle err, send r... to channel
	start += uint64(len(r))
}
```
which should adapt to whatever number of entries the log returns.
The layer above can consume from worker channels in order (i.e. drain the channel from the lowest range worker before moving on to the next). When a worker has finished, a new one could be kicked off with the next unassigned larger range.
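A runnable toy version of that pattern, assuming a log that caps responses at 40 entries per call; the names, numbers, and structure here are illustrative only, not ctclone's actual code:

```go
package main

import "fmt"

// fetchFn stands in for a call to the log's get-entries endpoint; it may
// return fewer leaves than the range asked for, which the worker tolerates.
type fetchFn func(start, end uint64) ([][]byte, error)

// worker downloads the half-open range [start, end) and streams whatever
// the log returns, in order, on its own channel.
func worker(start, end uint64, fetch fetchFn, out chan<- [][]byte) {
	defer close(out)
	for start < end {
		leaves, err := fetch(start, end)
		if err != nil || len(leaves) == 0 {
			return // a real implementation would back off and retry here
		}
		out <- leaves
		start += uint64(len(leaves))
	}
}

func main() {
	const (
		treeSize  = uint64(1000)
		rangeSize = uint64(250) // each worker owns a large, non-overlapping range
	)

	// Toy log that returns at most 40 leaves per call regardless of how
	// many were requested, mimicking "coerced get-entries" behaviour.
	fetch := func(start, end uint64) ([][]byte, error) {
		n := end - start
		if n > 40 {
			n = 40
		}
		batch := make([][]byte, n)
		for i := range batch {
			batch[i] = []byte(fmt.Sprintf("leaf-%d", start+uint64(i)))
		}
		return batch, nil
	}

	// Hand out non-overlapping ranges; drain each worker's channel fully
	// before moving to the next so leaves are consumed strictly in order.
	// A real implementation would cap how many workers run at once and
	// kick off a new one with the next unassigned range as each finishes.
	var chans []chan [][]byte
	for start := uint64(0); start < treeSize; start += rangeSize {
		end := start + rangeSize
		if end > treeSize {
			end = treeSize
		}
		ch := make(chan [][]byte, 1)
		chans = append(chans, ch)
		go worker(start, end, fetch, ch)
	}

	total := 0
	for _, ch := range chans {
		for leaves := range ch {
			total += len(leaves)
		}
	}
	fmt.Println("downloaded", total, "leaves") // downloaded 1000 leaves
}
```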
Yeah, this is a known brittleness, but a pragmatic step forward to fix the immediate issue. See discussion in #1044 about how running this in something like docker would restart the job and hopefully lead to snapping into alignment. The current code bakes in the batch size quite early on, and this is only intended to fix a specific observed issue and not all conceivable alignment issues.
If we want this to be fully general in the face of logs that regularly don't respect request parameters then we need to take a more comprehensive step back and rewrite.
clone/internal/download/batch.go (Outdated)

```go
@@ -48,25 +47,34 @@ func Bulk(ctx context.Context, first, treeSize uint64, batchFetch BatchFetch, wo
	rangeChans := make([]chan workerResult, workers)

	increment := workers * batchSize

	waitCh := make(chan struct{})
```
Instead of creating this channel and then doing alignment inside the concurrency below, I imagined that this change would do the alignment in the main thread here, and once it had got alignment, would then kick off the concurrency by continuing on with an updated `start`. Is this possible?
I have to jump to meetings now, but LMK if this doesn't make sense.
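A rough sketch of that "align first, then fan out" idea; the helper names (`alignOnce`, `fetchCount`) are assumptions for illustration, not part of the real code:

```go
package sketch

import "fmt"

// fetchCount asks the log for up to want leaves starting at start and
// reports how many it actually returned; a coerced log may return fewer.
type fetchCount func(start, want uint64) (got uint64, err error)

// alignOnce does a single fetch on the calling goroutine so the
// downloader can observe the log's effective batch size before any
// workers are started. It returns the index the workers should continue
// from and the batch size the log actually honoured.
func alignOnce(start, batchSize uint64, fetch fetchCount) (nextStart, effectiveBatch uint64, err error) {
	got, err := fetch(start, batchSize)
	if err != nil {
		return 0, 0, fmt.Errorf("alignment fetch at %d: %w", start, err)
	}
	if got == 0 {
		return 0, 0, fmt.Errorf("log returned no entries at %d", start)
	}
	return start + got, got, nil
}
```

Under this shape, `Bulk` would call the helper once (or a bounded number of times, if the first response is itself short) on the main goroutine, and only then launch the concurrent workers from `nextStart` using `effectiveBatch`.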
No worries, there is definitely no rush.
It does make sense, what you describe, and that's what I tried to do initially, but it requires more changes: either make `fetchWorker.run()` return results in a blocking way, or duplicate the main loop internals to handle backoffs and send results to the cloner while properly maintaining the order.
The current implementation is rather "minimalistic", although it might be a bit less obvious.
Hi @tired-engineer, I've finally emerged from the deep rabbit hole of "other stuff" that I was context loaded on, and have given this a bit of time. Apologies for the delay. What do you think about 6f6ea60? This unwinds the logic for making the first goroutine special (and the other goroutines knowing it). Instead, the main thread does the work. After doing this change, it was only a small amount more code to make it self-heal, so it actually tries the alignment twice if the first alignment doesn't get a full batch.
Sorry @mhutchinson for such a long delay. I'd say it's good (at least for me).
No worries, thanks for coming back! If you want to take credit for your work on this then please:
We'll get this merged :-) If I've not heard from you by the end of the month then I'll do this myself and we'll give you credit via the commit message and changelog.
… implementation of various behaviours on incomplete batches.
This keeps the logic cleaner inside the goroutines doing the parallel downloads, and means that a one-off alignment fix can be realized without crashing the process. The clone tool will still crash if alignment is needed after the initial batch or two, but that's outside the scope of this change.
@tired-engineer thanks for your work on this!
Adapt ctclone to handle the 'coerced get-entries' feature by: making just one request first, checking the number of returned results, and then either continuing with the rest of the workers, or stopping fetching, storing the incomplete tile, and exiting, delegating to the orchestration system to start the process again.
Fixes #1044
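For illustration only, a sketch of the decision described in that summary; the function names and hooks below are assumptions rather than ctclone's real internals:

```go
package sketch

import (
	"errors"
	"fmt"
)

// errShortBatch signals that the log coerced our request down, so the
// process should persist what it has and exit for a clean restart.
var errShortBatch = errors.New("log returned a short batch; exiting to realign")

// handleFirstBatch makes a single request, checks how many entries came
// back, and either lets the remaining workers continue or stores the
// incomplete tile and exits, leaving the orchestration system to restart
// the clone job aligned with the log's effective page size.
func handleFirstBatch(
	start, batchSize uint64,
	fetch func(start, want uint64) (uint64, error),
	startWorkers func(nextStart, batchSize uint64),
	storePartial func(start, got uint64),
) error {
	got, err := fetch(start, batchSize)
	if err != nil {
		return fmt.Errorf("first fetch at %d: %w", start, err)
	}
	if got == batchSize {
		// Full batch: the log honoured the requested size.
		startWorkers(start+got, batchSize)
		return nil
	}
	// Coerced get-entries: store the incomplete tile and stop.
	storePartial(start, got)
	return errShortBatch
}
```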