Adapt ctclone for 'coerced get-entries' in Let's Encrypt and other CT Logs #1055

Merged 4 commits into google:master on Mar 20, 2024

Conversation

tired-engineer (Contributor):

Adapt ctclone to handle the 'coerced get-entries' behaviour by first issuing a single request and checking the number of returned results, then either continuing with the remaining workers, or stopping fetching, storing the incomplete tile, and exiting, delegating to the orchestration system to start the process again.

Fixes #1044
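
A rough sketch of that flow, assuming the updated Batch signature from this PR; probeOnce, storeTile and the surrounding orchestration are illustrative names, not the actual ctclone code:

func probeOnce(cf ctFetcher, start uint64, batchSize uint, storeTile func([][]byte) error) (bool, error) {
  // Fetch a single batch before unleashing the parallel workers.
  leaves := make([][]byte, batchSize)
  n, err := cf.Batch(start, leaves)
  if err != nil {
    return false, err
  }
  if n == uint64(batchSize) {
    // The log honoured the request: carry on with the remaining workers.
    return true, nil
  }
  // Short response: store the incomplete tile and stop, leaving the
  // orchestration system to restart the process from an aligned position.
  if err := storeTile(leaves[:n]); err != nil {
    return false, err
  }
  return false, nil
}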

@@ -108,24 +108,24 @@ type ctFetcher struct {
// Batch provides a mechanism to fetch a range of leaves.
// Enough leaves are fetched to fully fill `leaves`, or an error is returned.
// This implements batch.BatchFetch.
-func (cf ctFetcher) Batch(start uint64, leaves [][]byte) error {
+func (cf ctFetcher) Batch(start uint64, leaves [][]byte) (uint64, error) {

Member:
@mhutchinson what's the reason the Batch function signature takes an "OUT" parameter rather than being something like f(start, N uint64) ([][]byte, error) and just returning what was available?

It feels like the leaves param is kinda doing a slightly unusual "double duty" here: informing the implementation of the max to fetch as well as being the container to put them in, which then leads to having to special case "short" reads.
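
For comparison, a hedged sketch of the two shapes under discussion; the current signature is taken from the diff above, while the alternative (here called leafFetcher.Fetch) is hypothetical:

// Current contract: `leaves` both caps the fetch size and receives the data,
// so a short read has to be special-cased by the caller.
//   func (cf ctFetcher) Batch(start uint64, leaves [][]byte) (uint64, error)

// Suggested alternative (hypothetical): ask for up to n leaves and return
// whatever was available, letting the caller advance by len(result).
type leafFetcher interface {
  Fetch(start, n uint64) ([][]byte, error)
}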

Contributor:
I can't remember if this was a performance thing, or an attempt to make this look something like io.Reader. I don't think the original contract was necessarily bad, nor is this change (which actually brings it closer to io.Reader), nor your suggestion.

Leaf: nil,
Err: r.err,

if i == 0 && !alignmentDone && r.err == nil {

Member:
I think this may end up being a bit brittle, too - if the source log changes the number of entries it returns (e.g. due to administrative config changes, or some future log implementation does something like returning as many leaves as it has managed to read within some [shortish] deadline) then it'll break again.

If workers were given larger, non-overlapping ranges to chew on, they could each do something like:

end := start + N
for start < end {
  r, err := Batch(start, end)
  // handle err, send r... to channel
  start += uint64(len(r))
}

which should adapt to whatever number of entries the log returns.
The layer above can consume from the worker channels in order (i.e. drain the channel from the lowest-range worker before moving on to the next); when a worker has finished, a new one could be kicked off with the next unassigned range.
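
A sketch of that in-order draining; drainInOrder, forward and the shape of workerResult are illustrative stand-ins, loosely based on names that appear elsewhere in this PR:

func drainInOrder(rangeChans []chan workerResult, forward func(workerResult)) error {
  for _, ch := range rangeChans {
    // Drain the lowest-range worker completely before moving on, so leaves
    // are forwarded to the cloner in order.
    for r := range ch {
      if r.err != nil {
        return r.err
      }
      forward(r)
    }
    // Once a worker's channel closes, a new worker could be kicked off here
    // with the next unassigned range.
  }
  return nil
}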

Contributor:
Yeah, this is a known brittleness, but a pragmatic step forward to fix the immediate issue. See the discussion in #1044 about how running this in something like Docker would restart the job and hopefully lead to it snapping into alignment. The current code bakes in the batch size quite early on, and this change is only intended to fix a specific observed issue, not all conceivable alignment issues.

If we want this to be fully general in the face of logs that regularly don't respect request parameters then we need to take a more comprehensive step back and rewrite.

@@ -48,25 +47,34 @@ func Bulk(ctx context.Context, first, treeSize uint64, batchFetch BatchFetch, wo
rangeChans := make([]chan workerResult, workers)

increment := workers * batchSize

waitCh := make(chan struct{})

Contributor:
Instead of creating this channel and then doing the alignment inside the concurrency below, I imagined that this change would do the alignment in the main thread here and, once it had achieved alignment, kick off the concurrency by continuing on with an updated start. Is this possible?

I have to jump to meetings now, but LMK if this doesn't make sense.

tired-engineer (Author):
No worries, there is definitely no rush.
What you describe does make sense, and that's what I tried to do initially, but it requires more changes: either make fetchWorker.run() return results in a blocking way, or duplicate the main loop internals to handle backoffs and send results to the cloner while properly maintaining the order.
The current implementation is rather "minimalistic", although it might be a bit less obvious.

@mhutchinson (Contributor):
Hi @tired-engineer, I've finally emerged from the deep rabbit hole of "other stuff" that I was context loaded on, and have given this a bit of time. Apologies for the delay. What do you think about 6f6ea60 ?

This unwinds the logic for making the first goroutine special (and the other goroutines knowing about it). Instead, the main thread does the work. After making this change, it was only a small amount more code to make it self-heal, so it now tries the alignment twice if the first attempt doesn't get a full batch.
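
A minimal sketch of that main-thread alignment, assuming the updated Batch contract from this PR; alignStart and its exact signature are illustrative, not the code in 6f6ea60:

func alignStart(cf ctFetcher, start uint64, batchSize uint) (uint64, bool, error) {
  // Try the alignment fetch at most twice, so a single coerced response
  // self-heals without crashing the process.
  for attempt := 0; attempt < 2; attempt++ {
    leaves := make([][]byte, batchSize)
    n, err := cf.Batch(start, leaves)
    if err != nil {
      return start, false, err
    }
    // (the partial batch would be persisted here before moving on)
    start += n
    if n == uint64(batchSize) {
      // Full batch: we are aligned and the parallel workers can be
      // kicked off from the updated start.
      return start, true, nil
    }
  }
  // Still short after two attempts: report that alignment failed and let
  // the caller decide whether to bail out.
  return start, false, nil
}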

@tired-engineer (Author):
Sorry @mhutchinson, for such a long delay. I'd say it's good (at least for me).

@mhutchinson (Contributor):
> Sorry @mhutchinson, for such a long delay. I'd say it's good (at least for me).

No worries, thanks for coming back! If you want to take credit for your work on this then please:

  1. Cherry-pick my changes into this PR
  2. Rebase
  3. Mark the PR as ready for review

We'll get this merged :-)

If I've not heard from you by the end of the month then I'll do this myself and we'll give you credit via the commit message and changelog.

tired-engineer and others added 4 commits March 19, 2024 14:19
… implementation of various behaviours on incomplete batches.
This keeps the logic cleaner inside the goroutines doing the parallel downloads, and means that a one-off alignment fix can be realized without crashing the process. The clone tool will still crash if alignment is needed after the initial batch or two, but that's outside the scope of this change.
@tired-engineer tired-engineer marked this pull request as ready for review March 19, 2024 13:20
@tired-engineer tired-engineer requested a review from a team as a code owner March 19, 2024 13:20
@tired-engineer tired-engineer requested review from AlCutter and removed request for a team March 19, 2024 13:20
@mhutchinson mhutchinson merged commit 8008fd7 into google:master Mar 20, 2024
4 checks passed

@mhutchinson (Contributor):
@tired-engineer thanks for your work on this!


Successfully merging this pull request may close these issues: ctclone Fails with Incomplete CT Log Responses