Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix building unified delta WAL, unified delta CRLs #20058

Merged
merged 9 commits into from
Apr 11, 2023

Conversation

cipherboy
Copy link
Contributor

Built on top of #20057; will be rebased once that merges.


Perhaps best to review this commit by commit as it should make most sense that way. If short summaries are nice:

  1. We only considered certificates from the current primary cluster when building the unified delta CRL.
  2. We only considered last-revoked serials from the current primary cluster when deciding to build the unified delta CRL.
  3. We only wrote out last-revoked serials to the current primary cluster's last-built entry, meaning if 2 wasn't also a bug, we could potentially rebuild more frequently than necessary (without new revocations).
  4. When doing the build, the last read serial wasn't the current primary cluster's cross-cluster last revoked serial, but instead its local delta WAL last revoked serial. These entries were effectively the same (since the cross-cluster storage writer for the primary cluster should've just been a local write), so this only mattered post 2 being fixed.
  5. Given that there were PBPWF failures in the writer, and given the change in Log, don't err, on unified delta WAL write failure #20057, we need logic similar to the existing full revocation entries for copying delta WAL entries cross-cluster if PBPWF write failed for some reason.
  6. Cleanup commit to switch a log to an error message.
  7. Add warnings to the revocation handler when PBPWF writing fails.
  8. Only attempt writing the unified delta WAL entry if the full entry succeeded.

But due to an obvious failure in the PBPWF writer, we weren't going to see any of these bugs.

@cipherboy cipherboy added bug Used to indicate a potential bug secret/pki backport/1.13.x labels Apr 10, 2023
@cipherboy cipherboy added this to the 1.14 milestone Apr 10, 2023
@cipherboy cipherboy requested review from kitography, stevendpclark and a team April 10, 2023 13:12
@cipherboy cipherboy force-pushed the cipherboy-fix-building-delta-wal branch 2 times, most recently from 9515ce8 to 6c36430 Compare April 11, 2023 13:39
return fmt.Errorf("unable to create cross-cluster unified last delta CRL WAL entry: %w", err)
}
if err = sc.Storage.Put(sc.Context, lastWALEntry); err != nil {
return fmt.Errorf("error saving cross-cluster unified last delta CRL WAL entry: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this Put operation could roll us back to an older wal entry if newer ones are coming in while we are recovering from a previous disconnect. Not sure if re-reading the local values before putting to see if they are the same would be a good enough test to avoid the race?

Copy link
Contributor Author

@cipherboy cipherboy Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevendpclark From our call, I've pushed the change identified about running periodic when the listing was different. In this case, this Put call on the old entry could at worse cause us to run again (if we encounter an error) and force more than one delta CRL rebuild (of otherwise identical delta CRLs). We have this scenario in a few other places, so I think its fine to leave this as-is, without the re-read.

In the event of issues with replication, most users will likely want to issue a full CRL rebuild on all clusters (including unified and local CRLs).

When building the unified delta CRL, WAL entries from the non-primary
cluster were ignored. This resulted in an incomplete delta CRL,
preventing some entries from appearing.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
When deciding if the Unified Delta CRL should be rebuilt, we need to
check the status of all clusters and their last revoked serial numbers.
If any new serial has been revoked on any cluster, we should rebuild the
unified delta CRLs.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
When building the unified CRL, we need to read the last seen serial
number from all clusters, not just the present cluster, and write it
to the last built serial for that cluster's unified delta WAL entry.
This prevents us from continuously rebuilding unified CRLs now that we
have fixed our rebuild heuristic.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
getLastWALSerial ignored its path argument, preventing it from reading
the specified cluster-specific WAL entry. On the primary cluster, this
was mostly equivalent, but now that we're correctly reading WAL entries
and revocations for other clusters, we need to handle reading these
entries correctly.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
@cipherboy cipherboy force-pushed the cipherboy-fix-building-delta-wal branch from 6c36430 to 74b40f9 Compare April 11, 2023 14:40
Any local delta WAL should be persisted to unified delta WAL space as
well. If such unified persistence fails, we need to ensure that they get
eventually moved up, otherwise they'll remain missing until the next
full CRL rebuild occurs, which might be significantly longer than when
the next delta CRL rebuild would otherwise occur. runUnifiedTransfer
already handles this for us, but it lacked logic for delta WAL serials.

The only interesting catch here is that we refuse to copy any entries
whose full unified revocation entry has not also been written.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
This message is mostly an error and would always be helpful information
to have when troubleshooting failures.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
When revoking certificates, we log cross-cluster revocation failures,
but we should really expose this information to the caller, that their
local revocation was successful, but their cross-cluster revocation
failed.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
Delta WAL entries are empty files whose only information (a revoked
serial number) is contained in the file path. These depend implicitly on
a full revocation entry existing for this file (whether a cross-cluster
unified entry or a local entry).

We should not write unified delta WAL entries without the corresponding
full unified revocation entry existing. Add a warning in this case.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
@cipherboy cipherboy force-pushed the cipherboy-fix-building-delta-wal branch from 74b40f9 to 09d4c70 Compare April 11, 2023 17:36
@cipherboy cipherboy enabled auto-merge (squash) April 11, 2023 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug secret/pki
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants