Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Improve error handling and reconciliation logic for compaction controller #698

Closed
shreyas-s-rao opened this issue Oct 4, 2023 · 3 comments
Assignees
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related exp/beginner Issue that requires only basic skills kind/bug Bug status/closed Issue is closed (either delivered or triaged)
Milestone

Comments

@shreyas-s-rao
Copy link
Contributor

Describe the bug:
Compaction controller's purpose is to deploy a compaction job if it detects that a compaction job is necessary, due to possible triggers such as delta snapshot revision exceeding full snapshot revision by a given threshold (etcd-events-threshold), and in the future even upon the total delta snapshots' size crossing a certain threshold. Information about snapshot revisions is available today from snapshot leases maintained by the leading backup sidecar, and in the future will be published via EtcdMember resource status.

Once the compaction job has been deployed, the controller requeues reconciliation. Until the compaction job succeeds or fails, the reconciliation must keep requeueing and checking compaction job status. Once compaction job succeeds/fails, the snapshot revision information will be updated by the compaction job itself, and compaction controller will know not to deploy a new compaction job until the next time a snapshot compaction trigger is received.

All the logic for reconciling the compaction job lies within reconcileJob(). The issue with the current logic in compaction controller is that this method is called only when the compaction threshold is reached. But if a new full (regular scheduled) snapshot is uploaded by the leading backup sidecar while a compaction job is still running, then compaction controller never runs reconcileJob() because the snapshot revision threshold is no longer reached. This prevents the controller from reconciling the currently running compaction job to either cut it short, or atleast wait for completion and export the correct metrics for it.

Additionally, there are certain helper functions such as getCompactionJobVolumeMounts(), getCompactionJobVolumes(), getCompactionJobEnvVar(), which are used to generate the compaction job spec. All these functions make a call utils.StorageProviderFromInfraProvider(etcd.Spec.Backup.Store.Provider), which returns the storage provider for the compaction job, and returns an error if the provider is unrecognised. But these functions do not correctly handle the error, and simply log the error and return an incomplete result which will cause the compaction job pod to fail to start due to a misconfigured job spec. Ideally, the call to utils.StorageProviderFromInfraProvider() must be made and error should be correctly handled within Reconcile() method itself, even before trying to generate the job spec. Additionally, the store nil check being made initially in Reconcile() is also being made redundantly in all the helper functions, and must be removed.

There are other instances in the compaction controller where errors are not handled correctly, and simply logged, rather than blocking the deployment of the compaction job, for instance

if storeValues.SecretRef == nil {
logger.Info("No secretRef is configured for backup store. Compaction job will fail as no storage could be configured.",
"namespace", etcd.Namespace, "name", etcd.Name)
return vs
. These too need to be correctly handled.

Expected behavior:
Reconciliation should be precise, and must handle all cases of snapshots being taken by leading backup sidecar and compaction job. Errors must be correctly handled.

How To Reproduce (as minimally and precisely as possible):

Logs:

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-druid version/commit ID :
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:

@shreyas-s-rao shreyas-s-rao added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related exp/beginner Issue that requires only basic skills kind/bug Bug labels Oct 4, 2023
@shreyas-s-rao
Copy link
Contributor Author

/assign @abdasgupta

@abdasgupta
Copy link
Contributor

Already working on this

@shreyas-s-rao
Copy link
Contributor Author

Resolved by #711
/close

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 19, 2023
@shreyas-s-rao shreyas-s-rao added this to the v0.21.0 milestone Nov 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related exp/beginner Issue that requires only basic skills kind/bug Bug status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

No branches or pull requests

3 participants