[BUG] Improve error handling and reconciliation logic for compaction controller #698
Labels: area/quality, exp/beginner, kind/bug, status/closed
Describe the bug:
The compaction controller's purpose is to deploy a compaction job when it detects that one is necessary. Possible triggers include the delta snapshot revision exceeding the full snapshot revision by a given threshold (`etcd-events-threshold`) and, in the future, the total size of the delta snapshots crossing a certain threshold. Information about snapshot revisions is available today from the snapshot leases maintained by the leading backup sidecar, and in the future will be published via the EtcdMember resource status.

Once the compaction job has been deployed, the controller requeues reconciliation. Until the compaction job succeeds or fails, the reconciliation must keep requeueing and checking the compaction job status. Once the compaction job succeeds or fails, the snapshot revision information will be updated by the compaction job itself, and the compaction controller will know not to deploy a new compaction job until the next snapshot compaction trigger is received.
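As a minimal sketch of the revision-based trigger described above: the function name `compactionNeeded` and its parameters are hypothetical, and treating "exceeds by the threshold" as `>=` is an assumption rather than something confirmed by the controller code.

```go
package main

import "fmt"

// compactionNeeded reports whether the delta snapshot revision has moved
// ahead of the full snapshot revision by at least eventsThreshold.
// Hypothetical helper: the real controller reads both revisions from the
// snapshot leases; here they are passed in directly.
func compactionNeeded(fullRev, deltaRev, eventsThreshold int64) bool {
	return deltaRev-fullRev >= eventsThreshold
}

func main() {
	fmt.Println(compactionNeeded(1000, 1500, 1000)) // false: only 500 revisions ahead
	fmt.Println(compactionNeeded(1000, 2500, 1000)) // true: 1500 revisions ahead
}
```

Note that once a new full snapshot is taken, `fullRev` catches up and this check turns false again, which is exactly the situation the rest of this issue is concerned with.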
All the logic for reconciling the compaction job lies within `reconcileJob()`. The issue with the current logic in the compaction controller is that this method is called only when the compaction threshold is reached. If a new full (regularly scheduled) snapshot is uploaded by the leading backup sidecar while a compaction job is still running, the compaction controller never runs `reconcileJob()`, because the snapshot revision threshold is no longer reached. This prevents the controller from reconciling the currently running compaction job, either to cut it short or at least to wait for its completion and export the correct metrics for it.

Additionally, helper functions such as `getCompactionJobVolumeMounts()`, `getCompactionJobVolumes()` and `getCompactionJobEnvVar()` are used to generate the compaction job spec. Each of these functions calls `utils.StorageProviderFromInfraProvider(etcd.Spec.Backup.Store.Provider)`, which returns the storage provider for the compaction job, or an error if the provider is unrecognised. The helpers do not handle this error correctly: they simply log it and return an incomplete result, which causes the compaction job pod to fail to start due to a misconfigured job spec. Ideally, the call to `utils.StorageProviderFromInfraProvider()` should be made, and its error handled, within the `Reconcile()` method itself, before trying to generate the job spec. The store nil check made initially in `Reconcile()` is also repeated redundantly in all the helper functions, and should be removed from them.

There are other instances in the compaction controller where errors are simply logged rather than handled, instead of blocking the deployment of the compaction job, for instance:
etcd-druid/controllers/compaction/reconciler.go
Lines 389 to 392 in 00274d0
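One possible shape for the error-handling fix is sketched below. Everything here is hypothetical and self-contained: `resolveProvider` is a stand-in for `utils.StorageProviderFromInfraProvider()` with an illustrative subset of providers, and `jobSpec`, `buildJobSpec`, `envFor` and the `STORAGE_PROVIDER` variable name are invented for illustration. The point is only that the store check and provider lookup happen once, up front, and a failure stops job creation instead of being logged.

```go
package main

import (
	"errors"
	"fmt"
)

// jobSpec is a hypothetical, simplified stand-in for the compaction job spec.
type jobSpec struct {
	Provider string
	Env      []string
}

var errStoreNotConfigured = errors.New("backup store is not configured")

// resolveProvider is a hypothetical stand-in for
// utils.StorageProviderFromInfraProvider; the mapping is illustrative only.
func resolveProvider(infra string) (string, error) {
	switch infra {
	case "aws", "S3":
		return "S3", nil
	case "gcp", "GCS":
		return "GCS", nil
	default:
		return "", fmt.Errorf("unsupported storage provider: %q", infra)
	}
}

// envFor receives an already-validated provider, so it needs neither a nil
// check nor its own error handling.
func envFor(provider string) []string {
	return []string{"STORAGE_PROVIDER=" + provider}
}

// buildJobSpec validates the store and resolves the provider exactly once.
// Any failure is returned to the caller, which should not deploy a job.
func buildJobSpec(infra *string) (*jobSpec, error) {
	if infra == nil {
		return nil, errStoreNotConfigured
	}
	provider, err := resolveProvider(*infra)
	if err != nil {
		return nil, fmt.Errorf("cannot build compaction job spec: %w", err)
	}
	return &jobSpec{Provider: provider, Env: envFor(provider)}, nil
}

func main() {
	bad := "floppy"
	if _, err := buildJobSpec(&bad); err != nil {
		fmt.Println("job not created:", err)
	}
}
```

With this structure the helpers that build volumes, volume mounts and environment variables can take the resolved provider as a plain argument, which removes both the repeated lookups and the repeated nil checks.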
Expected behavior:
Reconciliation should be precise, and must handle all cases of snapshots being taken by leading backup sidecar and compaction job. Errors must be correctly handled.
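To make the expected behaviour concrete, here is a hedged sketch of a decision function in which the state of any existing job is examined before the threshold. The types `jobState` and `action` and the function `decide` are hypothetical and do not exist in the controller; the actions are simplified labels, not real API calls.

```go
package main

import "fmt"

// jobState is a hypothetical summary of the compaction job's status.
type jobState int

const (
	jobAbsent jobState = iota
	jobRunning
	jobSucceeded
	jobFailed
)

// action is a hypothetical label for what the reconciler should do next.
type action int

const (
	noop action = iota
	createJob
	requeue
	recordMetricsAndCleanup
)

// decide looks at the existing job first. The threshold only matters when
// no job exists, so a full snapshot taken mid-compaction cannot cause a
// running or finished job to be ignored.
func decide(state jobState, thresholdReached bool) action {
	switch state {
	case jobRunning:
		return requeue
	case jobSucceeded, jobFailed:
		return recordMetricsAndCleanup
	default:
		if thresholdReached {
			return createJob
		}
		return noop
	}
}

func main() {
	// A job is running, but a new full snapshot reset the threshold.
	fmt.Println(decide(jobRunning, false) == requeue) // true
	// No job exists and the threshold has been reached.
	fmt.Println(decide(jobAbsent, true) == createJob) // true
}
```

Whether a running job should be cut short when the threshold is no longer reached is left open in this sketch; it simply keeps requeueing until the job finishes.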