[BUG] Improve error handling and reconciliation logic for compaction controller #698

shreyas-s-rao · 2023-10-04T09:32:41Z

Describe the bug:
Compaction controller's purpose is to deploy a compaction job if it detects that a compaction job is necessary, due to possible triggers such as delta snapshot revision exceeding full snapshot revision by a given threshold (etcd-events-threshold), and in the future even upon the total delta snapshots' size crossing a certain threshold. Information about snapshot revisions is available today from snapshot leases maintained by the leading backup sidecar, and in the future will be published via EtcdMember resource status.

Once the compaction job has been deployed, the controller requeues reconciliation. Until the compaction job succeeds or fails, the reconciliation must keep requeueing and checking compaction job status. Once compaction job succeeds/fails, the snapshot revision information will be updated by the compaction job itself, and compaction controller will know not to deploy a new compaction job until the next time a snapshot compaction trigger is received.

All the logic for reconciling the compaction job lies within reconcileJob(). The issue with the current logic in compaction controller is that this method is called only when the compaction threshold is reached. But if a new full (regular scheduled) snapshot is uploaded by the leading backup sidecar while a compaction job is still running, then compaction controller never runs reconcileJob() because the snapshot revision threshold is no longer reached. This prevents the controller from reconciling the currently running compaction job to either cut it short, or atleast wait for completion and export the correct metrics for it.

Additionally, there are certain helper functions such as getCompactionJobVolumeMounts(), getCompactionJobVolumes(), getCompactionJobEnvVar(), which are used to generate the compaction job spec. All these functions make a call utils.StorageProviderFromInfraProvider(etcd.Spec.Backup.Store.Provider), which returns the storage provider for the compaction job, and returns an error if the provider is unrecognised. But these functions do not correctly handle the error, and simply log the error and return an incomplete result which will cause the compaction job pod to fail to start due to a misconfigured job spec. Ideally, the call to utils.StorageProviderFromInfraProvider() must be made and error should be correctly handled within Reconcile() method itself, even before trying to generate the job spec. Additionally, the store nil check being made initially in Reconcile() is also being made redundantly in all the helper functions, and must be removed.

There are other instances in the compaction controller where errors are not handled correctly, and simply logged, rather than blocking the deployment of the compaction job, for instance

etcd-druid/controllers/compaction/reconciler.go

Lines 389 to 392 in 00274d0

    
           if storeValues.SecretRef == nil { 
        
           	logger.Info("No secretRef is configured for backup store. Compaction job will fail as no storage could be configured.", 
        
           		"namespace", etcd.Namespace, "name", etcd.Name) 
        
           	return vs

. These too need to be correctly handled.

Expected behavior:
Reconciliation should be precise, and must handle all cases of snapshots being taken by leading backup sidecar and compaction job. Errors must be correctly handled.

How To Reproduce (as minimally and precisely as possible):

Logs:

Screenshots (if applicable):

Environment (please complete the following information):

Etcd version/commit ID :
Etcd-druid version/commit ID :
Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:

The text was updated successfully, but these errors were encountered:

shreyas-s-rao · 2023-10-12T05:31:47Z

/assign @abdasgupta

abdasgupta · 2023-10-16T05:41:57Z

Already working on this

shreyas-s-rao · 2023-11-19T07:44:53Z

Resolved by #711
/close

shreyas-s-rao added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related exp/beginner Issue that requires only basic skills kind/bug Bug labels Oct 4, 2023

gardener-robot assigned abdasgupta Oct 12, 2023

shreyas-s-rao mentioned this issue Oct 12, 2023

[BUG] Make the metric metricsjobcurrent capture accurate job end time #685

Closed

shreyas-s-rao mentioned this issue Nov 1, 2023

Reconciles compaction job on job status change along with snapshot lease change #711

Merged

gardener-robot closed this as completed Nov 19, 2023

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 19, 2023

shreyas-s-rao added this to the v0.21.0 milestone Nov 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Improve error handling and reconciliation logic for compaction controller #698

[BUG] Improve error handling and reconciliation logic for compaction controller #698

shreyas-s-rao commented Oct 4, 2023

shreyas-s-rao commented Oct 12, 2023

abdasgupta commented Oct 16, 2023

shreyas-s-rao commented Nov 19, 2023

[BUG] Improve error handling and reconciliation logic for compaction controller #698

[BUG] Improve error handling and reconciliation logic for compaction controller #698

Comments

shreyas-s-rao commented Oct 4, 2023

shreyas-s-rao commented Oct 12, 2023

abdasgupta commented Oct 16, 2023

shreyas-s-rao commented Nov 19, 2023