
fix(instrumentor): reconcile device on restarts #1710

Merged
11 commits merged on Nov 10, 2024

Conversation

blumamir (Collaborator) commented Nov 8, 2024

At the moment, the sequence of events is:

  1. the user deletes a source from the UI
  2. the label is removed from the workload object (Deployment, etc.)
  3. the instrumentor detects that the label was removed and deletes the instrumented application and instrumentation config
  4. when the instrumented application is deleted, the device is cleared from the workload pod spec by the relevant controller

The device should be removed whenever the instrumented application is deleted, and specifically in these cases:

  • the instrumentor restarts after the instrumented application has been deleted but before the device cleanup was applied.
  • two controllers perform the device logic at the same time: one deletes the device from the workload, while the other, possibly working from stale state, attempts to reintroduce it.

I reproduced these cases locally with the old version, observed that the device was not cleaned up properly in these edge cases, and then ran the changes in this PR to verify they are fixed.
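
For context, "clearing the device from the workload pod spec" means removing the Odigos device resource from the containers in the pod template. Below is a minimal sketch of that operation, assuming the device is exposed as a resource limit whose name starts with instrumentation.odigos.io/; the helper name and prefix are illustrative, not the code in this PR.

package device

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// removeOdigosDevices deletes any instrumentation device resources from every
// container in a pod template and reports whether anything changed.
// The "instrumentation.odigos.io/" prefix is an assumption for illustration.
func removeOdigosDevices(tmpl *corev1.PodTemplateSpec) bool {
	changed := false
	for i := range tmpl.Spec.Containers {
		limits := tmpl.Spec.Containers[i].Resources.Limits
		for name := range limits {
			if strings.HasPrefix(string(name), "instrumentation.odigos.io/") {
				delete(limits, name)
				changed = true
			}
		}
	}
	return changed
}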

edeNFed (Contributor) left a comment

🚀

// based on the current state of the cluster.
// if the instrumented application is deleted but the device is not cleaned,
// the instrumented application controller will not be invoked after restart, which is why we need to handle this case here.
return true
Contributor

Maybe we should add a check here that the odigos label is present?
Since after this change, when the instrumentor starts, it will get an event for each workload in the cluster - even those which odigos does not care about.
For those, the new check added in this PR below (if apierrors.IsNotFound(err)) will be true.
That means that even for workloads which were never instrumented at all, we will call removeInstrumentationDeviceFromWorkload.

blumamir (Collaborator, Author)

We need to handle workloads which have no label.

Consider the following sequence:

  • the label is removed from a workload which has the device
  • the instrumented application is deleted
  • the instrumentor goes down

When the instrumentor starts, we need to remove the device from this workload. If we add the filter you suggested, nothing will remove the device from this workload.

RonFed (Contributor), Nov 9, 2024

How about checking if there is a device present in the pod spec? Since we have the workload object here I think we can check that.

I assume it will be fine with the current approach as well, but since this has the potential of bringing a huge batch of events, we should try to handle them as fast as possible by filtering here.
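
A minimal sketch of that filtering idea, assuming a controller-runtime predicate on Deployment events; hasOdigosDevice and the resource-name prefix are hypothetical stand-ins, not this PR's code:

package device

import (
	"strings"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// hasOdigosDevice reports whether any container in the pod template still carries
// an instrumentation device resource (the name prefix is an assumption).
func hasOdigosDevice(tmpl *corev1.PodTemplateSpec) bool {
	for _, c := range tmpl.Spec.Containers {
		for name := range c.Resources.Limits {
			if strings.HasPrefix(string(name), "instrumentation.odigos.io/") {
				return true
			}
		}
	}
	return false
}

// deploymentDevicePredicate lets Create events through only when the pod template
// actually has a device (or the object is not a Deployment), so the full resync on
// instrumentor startup stays cheap.
var deploymentDevicePredicate = predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool {
		dep, ok := e.Object.(*appsv1.Deployment)
		if !ok {
			return true
		}
		return hasOdigosDevice(&dep.Spec.Template)
	},
}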

RonFed (Contributor) left a comment

I think we should refine the Create condition

instrumentor/controllers/instrumentationdevice/common.go (outdated comment, resolved)
}

-	result, err := controllerutil.CreateOrPatch(ctx, kubeClient, workloadObj, func() error {
+	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
Contributor

I think we should verify that this behaves as we expect by doing a large batch of un-instrumentation and adding some logs here for debugging.
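
A rough sketch of the pattern the diff switches to, with a per-attempt debug log as suggested; the function name and logger wiring are assumptions, not this PR's code:

package device

import (
	"context"

	"github.com/go-logr/logr"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateWorkloadWithRetry refetches the workload on every attempt so each retry
// starts from the latest resourceVersion, and logs attempts for debugging.
func updateWorkloadWithRetry(ctx context.Context, c client.Client, logger logr.Logger, obj client.Object, mutate func() error) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj); err != nil {
			return err
		}
		logger.V(1).Info("updating workload", "namespace", obj.GetNamespace(), "name", obj.GetName())
		if err := mutate(); err != nil {
			return err
		}
		return c.Update(ctx, obj)
	})
}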

instrumentor/controllers/instrumentationdevice/common.go (outdated comment, resolved)
// if we didn't change anything, we don't need to update the object
// skip the api-server call, return no-op and skip the log message
if !webhookLabelRemoved && !deviceRemoved && !envChanged {
	return nil
Contributor

Maybe we should return a custom error here indicating that nothing has been done?
This is for the logging below - to avoid logging for irrelevant workloads.

Another option is to set a bool updatedWorkload in the removeInstrumentationDeviceFromWorkload scope - initialize it to true and set it to false here. This would let us know whether nil was returned because of a successful update or because of this case.
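
One possible shape for the custom-error option (a sketch; errNoChanges and handleRemovalResult are names invented here, not this PR's code): the mutate function would return the sentinel from the no-change branch above, and the caller would treat it as a no-op instead of logging.

package device

import "errors"

// errNoChanges signals that nothing needed to be removed from the workload, so the
// caller can skip the api-server update and avoid logging for irrelevant workloads.
var errNoChanges = errors.New("no instrumentation device changes needed")

// handleRemovalResult turns the sentinel into a silent no-op while passing every
// other error through to the usual logging path.
func handleRemovalResult(err error) error {
	if errors.Is(err, errNoChanges) {
		return nil
	}
	return err
}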

Comment on lines +48 to +58
if apierrors.IsNotFound(err) {
	// if there is no instrumented application, make sure the device is removed from the workload pod template manifest
	workloadName, workloadKind, err := workload.ExtractWorkloadInfoFromRuntimeObjectName(instrumentedAppName)
	if err != nil {
		return err
	}
	err = removeInstrumentationDeviceFromWorkload(ctx, k8sClient, namespace, workloadKind, workloadName, ApplyInstrumentationDeviceReasonNoRuntimeDetails)
	return err
} else {
	return err
}
Contributor

nit:

Suggested change (drop the else branch and fall through to return err):

if apierrors.IsNotFound(err) {
	// if there is no instrumented application, make sure the device is removed from the workload pod template manifest
	workloadName, workloadKind, err := workload.ExtractWorkloadInfoFromRuntimeObjectName(instrumentedAppName)
	if err != nil {
		return err
	}
	err = removeInstrumentationDeviceFromWorkload(ctx, k8sClient, namespace, workloadKind, workloadName, ApplyInstrumentationDeviceReasonNoRuntimeDetails)
}
return err

return err
}
err = removeInstrumentationDeviceFromWorkload(ctx, k8sClient, namespace, workloadKind, workloadName, ApplyInstrumentationDeviceReasonNoRuntimeDetails)
return err
Contributor

This means that a conflict error will be printed by the controller runtime, right?

@RonFed RonFed merged commit 3b6e41a into odigos-io:main Nov 10, 2024
26 checks passed