Fix race condition in agent Traceflow controller #5954
LGTM
c.runningTraceflowsMutex.Unlock()
// This may happen if a Traceflow is assigned with a tag that was just released from an old Traceflow but
// the agent hasn't processed the deletion event of the old Traceflow yet.
if ok && tfState.name != traceflowName {
Can we see a different TF with the same name before we process the deletion?
It's possible in theory: if we create a TF, delete it, then create a new one with the same name, the same tag may be assigned to the new one. If the agent processes the creation of the second TF before the deletion of the first, it will hit this case.
But this shouldn't happen with antctl, which appends a random string to the name.
To prevent it entirely, we should use the UID to identify a TF.
I might have misunderstood your question. Now I think you are asking in which case the current bug happens?
It's due to the concurrent workers: we receive the deletion of the old TF before the creation of the new TF, but after the events are dispatched to workers, one worker may run faster than another. We then end up processing the creation first.
No, you understood the question correctly.
Yes, I was thinking we could store the UID in the Traceflow "state" and use something like tfState.UID != tf.UID here. We could probably keep indexing the Traceflow state map on the name. Would more changes be required?
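As a rough illustration of the UID comparison suggested here, the sketch below stores a UID in a simplified state struct and compares it instead of the name. The struct fields, the `staleOwner` helper, and the choice to look up state by tag are assumptions made for the example, not the actual agent code.

```go
package main

import "fmt"

// traceflowState is a simplified, hypothetical version of the per-Traceflow
// state kept by the controller.
type traceflowState struct {
	name string
	uid  string
	tag  int8
}

// staleOwner reports whether the tag is currently held by a Traceflow object
// other than the one identified by uid. Comparing UIDs rather than names
// also catches a Traceflow that was deleted and recreated with the same name.
func staleOwner(running map[int8]*traceflowState, tag int8, uid string) bool {
	st, ok := running[tag]
	return ok && st.uid != uid
}

func main() {
	running := map[int8]*traceflowState{
		1: {name: "tf-a", uid: "uid-old", tag: 1},
	}
	fmt.Println(staleOwner(running, 1, "uid-new")) // true: same tag, recreated object
	fmt.Println(staleOwner(running, 1, "uid-old")) // false: same object
	fmt.Println(staleOwner(running, 2, "uid-new")) // false: tag not in use
}
```

A name comparison would return false in the first case if the recreated Traceflow reused the name "tf-a", which is the gap discussed in this thread.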
Yes, it can be done. Fixed.
a nit.
It may happen that a Traceflow is assigned a tag that was just released from an old Traceflow, but the controller hasn't processed the deletion event of the old Traceflow yet. Previously the controller skipped starting the new Traceflow if the tag was already in use, which caused the Traceflow to time out.
The commit adds a check when determining whether it should start a Traceflow. If the tag is associated with another Traceflow, the controller cleans up the old one, then starts a new trace for the current one.
It also fixes a bug in cleanupTraceflow, which might uninstall flows for another Traceflow if the tag is reassigned.
Signed-off-by: Quan Tian <qtian@vmware.com>
1ef41eb
LGTM
/skip-all
It may happen that a Traceflow is assigned a tag that was just released from an old Traceflow, but the controller hasn't processed the deletion event of the old Traceflow yet. Previously the controller skipped starting the new Traceflow if the tag was already in use, which caused the Traceflow to time out.
The commit adds a check when determining whether it should start a Traceflow. If the tag is associated with another Traceflow, the controller cleans up the old one, then starts a new trace for the current one.
It also fixes a bug in cleanupTraceflow, which might uninstall flows for another Traceflow if the tag is reassigned.
Fixes #5760
Fixes #5609
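The behavior described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: the `controller` type, its `start` and `cleanup` methods, and the `flows` map (a stand-in for installed flows) are all hypothetical, and ownership is compared by UID as discussed in the review.

```go
package main

import (
	"fmt"
	"sync"
)

// traceflowState is a simplified, hypothetical per-tag state.
type traceflowState struct {
	name string
	uid  string
}

// controller is a hypothetical model of the agent's Traceflow bookkeeping.
type controller struct {
	mu      sync.Mutex
	running map[int8]*traceflowState // tag -> Traceflow using that tag
	flows   map[int8]string          // tag -> UID owning the installed flows
}

func newController() *controller {
	return &controller{
		running: map[int8]*traceflowState{},
		flows:   map[int8]string{},
	}
}

// cleanupLocked releases the tag only if uid still owns it. Callers must
// hold c.mu.
func (c *controller) cleanupLocked(uid string, tag int8) bool {
	st, ok := c.running[tag]
	if !ok || st.uid != uid {
		return false // tag was reassigned; leave the new owner's flows alone
	}
	delete(c.running, tag)
	delete(c.flows, tag)
	return true
}

// cleanup handles a deletion event for the Traceflow identified by uid.
func (c *controller) cleanup(uid string, tag int8) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.cleanupLocked(uid, tag)
}

// start begins a trace. If the tag is still held by another Traceflow, that
// one is cleaned up first instead of skipping the new Traceflow.
func (c *controller) start(name, uid string, tag int8) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if old, ok := c.running[tag]; ok {
		if old.uid == uid {
			return // already running
		}
		c.cleanupLocked(old.uid, tag)
	}
	c.running[tag] = &traceflowState{name: name, uid: uid}
	c.flows[tag] = uid
}

func main() {
	c := newController()
	c.start("tf", "uid-old", 1)
	c.start("tf", "uid-new", 1)           // creation processed before the old deletion
	fmt.Println(c.cleanup("uid-old", 1)) // false: late deletion must not touch the new owner
	fmt.Println(c.running[1].uid)        // uid-new
}
```

The ownership check in `cleanupLocked` is what keeps a late deletion of the old Traceflow from removing state that now belongs to the new one.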