Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-2077: use syncMap to handle concurrency issues in intf watcher #535

Merged
merged 1 commit into from
Jan 30, 2025

Conversation

msherif1234
Copy link
Contributor

@msherif1234 msherif1234 commented Jan 28, 2025

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use syncMap instead of regular map
Fixes #524

test cases:

  • multiple agent pods restarts via fc config changes no sign of error/restarts
  • test FD and netlink leaks
  • ran script to create and delete 10 ns while agent is up
  • bring up 10 netns 1st then bring up the agent

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jan 28, 2025

@msherif1234: This pull request references NETOBSERV-2077 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use mutex locking around the map usage

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 28, 2025
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jan 28, 2025

@msherif1234: This pull request references NETOBSERV-2077 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use mutex locking around the map usage

Fixes #524

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

New images:
quay.io/netobserv/ebpf-bytecode:53ecb4a
quay.io/netobserv/netobserv-ebpf-agent:53ecb4a

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=53ecb4a make set-agent-image

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jan 28, 2025

@msherif1234: This pull request references NETOBSERV-2077 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use mutex locking around the map usage

Fixes #524

test case:

  • test FD and netlink leaks
  • ran script to create and delete 10 ns while agent is up
  • bring up 10 netns 1st then bring up the agent

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jan 28, 2025

@msherif1234: This pull request references NETOBSERV-2077 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use mutex locking around the map usage

Fixes #524

test cases:

  • test FD and netlink leaks
  • ran script to create and delete 10 ns while agent is up
  • bring up 10 netns 1st then bring up the agent

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jan 28, 2025

@msherif1234: This pull request references NETOBSERV-2077 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use mutex locking around the map usage

Fixes #524

test cases:

  • multiple agent pods restarts via fc config changes no sign of error/restarts
  • test FD and netlink leaks
  • ran script to create and delete 10 ns while agent is up
  • bring up 10 netns 1st then bring up the agent

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 28, 2025
@msherif1234
Copy link
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 28, 2025
@msherif1234 msherif1234 requested a review from jotak January 28, 2025 21:17
Copy link

New images:
quay.io/netobserv/ebpf-bytecode:fbb0399
quay.io/netobserv/netobserv-ebpf-agent:fbb0399

These will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=fbb0399 make set-agent-image

@msherif1234
Copy link
Contributor Author

cpu and memory is about the same or little bit less with no workload
image

@msherif1234 msherif1234 changed the title NETOBSERV-2077: use mutex around map access to handle concurrency NETOBSERV-2077: use syncMap to handle concurrency issues in intf watcher go Jan 28, 2025
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jan 28, 2025

@msherif1234: This pull request references NETOBSERV-2077 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use syncMap instead of regular map
Fixes #524

test cases:

  • multiple agent pods restarts via fc config changes no sign of error/restarts
  • test FD and netlink leaks
  • ran script to create and delete 10 ns while agent is up
  • bring up 10 netns 1st then bring up the agent

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

codecov bot commented Jan 29, 2025

Codecov Report

Attention: Patch coverage is 11.42857% with 31 lines in your changes missing coverage. Please review.

Project coverage is 27.61%. Comparing base (528ec8c) to head (6feda50).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/ifaces/watcher.go 11.42% 30 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #535      +/-   ##
==========================================
- Coverage   27.68%   27.61%   -0.08%     
==========================================
  Files          46       46              
  Lines        4930     4940      +10     
==========================================
- Hits         1365     1364       -1     
- Misses       3463     3473      +10     
- Partials      102      103       +1     
Flag Coverage Δ
unittests 27.61% <11.42%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/ifaces/watcher.go 48.85% <11.42%> (-3.59%) ⬇️

@openshift-ci openshift-ci bot added the lgtm label Jan 29, 2025
@msherif1234 msherif1234 changed the title NETOBSERV-2077: use syncMap to handle concurrency issues in intf watcher go NETOBSERV-2077: use syncMap to handle concurrency issues in intf watcher Jan 29, 2025
@memodi
Copy link
Contributor

memodi commented Jan 30, 2025

no crash seen after running for 22h:

$ oc get pods -n netobserv-privileged
NAME                         READY   STATUS    RESTARTS   AGE
netobserv-ebpf-agent-6j96w   1/1     Running   0          22h
netobserv-ebpf-agent-7lh42   1/1     Running   0          22h
netobserv-ebpf-agent-cmgdh   1/1     Running   0          22h
netobserv-ebpf-agent-h4dv7   1/1     Running   0          22h
netobserv-ebpf-agent-l87fp   1/1     Running   0          22h
netobserv-ebpf-agent-lx5sh   1/1     Running   0          22h

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Jan 30, 2025
@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Jan 30, 2025

@msherif1234: This pull request references NETOBSERV-2077 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

there is race condition when accessing nsDone map from multiple go routines this PR refactor the code and use syncMap instead of regular map
Fixes #524

test cases:

  • multiple agent pods restarts via fc config changes no sign of error/restarts
  • test FD and netlink leaks
  • ran script to create and delete 10 ns while agent is up
  • bring up 10 netns 1st then bring up the agent

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@msherif1234
Copy link
Contributor Author

/approve

Copy link

openshift-ci bot commented Jan 30, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@msherif1234 msherif1234 merged commit cbc1c6f into netobserv:main Jan 30, 2025
12 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved jira/valid-reference lgtm ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. qe-approved QE has approved this pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

concurrency issue when the agent pods is restarted
5 participants