Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run Unbind/Bind VF in a serial manner. #30

Merged

Conversation

adrianchiris
Copy link
Collaborator

@adrianchiris adrianchiris commented May 10, 2020

Unbind/Bind VF as well as moving RDMA device to namespace
causes rdma resources to be re-created.
CNI may be invoked in parallel and kernel may provide a VF
RDMA resources under a different name.

As the mapping of RDMA resources is done in Device plugin
prior to CNI invocation, it must not change during CNI invocation.

We serialize the CNI's Add,Del operations by Creating/Locking/Unlocking
a file, causing kernel to allocate RDMA resources the same name
for a given VF.

In the future, Systems should use udev PCI based RDMA device
names, ensuring consistent RDMA resources names.

@adrianchiris
Copy link
Collaborator Author

Missing testing on a real system

@coveralls
Copy link

coveralls commented May 10, 2020

Pull Request Test Coverage Report for Build 72

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 42.715%

Totals Coverage Status
Change from base Build 95: 0.0%
Covered Lines: 214
Relevant Lines: 501

💛 - Coveralls

@adrianchiris adrianchiris force-pushed the serialize-unbind-bind-op branch from dad1d62 to 4bf4694 Compare May 10, 2020 16:17
Unbind/Bind VF as well as moving RDMA device to namespace
causes rdma resources to be re-created.
CNI may be invoked in parallel and kernel may provide a VF
RDMA resources under a different name.

As the mapping of RDMA resources is done in Device plugin
prior to CNI invocation, it must not change during CNI invocation.

We serialize the CNI's Add,Del operations by Creating/Locking/Unlocking
a file, causing kernel to allocate RDMA resources the same name
for a given VF.

In the future, Systems should use udev PCI based RDMA device
names, ensuring consistent RDMA resources names.
@adrianchiris adrianchiris force-pushed the serialize-unbind-bind-op branch from 4bf4694 to 7de7717 Compare May 11, 2020 11:00
@@ -82,6 +108,14 @@ func cmdAdd(args *skel.CmdArgs) error {
}

sm := sriov.NewSriovManager()

// Lock CNI operation to serialize the operation
lock, err := lockCNIExecution()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be noted, that while further optimizations where possible both here and in CmdDel (unlock eralier than on first error or end of execution)
it would have introduced (imo) un-needed code complexity

Copy link
Collaborator

@moshe010 moshe010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about catching signals to remove the lock?
Did you check what will happen if you put sleep of 5 min in the CNI? will kubernetes kill it?

@@ -230,6 +264,13 @@ func cmdDel(args *skel.CmdArgs) error {
}
defer netns.Close()

// Lock CNI operation to serialize the operation
lock, err := lockCNIExecution()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already locked from line 113

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, ill remove

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at this again, i originally thought that i mistakenly locked twice. but this is not the case.
this lock is for cmdDel and on line 113 its for cmdAdd

@moshe010 moshe010 changed the title [WIP] Run Unbind/Bind VF in a serial manner. Run Unbind/Bind VF in a serial manner. May 12, 2020
@moshe010
Copy link
Collaborator

please fix lint errors

This is required to release the lock in case of a signal sent to
CNI.

Note: While operating systems are supposed to free a locked file
if the PID of the locking process does not exist, they may not
do so promptly.
@adrianchiris adrianchiris force-pushed the serialize-unbind-bind-op branch from 205901f to f842f5f Compare May 12, 2020 17:58
@adrianchiris
Copy link
Collaborator Author

forgot to push :\ it should pass now

@moshe010 moshe010 merged commit a6ffd9f into k8snetworkplumbingwg:master May 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants