Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

device: Do not rescan PCI bus #782

Closed

Conversation

devimc
Copy link

@devimc devimc commented May 7, 2020

PCI bus rescan code was added long time ago in Clear Containers due to lack of
ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug
in Kata Containers. A workaround to this issue is the "lazy attach"
mechanism [2] that hotplugs LBS (Large BAR space) devices after re-scanning the
PCI bus, unfourtunately some non-LBS devices are being affected too, for
instance SR-IOV devices. It would not make sense to lazy-attach non-LBS
devices because kata will end up lazy-attaching all the devices, having said
that, the PCI bus rescan code and the "lazy attach" mechanism should be removed

Depends-on: github.com/kata-containers/runtime#2670
fixes #781
fixes kata-containers/runtime#2664

[1] clearcontainers/agent#139
[2] kata-containers/runtime#2461

Signed-off-by: Julio Montes julio.montes@intel.com

@fidencio
Copy link
Member

fidencio commented May 7, 2020

/test-vfio

devimc pushed a commit to devimc/kata-runtime that referenced this pull request May 7, 2020
The "lazy attach" mechanism [1] was added to hotplugs LBS (Large BAR space)
devices after re-scanning the PCI bus, fixing LBS hotplug in kata containers.
Since PCI rescan is removed in kata-containers/agent#782, lazy attach is not
longer needed.

Depends-on: github.com/kata-containers/agent#782
fixes kata-containers#2664

[1] kata-containers#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
@devimc devimc force-pushed the topic/agent/noRescanPCI branch from 5b1ddf6 to 379f833 Compare May 7, 2020 20:43
@devimc devimc added the do-not-merge PR has problems or depends on another label May 7, 2020
@devimc
Copy link
Author

devimc commented May 7, 2020

/test

@codecov
Copy link

codecov bot commented May 7, 2020

Codecov Report

Merging #782 into master will decrease coverage by 0.08%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #782      +/-   ##
==========================================
- Coverage   60.15%   60.06%   -0.09%     
==========================================
  Files          17       17              
  Lines        2665     2672       +7     
==========================================
+ Hits         1603     1605       +2     
- Misses        900      906       +6     
+ Partials      162      161       -1     

devimc pushed a commit to devimc/kata-runtime that referenced this pull request May 8, 2020
The "lazy attach" mechanism [1] was added to hotplugs LBS (Large BAR space)
devices after re-scanning the PCI bus, fixing LBS hotplug in kata containers.
Since PCI rescan is removed in kata-containers/agent#782, lazy attach is not
longer needed.

Depends-on: github.com/kata-containers/agent#782
fixes kata-containers#2664

[1] kata-containers#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
Copy link
Member

@fidencio fidencio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this looks quite interesting.

We're being hit by a workaround of an old QEMU bug. Not exactly related to this PR, but we really should document the workarounds for this and that version of the packages we rely on, mainly to revisit that in the future when we bump our minimum requirements.

Code wise, it looks good. Conceptual wise, it looks good.

I'll add the "Approve" once @amorenoz mentions his tests are all good, introducing the "AaaS" concept ("Ack-as-a-Service").

@devimc devimc removed the do-not-merge PR has problems or depends on another label May 8, 2020
@devimc devimc force-pushed the topic/agent/noRescanPCI branch from 379f833 to ebd5b2a Compare May 8, 2020 15:58
@devimc
Copy link
Author

devimc commented May 8, 2020

/test

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @devimc - the best type of PR 😄

Copy link
Member

@fidencio fidencio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving based on both @amorenoz and @bpradipt feedback!

@devimc
Copy link
Author

devimc commented May 13, 2020

/test-vfio

@amshinde
Copy link
Member

amshinde commented May 13, 2020

@devimc You mentioned that PCI bus rescan code was added long time ago in Clear Containers due to lack of ACPI support in QEMU 2.9 + q35 [1]. . Can you point me to the change where this was fixed.

Afaik, for q35 we hotplug devices on a pci-to-pci bridge and this mechanism makes use of SHPC itself and does not use ACPI at all.
See this comment from @sboeuf on an older issue : #380 (comment)
As you can see, the rescan was added to overcome the 5sec delay seen with SHPC.
I am not sure if things have changed this since, but if the 5 sec delay is no longer been seen(either the driver has been fixed or ACPI is being used), then I am good with this change.
I'll let you confirm @devimc.

cc @fidencio @amorenoz

(This is the original commit that added the rescan logic in Kata : #380)

@amorenoz
Copy link

@amshinde As far as I have seen, the 5s are still there. However, the rescan systematically breaks the pci devices due to a race condition between the shpc and the rescan. I don't see a clear way of avoiding that race that does not defeat the very same reason the rescan was added.

I know that the use of pcie-root-ports should be the preferred way in q35 machine-types but even there the rescan (which, IIUC, is not needed in that case) breaks the pcie hotplug.

@dgibson
Copy link
Contributor

dgibson commented Sep 3, 2020

Note that both shpc and PCI-E native hotplug have a delay (though I think the PCI-E one is a little shorter). Despite adding a delay, I think applying this patch is the correct thing to do:

As well as badly breaking devices because of the race, the rescan only ever improved the delay by accident. It's really a kernel bug that the rescan and hotplug break so badly. I've spoken to Alex Williamson (kernel VFIO maintainer) and he says it's a known problem, but considered a low priority, so this is unlikely to be fixed any time soon.

AFAICT, if the kernel bug were fixed it would likely mean that either:

  • The rescan would block waiting for the hotplug to complete, so we'd still get the 5s delay, or,
  • The rescan would complete without the new device showing up, so we'd still need to wait for the hotplug delay to actually have the device

So even if there wasn't the bad failure mode, the rescan still wouldn't accomplish what we want.

If we want to properly address that hotplug delay, we'd need to define a new hotplug protocol and teach the guest kernel and qemu to use it. Given that the 5s delay is there for the benefit of a human pressing buttons and watching blinkenlights when physically hotplugging a device, a faster virtual-only hotplug protocol would actually make a lot of sense for a bunch of applications, but it's not something we can accomplish in Kata alone.

So despite the extra delay, I think we should go ahead and apply this fix.

@bpradipt
Copy link
Contributor

bpradipt commented Sep 3, 2020

Note that both shpc and PCI-E native hotplug have a delay (though I think the PCI-E one is a little shorter). Despite adding a delay, I think applying this patch is the correct thing to do:

As well as badly breaking devices because of the race, the rescan only ever improved the delay by accident. It's really a kernel bug that the rescan and hotplug break so badly. I've spoken to Alex Williamson (kernel VFIO maintainer) and he says it's a known problem, but considered a low priority, so this is unlikely to be fixed any time soon.

AFAICT, if the kernel bug were fixed it would likely mean that either:

  • The rescan would block waiting for the hotplug to complete, so we'd still get the 5s delay, or,
  • The rescan would complete without the new device showing up, so we'd still need to wait for the hotplug delay to actually have the device

So even if there wasn't the bad failure mode, the rescan still wouldn't accomplish what we want.

If we want to properly address that hotplug delay, we'd need to define a new hotplug protocol and teach the guest kernel and qemu to use it. Given that the 5s delay is there for the benefit of a human pressing buttons and watching blinkenlights when physically hotplugging a device, a faster virtual-only hotplug protocol would actually make a lot of sense for a bunch of applications, but it's not something we can accomplish in Kata alone.

So despite the extra delay, I think we should go ahead and apply this fix.

@dgibson thanks for the explanation. I wonder if there is a good place to document these for any new contributors to understand the finer nuances. May be as a code comment or commit message ?

@dgibson
Copy link
Contributor

dgibson commented Sep 4, 2020

We could certainly update the commit message with that context. I don't see an obvious place to put it in the code, though.

@c3d
Copy link
Member

c3d commented Sep 4, 2020

I agree we want to have this change in.

@dgibson you mention a native delay. But with the current qemu on q35, do we still have this delay?

@dgibson
Copy link
Contributor

dgibson commented Sep 4, 2020

@dgibson you mention a native delay. But with the current qemu on q35, do we still have this delay?

The delay isn't in qemu, it's in the guest kernel.

PCI bus rescan code was added long time ago in Clear Containers due to lack of
ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug
in Kata Containers. A workaround to this issue is the "lazy attach"
mechanism [2] that hotplugs LBS (Large BAR space) devices after re-scanning the
PCI bus, unfourtunately some non-LBS devices are being affected too, for
instance SR-IOV devices. It would not make sense to lazy-attach non-LBS
devices because kata will end up lazy-attaching all the devices, having said
that, the PCI bus rescan code and the "lazy attach" mechanism should be removed

Depends-on: github.com/kata-containers/runtime#2670
fixes kata-containers#781
fixes kata-containers/runtime#2664

[1] clearcontainers/agent#139
[2] kata-containers/runtime#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
@devimc devimc force-pushed the topic/agent/noRescanPCI branch from ebd5b2a to b26f728 Compare September 4, 2020 12:58
@devimc
Copy link
Author

devimc commented Sep 4, 2020

PR rebased, but VFIO CI will fail

/test-vfio

@c3d
Copy link
Member

c3d commented Sep 4, 2020

@dgibson you mention a native delay. But with the current qemu on q35, do we still have this delay?

The delay isn't in qemu, it's in the guest kernel.

Do you mean the kernel implements the same delay required for physical hardware (or human operators) irrespective of whether it's a physical or virtual device?

@dgibson
Copy link
Contributor

dgibson commented Sep 5, 2020

@dgibson you mention a native delay. But with the current qemu on q35, do we still have this delay?

The delay isn't in qemu, it's in the guest kernel.

Do you mean the kernel implements the same delay required for physical hardware (or human operators) irrespective of whether it's a physical or virtual device?

Well, from the guest kernel's point of view a virtual device is (at least theoretically) indistinguishable from a physical one.

Plus I believe the delay is written into the PCI specs, and presumably predates the widespread use of virtual devices.

@jodh-intel jodh-intel added the port-to-2.0 PRs that need to be ported to kata 2.0-dev branch label Sep 7, 2020
@jodh-intel
Copy link
Contributor

@devimc - I think this needs to be ported to 2.0, but please could you verify?

@c3d
Copy link
Member

c3d commented Sep 7, 2020

@dgibson you mention a native delay. But with the current qemu on q35, do we still have this delay?

The delay isn't in qemu, it's in the guest kernel.

Do you mean the kernel implements the same delay required for physical hardware (or human operators) irrespective of whether it's a physical or virtual device?

Well, from the guest kernel's point of view a virtual device is (at least theoretically) indistinguishable from a physical one.

True, but there is also paravirtualization all over the place. Maybe the kernel could relax the historical PCI delays when running in a VM.

Plus I believe the delay is written into the PCI specs, and presumably predates the widespread use of virtual devices.

Thanks for answering my question.

@c3d
Copy link
Member

c3d commented Sep 7, 2020

@devimc - I think this needs to be ported to 2.0, but please could you verify?

The rust agent does indeed contain PCI rescan code.

c3d added a commit to c3d/kata-containers that referenced this pull request Sep 7, 2020
PCI bus rescan code was added long time ago in Clear Containers due to lack of
ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug
in Kata Containers. A workaround to this issue is the "lazy attach"
mechanism [2] that hotplugs LBS (Large BAR space) devices after re-scanning the
PCI bus, unfourtunately some non-LBS devices are being affected too, for
instance SR-IOV devices. It would not make sense to lazy-attach non-LBS
devices because kata will end up lazy-attaching all the devices, having said
that, the PCI bus rescan code and the "lazy attach" mechanism should be removed

Forward port of: kata-containers/agent#782
Fixes: kata-containers#683
Suggested-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
c3d added a commit to c3d/kata-containers that referenced this pull request Sep 7, 2020
PCI bus rescan code was added long time ago in Clear Containers due to
lack of ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing
up PCIe hotplug in Kata Containers. A workaround to this issue is the
"lazy attach" mechanism [2] that hotplugs LBS (Large BAR space)
devices after re-scanning the PCI bus, unfourtunately some non-LBS
devices are being affected too, for instance SR-IOV devices. It would
not make sense to lazy-attach non-LBS devices because kata will end up
lazy-attaching all the devices, having said that, the PCI bus rescan
code and the "lazy attach" mechanism should be removed

Forward port of: kata-containers/agent#782
Fixes: kata-containers#683
Suggested-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
@devimc
Copy link
Author

devimc commented Sep 7, 2020

@c3d

The rust agent does indeed contain PCI rescan code.

but we don't have a vfio CI in 2.0,

@c3d
Copy link
Member

c3d commented Sep 7, 2020

@c3d

The rust agent does indeed contain PCI rescan code.

but we don't have a vfio CI in 2.0,

@devimc: So let's wait until the go version passes?

Let's just make sure we don't forget about this later.

@devimc
Copy link
Author

devimc commented Sep 7, 2020

@c3d it won't pass until we reimplement the way hotplugged devices are handled in both runtime and agent

@dgibson
Copy link
Contributor

dgibson commented Sep 7, 2020

Well, from the guest kernel's point of view a virtual device is (at least theoretically) indistinguishable from a physical one.

True, but there is also paravirtualization all over the place. Maybe the kernel could relax the historical PCI delays when running in a VM.

Uh.. possibly. Working out criteria to do that safely will require some thought, though.

@bpradipt
Copy link
Contributor

/test-ubuntu

Copy link
Contributor

@dgibson dgibson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed the conversation, I think this is a necessary change, despite the extra delays.

@jodh-intel
Copy link
Contributor

VFIO CI failing:

14:23:45 + sudo kata-runtime --kata-config /tmp/tmp.koPBhUVcVY/configuration.toml run --detach -b /tmp/tmp.koPBhUVcVY/bundle --pid-file=/tmp/tmp.koPBhUVcVY/pid vfiotest
14:23:46 + check_eth_dev
14:23:46 + sudo kata-runtime --kata-config /tmp/tmp.koPBhUVcVY/configuration.toml exec vfiotest ip a
14:23:46 + grep eth
14:23:46 ++ handle_error 92
14:23:46 ++ local exit_code=1
14:23:46 Failed at 92: grep "eth"

@devimc - it might be useful to modify check_eth_dev() to save the output of ip a before doing the grep to aid with debugging.

@dgibson
Copy link
Contributor

dgibson commented Sep 17, 2020

Ok, I had a look at the tests code, and I'm pretty sure what's happening is that it's looking for the eth interface from the VFIO device more-or-less immediately after starting the container, whereas the hotplug will take several seconds. The rescan we used to have not only probed the device sooner, but it's synchronous, so it had the side effect of waiting for the scan to complete which meant a bunch of things were already set up once we started the container.

We have a couple of options:

  1. Put a delay or poll/timeout into the test case
  2. Add code to the agent to explicitly wait for the devices. I have some code in Allow VFIO devices to be passed through as VFIO devices #845 to do that. Currently it's only for the case of devices we want to appear as VFIO devices within the guest as well, but I could probably disconnect the parts. It will still be a little bit dodgy, because although we can wait for the PCI devices to appear, we don't know what interfaces or secondary devices will appear from it when attached to the guest driver, so we can't wait for those.

@jodh-intel, thoughts?

@jodh-intel
Copy link
Contributor

Add code to the agent to explicitly wait for the devices.

@dgibson - that sounds like the best option to me, since it's the least surprising.

It will still be a little bit dodgy, because although we can wait for the PCI devices to appear, we don't know what interfaces or secondary devices will appear from it when attached to the guest driver, so we can't wait for those.

That's a pita. @amshinde / @mcastelino - any thoughts on how we might handle this reliably?

One other consideration - whatever we decide needs to be implementable for the 2.0 agent at some future date (ideally before 2.0 :)

@dgibson
Copy link
Contributor

dgibson commented Sep 17, 2020

It will still be a little bit dodgy, because although we can wait for the PCI devices to appear, we don't know what interfaces or secondary devices will appear from it when attached to the guest driver, so we can't wait for those.

That's a pita. @amshinde / @mcastelino - any thoughts on how we might handle this reliably?

I should clarify: while it's true that we can't be sure that the expected secondary devices have appeared, I think that's also true of the existing rescan.

One other consideration - whatever we decide needs to be implementable for the 2.0 agent at some future date (ideally before 2.0 :)

Yes, I haven't looked at porting my patches to 2.0 yet, but it's on my list.

@jodh-intel jodh-intel added the needs-forward-port Changes need to be applied to a newer branch / repository label Oct 2, 2020
@jodh-intel
Copy link
Contributor

@devimc - please can you rebase as the branch is now conflicted?

Forward port PR: kata-containers/kata-containers#684.

@jodh-intel
Copy link
Contributor

@devimc - oh, or should this be closed in favour of #850?

@devimc
Copy link
Author

devimc commented Oct 2, 2020

closing in favour of #850 - thanks @jodh-intel

@devimc devimc closed this Oct 2, 2020
bpradipt pushed a commit to bpradipt/runtime that referenced this pull request Nov 27, 2020
The "lazy attach" mechanism [1] was added to hotplugs LBS (Large BAR space)
devices after re-scanning the PCI bus, fixing LBS hotplug in kata containers.
Since PCI rescan is removed in kata-containers/agent#782, lazy attach is not
longer needed.

fixes kata-containers#2664

[1] kata-containers#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs-forward-port Changes need to be applied to a newer branch / repository port-to-2.0 PRs that need to be ported to kata 2.0-dev branch
Projects
None yet
8 participants