
fix up remaining issues with pci hotplug and uevent listening #380

Merged
merged 3 commits into kata-containers:master on Sep 21, 2018

Conversation

bergwolf
Member

  1. rescan pci bus before waiting for a new device
  2. fix redefined notify channel when waiting for a new device

Fixes: #372

I made this depend on #375 since they both modify the same function and would otherwise conflict.

@bergwolf
Member Author

/test

@bergwolf bergwolf requested review from amshinde and sboeuf September 20, 2018 04:08
@codecov

codecov bot commented Sep 20, 2018

Codecov Report

Merging #380 into master will increase coverage by 0.04%.
The diff coverage is 33.33%.

@@            Coverage Diff             @@
##           master     #380      +/-   ##
==========================================
+ Coverage   46.47%   46.51%   +0.04%     
==========================================
  Files          16       16              
  Lines        2498     2498              
==========================================
+ Hits         1161     1162       +1     
+ Misses       1186     1185       -1     
  Partials      151      151

@jodh-intel
Contributor

jodh-intel commented Sep 20, 2018

lgtm

Approved with PullApprove

device.go Outdated
@@ -36,6 +36,7 @@ var (
sysBusPrefix = "/sys/bus/pci/devices"
pciBusPathFormat = "%s/%s/pci_bus/"
systemDevPath = "/dev"
pciBusRescanPath = "/sys/bus/pci/rescan"

This variable already exists in grpc.go as pciBusRescanFile.

device.go Outdated
fieldLogger.Info("Device found on pci device path")
return "", nil
// Rescan pci bus if we need to wait for a new pci device
if err = ioutil.WriteFile(pciBusRescanPath, []byte{'1'}, 0200); err != nil {

please use pciBusMode


Or could you implement a new function here to rescan the PCI bus, and use it in grpc.go as well?
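A minimal sketch of what such a shared helper could look like - the constant names follow the reviewers' pointers to grpc.go, while the function name, main, and error handling are illustrative, not the PR's actual code:

package main

import (
    "io/ioutil"
    "log"
)

const (
    // Assumed values; the agent keeps these in grpc.go per the review above.
    pciBusRescanFile = "/sys/bus/pci/rescan"
    pciBusMode       = 0200
)

// rescanPciBus asks the kernel to re-enumerate the PCI bus by writing
// "1" to the sysfs rescan trigger.
func rescanPciBus() error {
    return ioutil.WriteFile(pciBusRescanFile, []byte("1"), pciBusMode)
}

func main() {
    // Requires root; on an unprivileged host the write simply fails.
    if err := rescanPciBus(); err != nil {
        log.Fatalf("pci rescan failed: %v", err)
    }
}

device.go and grpc.go could then both call rescanPciBus() instead of duplicating the WriteFile call.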

amshinde and others added 3 commits September 20, 2018 23:04
…ing"

We should not check for the sysfs path associated with the PCI address,
as the device node may not have been created yet. Only a uevent with a
device name associated indicates that the device is ready.

This change was also causing the sandbox lock not to be released,
leaving the uevent listening loop locked permanently.

This reverts commit 7caf1c8.

Fixes kata-containers#379

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
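As a rough illustration of that readiness rule (a sketch, not the agent's actual listener): kobject uevents arrive as NUL-separated KEY=VALUE pairs, and only the presence of DEVNAME signals that a device node exists under /dev.

package main

import (
    "fmt"
    "strings"
)

// parseUevent splits a raw kobject uevent payload (NUL-separated
// "KEY=VALUE" pairs) into a map.
func parseUevent(raw []byte) map[string]string {
    ev := make(map[string]string)
    for _, f := range strings.Split(string(raw), "\x00") {
        if kv := strings.SplitN(f, "=", 2); len(kv) == 2 {
            ev[kv[0]] = kv[1]
        }
    }
    return ev
}

// deviceReady reports whether an "add" uevent announces a usable device
// node: only a uevent carrying DEVNAME means the node has been created.
func deviceReady(ev map[string]string) bool {
    return ev["ACTION"] == "add" && ev["DEVNAME"] != ""
}

func main() {
    raw := []byte("ACTION=add\x00DEVPATH=/devices/pci0000:00/0000:00:02.0\x00")
    fmt.Println(deviceReady(parseUevent(raw))) // false: no DEVNAME yet
}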
Otherwise the channel we are waiting on is not the one we
put in the device watchers map.

Fixes: kata-containers#372

Signed-off-by: Peng Tao <bergwolf@gmail.com>
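The underlying bug is Go's variable-shadowing pitfall: a ':=' inside a block re-declares the channel, so the one stored in the watchers map is never the one being received on. A contrived illustration (names made up; this is not the PR's code):

package main

import "fmt"

// Buggy pattern: inside the branch, ':=' declares a brand-new channel,
// so the channel stored in the watchers map is never the one returned
// (and therefore never the one received on).
func waitChanBuggy(watchers map[string]chan string, addr string) chan string {
    notifyChan := make(chan string)
    if _, ok := watchers[addr]; !ok {
        notifyChan := make(chan string) // BUG: shadows the outer notifyChan
        watchers[addr] = notifyChan
    }
    return notifyChan // nothing will ever be sent on this channel
}

// Fixed pattern: plain '=' keeps a single channel, so the map entry and
// the returned channel are the same object.
func waitChanFixed(watchers map[string]chan string, addr string) chan string {
    notifyChan, ok := watchers[addr]
    if !ok {
        notifyChan = make(chan string)
        watchers[addr] = notifyChan
    }
    return notifyChan
}

func main() {
    watchers := make(map[string]chan string)
    fmt.Println(waitChanBuggy(watchers, "0000:00:02.0") == watchers["0000:00:02.0"]) // false
    fmt.Println(waitChanFixed(watchers, "0000:00:03.0") == watchers["0000:00:03.0"]) // true
}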
So that we make sure the device pci kobject shows up.

Signed-off-by: Peng Tao <bergwolf@gmail.com>
@bergwolf
Member Author

@devimc updated. PTAL.

@devimc left a comment

lgtm

@amshinde
Member

@sboeuf PTAL. I think rescanning the PCI bus is OK until we figure out the qemu issue with SHPC hotplug.

@sboeuf left a comment

Yes that's fine for now.

@sboeuf

sboeuf commented Sep 20, 2018

I've just restarted the metrics build.

@bergwolf
Member Author

The metrics failure doesn't seem to be related to this PR. I created kata-containers/tests#763 to track it.

00:32:45 ERROR: command perf not available

@gnawux
Member

gnawux commented Sep 21, 2018

Ignore the unrelated metrics failure.

Let's merge it.

@gnawux gnawux merged commit f699f87 into kata-containers:master Sep 21, 2018
@grahamwhaley
Contributor

Thanks @gnawux - already aware of the perf item - we have an issue over at kata-containers/ci#43 as well.
But, that is not an actual failure for the metrics right now - it stops us running a test (network performance) that we are not actually using as a pass/fail metric yet (so, yes, it is annoying and distracting that the error occurs, but it is not the actual failure problem).

Let me go see if I can find the Jenkins logs to check (as once you have closed the PR, the links vanish from the github view ;-))... here we go: http://jenkins.katacontainers.io/job/kata-metrics-agent-ubuntu-16-04-PR/44/console

Wow, metrics was having a bad day :-(

The important bit of the metrics is the bounds check at the end of the logs - the first thing to do with a metrics log is to check whether we got to the end and whether the check happened. Then you find a table like:

03:36:43 Report Summary:
03:36:43 +-----+----------------------+-------+--------+--------+-------+--------+--------+------+------+-----+
03:36:43 | P/F |         NAME         |  FLR  |  MEAN  |  CEIL  |  GAP  |  MIN   |  MAX   | RNG  | COV  | ITS |
03:36:43 +-----+----------------------+-------+--------+--------+-------+--------+--------+------+------+-----+
03:36:43 | *F* | boot-times           | 95.0% | 110.2% | 105.0% | 10.0% | 108.1% | 112.6% | 4.2% | 1.1% |  20 |
03:36:43 | *F* | memory-footprint     | 95.0% | 87.4%  | 105.0% | 10.0% | 87.4%  | 87.4%  | 0.0% | 0.0% |   1 |
03:36:43 | P   | memory-footprint-ksm | 95.0% | 97.2%  | 105.0% | 10.0% | 97.2%  | 97.2%  | 0.0% | 0.0% |   1 |
03:36:43 +-----+----------------------+-------+--------+--------+-------+--------+--------+------+------+-----+

What I've been seeing over the last week is too much variance in the metrics results, and I'm not sure why yet. I would not normally expect either memory footprint or boot times to vary so much - which is why I have the pass/fail boundaries set at 5% - but we see here that boot times were up 10%, and memory footprint was down 12.6%. I need to do some research into what is going on; I just haven't found time yet this week.

This is a key reason why the metrics CIs are not marked as 'required' items in Github yet...
Thanks for your patience ;-)

@sboeuf

sboeuf commented Sep 21, 2018

@bergwolf @amshinde @egernst @devimc
I have spent some time understanding what was going on with pc, q35, and the different ways PCI hotplug is performed.

pc case

  • This machine type uses ACPI to hotplug PCI devices. This means the whole notification mechanism relies on ACPI methods and it works well when hotplugging on the main bus (pci host bridge).
  • But why does hotplug onto a pci-bridge not work with shpc=off? The ACPI mechanism should still work, right? Yes, the ACPI mechanism notifying the guest OS about a new PCI device works properly and the PCI slot is rescanned. Unfortunately, the PCI spec defines that any PCI bridge created (coldplugged) with no device attached to it will not get any memory resources allocated for the future BARs of its devices. For this reason, an empty PCI bridge is unusable from a PCI hotplug perspective, and the guest OS generates this error:
[    8.424258] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00040000 pref]
[    8.425179] pci 0000:01:00.0: BAR 6: failed to assign [mem size 0x00040000 pref]
[    8.425274] pci 0000:01:00.0: BAR 4: no space for [mem size 0x00004000 64bit pref]
[    8.425353] pci 0000:01:00.0: BAR 4: failed to assign [mem size 0x00004000 64bit pref]
[    8.425434] pci 0000:01:00.0: BAR 1: no space for [mem size 0x00001000]
[    8.425501] pci 0000:01:00.0: BAR 1: failed to assign [mem size 0x00001000]
[    8.425568] pci 0000:01:00.0: BAR 0: no space for [io  size 0x0020]
[    8.425635] pci 0000:01:00.0: BAR 0: failed to assign [io  size 0x0020]
[    8.472148] PCI Interrupt Link [LNKA] enabled at IRQ 11
[    8.472498] virtio-pci 0000:01:00.0: virtio_pci: leaving for legacy driver
[    8.493270] virtio-pci: probe of 0000:01:00.0 failed with error -12
  • But then, how does it work with shpc=on? Well, by enabling shpc on the PCI bridge, we create an shpc controller in slot 0 of this PCI bridge, and this allocates an shpc BAR, which forces the bridge not to be empty, hence usable for later hotplug. Here is the output from the kernel when the PCI device is properly added:
[    8.512248] pci 0000:01:00.0: BAR 6: assigned [mem 0x80000000-0x8003ffff pref]
[    8.512417] pci 0000:01:00.0: BAR 4: assigned [mem 0x80040000-0x80043fff 64bit pref]
[    8.512597] pci 0000:01:00.0: BAR 1: assigned [mem 0x80044000-0x80044fff]
[    8.512795] pci 0000:01:00.0: BAR 0: assigned [io  0xc000-0xc01f]
[    8.535774] PCI Interrupt Link [LNKA] enabled at IRQ 11
[    8.536077] virtio-pci 0000:01:00.0: enabling device (0000 -> 0003)
[    8.558381] virtio_net virtio3 enp1s0: renamed from eth0
[    8.597475] IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
[    9.564367] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready

So that's the explanation for pc, and why we need shpc to be enabled on any pci-bridge that we create (a configuration sketch follows the note below).

Note: coldplugging a PCI device onto the PCI bridge will have the same effect, forcing the PCI bridge to have its own BAR space, and further hotplug will work (unless you're hotplugging a PCI device with large BARs).
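To make the bridge setup above concrete, here is a sketch of the kind of QEMU arguments involved - the machine type is pc as discussed, but the bus, id, chassis_nr, and addr values are made up for illustration and are not virtcontainers' actual command line:

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Illustrative pc-machine QEMU arguments: coldplugging the bridge
    // with shpc=on puts the SHPC controller in slot 0, which reserves
    // BAR space and keeps the otherwise-empty bridge usable for later
    // PCI hotplug.
    args := []string{
        "-machine", "pc",
        "-device", "pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2",
    }
    fmt.Println("qemu-system-x86_64 " + strings.Join(args, " "))
}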

q35 case

  • There is no ACPI involved in the handling of PCI hotplug in the q35 case. It relies either on shpc (for PCI devices) or on PCIe native hotplug.
  • Because we use a pci-bridge, shpc is going to be used in this case. But it will be used fully, not only for the "PCI bridge trick": it provides the mechanism for communicating with the guest OS, which is why the guest OS needs to support it (and why we enabled CONFIG_HOTPLUG_PCI_SHPC).
  • The problem is the way shpc works and the assumptions made by the shpc driver, drivers/pci/hotplug/shpchp_ctrl.c. When the driver detects that a device has been hotplugged, it eventually calls into handle_button_press_event(), which delays queuing the work for 5 seconds... yeah, 5 seconds! We can understand this from a real HW perspective, but for virtualization it does not make sense... Unless we patch the driver itself (and I don't think we should do that), the PCI rescan looks like the best optimization we can use here (see the sketch below). Indeed, the PCI device is already there; we're simply waiting for the driver to resume its work.
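Putting the two fixes together, the agent-side flow could look roughly like this (illustrative names, not the PR's exact code): trigger the rescan first, then block on the uevent-driven watcher channel instead of polling sysfs, so the guest never sits out the shpc driver's 5 second delay.

package main

import (
    "fmt"
    "io/ioutil"
    "time"
)

// rescanPciBus repeats the one-liner sketched earlier so this example
// is self-contained. Writing the trigger requires root.
func rescanPciBus() error {
    return ioutil.WriteFile("/sys/bus/pci/rescan", []byte("1"), 0200)
}

// waitForDeviceNotify kicks a PCI bus rescan, then waits for the
// uevent listener to announce the device on notifyChan.
func waitForDeviceNotify(notifyChan <-chan string, timeout time.Duration) (string, error) {
    if err := rescanPciBus(); err != nil {
        return "", err
    }
    select {
    case devName := <-notifyChan:
        return devName, nil
    case <-time.After(timeout):
        return "", fmt.Errorf("timed out waiting for device uevent")
    }
}

func main() {
    notify := make(chan string, 1)
    notify <- "vda" // simulate the uevent listener announcing a device
    name, err := waitForDeviceNotify(notify, time.Second)
    fmt.Println(name, err) // prints the rescan error when not run as root
}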

I hope this helps clarify things :)

@amshinde
Member

@sboeuf Thanks for summarising our findings. In conclusion:

  1. It always makes sense to leave SHPC on for a pci-bridge.
  2. With the current hotplug implementation for q35 in virtcontainers with a pci-bridge, we need to rescan the pci bus because of the 5 second delay in the shpc controller.
  3. Since this is not required for pc, we could do better and pass an extra parameter to the agent telling it whether to scan the bus. I am not sure how expensive scanning the bus would get when a large number of devices are attached.

Another point to consider with q35 hotplug is that by attaching pcie devices to a pci-bridge we lose the pcie capabilities of the device, since it is exposed as a plain pci device - that is, all the advantages of pcie we would get with the q35 platform. Hence, what we really need to do is revisit q35 hotplug in virtcontainers and implement hotplug using pcie root ports instead of a pci-bridge.

cc @devimc

@bergwolf
Member Author

@sboeuf Thanks for the detailed summary! And I agree with @amshinde that we should move to pcie root ports for q35, since pcie support is the main advantage of the q35 platform.

@gnawux
Member

gnawux commented Sep 22, 2018

Thanks @sboeuf, I agree we should move to pcie as well.
