
netkvm stops working, connectivity loss #402

Closed
peter-held opened this issue Jul 14, 2019 · 29 comments

@peter-held commented Jul 14, 2019

Hi,

Starting with qemu-3.1, the netkvm driver stops working on my setup, randomly but very soon after boot, on Windows 10. The network stops working and I cannot disable the network interface or driver (the disable operation waits forever).

Host OS:
Linux kvm 5.2.0-arch2-1-vfio #1 SMP PREEMPT Thu Jul 11 09:54:19 EEST 2019 x86_64 GNU/Linux

The problem is also present on older kernels (I observed it after upgrading to qemu 3.1).

Guest OS:
Windows 10 1809 LTSC (latest updates)

The problem is also present on older Windows versions (I observed it after upgrading to qemu 3.1).

Tried with virtio-win 0.1.160, 0.1.164, 0.1.171.

My VM config:

```sh
#!/bin/sh

VM=$(basename $0)

KVM_ROOT='/mnt/storage/kvm'
STORAGE_DIR="/dev/zvol/storage/kvm/vms/${VM}"
DEVS='01:00.0,01:00.1,0b:00.0,00:1a.0,0d:00.0,0d:00.1'

"$KVM_ROOT/scripts/unbind_devices" ${DEVS}

qemu-system-x86_64 \
  -daemonize \
  -pidfile "/run/qemu-${VM}.pid" \
  -nodefaults -no-user-config \
  -name "${VM}",process="${VM}" \
  -accel kvm \
  -boot menu=on \
  -rtc base=localtime,clock=vm,driftfix=slew \
  -global kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 \
  -no-hpet \
  -machine pc-i440fx-3.0,accel=kvm,usb=off,vmport=off \
  -m 8192 -mem-path /dev/hugepages -mem-prealloc \
  -realtime mlock=off \
  -cpu host,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,kvm=off,hv-vendor-id=none,hv-crash,hv-reset,hv-vpindex,hv-runtime,hv-synic,hv-stimer \
  -smp sockets=1,cores=2,threads=2 \
  -drive if=pflash,file="OVMF_CODE-pure-efi.fd",format=raw,readonly \
  -drive if=pflash,file="OVMF_VARS-pure-efi.fd",format=raw \
  -debugcon file:ovmf.log -global isa-debugcon.iobase=0x402 \
  -vga none \
  -nographic \
  -monitor unix:/run/qemu-${VM}.monitor,server,nowait \
  -serial none \
  -parallel none \
  -chardev socket,id=qga0,path=/run/qemu-${VM}.agent,server,nowait \
  -device virtio-serial \
  -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 \
  -device vfio-pci,host=01:00.0,bus=pci.0,addr=0x7,multifunction=on,x-no-kvm-intx=on \
  -device vfio-pci,host=01:00.1,bus=pci.0,addr=0x7.1,x-no-kvm-intx=on \
  -device vfio-pci,host=0b:00.0,bus=pci.0,x-no-kvm-intx=on \
  -device vfio-pci,host=00:1a.0,bus=pci.0,x-no-kvm-intx=on \
  -device vfio-pci,host=0d:00.0,bus=pci.0 \
  -device vfio-pci,host=0d:00.1,bus=pci.0 \
  -usb \
  -device usb-host,id='Logitech_Inc_G502_Proteus_Spectrum_Optical_Mouse',vendorid=0x046d,productid=0xc332 \
  -device usb-host,id='Microsoft_Corp_Natural_Ergonomic_Keyboard_4000',vendorid=0x045e,productid=0x00db \
  -netdev tap,id=brlan,ifname=${VM},vhost=on,script=${KVM_ROOT}/scripts/vm_ifup_brlan \
  -device virtio-net-pci,netdev=brlan,mac=52:54:00:00:00:71,ioeventfd=on \
  -object iothread,id=iothread1 \
  -device virtio-scsi-pci,id=scsi0,iothread=iothread1,num_queues=4 \
  -drive id=drive0,if=none,file="${STORAGE_DIR}/system",format=raw,cache=none,aio=native,discard=unmap,detect-zeroes=on \
  -device scsi-hd,drive=drive0,scsi-id=0,bootindex=2 \
  -drive id=drive1,if=none,file="${STORAGE_DIR}/data",format=raw,cache=none,aio=native,discard=unmap,detect-zeroes=on \
  -device scsi-hd,drive=drive1,scsi-id=1 \
  ${OPTS}
```

Thanks.

@YanVugenfirer (Collaborator)

Hello Peter,

Can you please share the network topology of the host and elaborate on the host's network setup as well?

Also, could you share ${KVM_ROOT}/scripts/vm_ifup_brlan?

In general, the symptom of the driver failing to unload after network traffic has stopped indicates that a transmit packet was not returned by the host (one recent such case is issue #396).

Thanks,
Yan.

@peter-held (Author) commented Jul 14, 2019

On the host the interface vm is added to a bridge:

```
$ brctl show
bridge name     bridge id           STP enabled     interfaces
brlan           8000.8ae1463f3c29   no              enp10s0
                                                    enp2s0f0
                                                    hd1

$ ip link show
2: enp10s0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master brlan state UP mode DEFAULT group default qlen 1000
    link/ether bc:5f:f4:38:a3:3d brd ff:ff:ff:ff:ff:ff
4: enp2s0f0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master brlan state UP mode DEFAULT group default qlen 1000
    link/ether a0:36:9f:83:53:58 brd ff:ff:ff:ff:ff:ff
9: brlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 8a:e1:46:3f:3c:29 brd ff:ff:ff:ff:ff:ff
16: hd1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc fq_codel master brlan state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 8a:83:38:1e:03:f0 brd ff:ff:ff:ff:ff:ff

$ ip addr show
2: enp10s0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master brlan state UP group default qlen 1000
    link/ether bc:5f:f4:38:a3:3d brd ff:ff:ff:ff:ff:ff
4: enp2s0f0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master brlan state UP group default qlen 1000
    link/ether a0:36:9f:83:53:58 brd ff:ff:ff:ff:ff:ff
9: brlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8a:e1:46:3f:3c:29 brd ff:ff:ff:ff:ff:ff
    inet 172.28.10.5/16 brd 172.28.255.255 scope global brlan
       valid_lft forever preferred_lft forever
16: hd1: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc fq_codel master brlan state UNKNOWN group default qlen 1000
    link/ether 8a:83:38:1e:03:f0 brd ff:ff:ff:ff:ff:ff
```

$ cat vm_ifup_brlan

```sh
#!/bin/sh

BRIDGE=brlan

echo "Executing $0"

echo "Bringing up $1 for bridged mode..."
ip link set "$1" up promisc on

echo "Disabling STP for bridge $BRIDGE"
brctl stp "$BRIDGE" off

echo "Adding $1 to $BRIDGE ..."
brctl addif "$BRIDGE" "$1"

sleep 2
```
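As an aside, brctl (from bridge-utils) is deprecated on many modern distributions. The same ifup logic can be written with iproute2 only; this is a sketch, not part of the original setup, and it assumes the bridge brlan already exists as shown above (it is a host-configuration fragment and must run as root):

```sh
#!/bin/sh
# Sketch: iproute2-only equivalent of vm_ifup_brlan.
# Assumes the bridge "brlan" already exists on the host.

BRIDGE=brlan
TAP="$1"

echo "Bringing up $TAP for bridged mode..."
ip link set "$TAP" up promisc on

echo "Disabling STP for bridge $BRIDGE"
ip link set "$BRIDGE" type bridge stp_state 0

echo "Adding $TAP to $BRIDGE ..."
ip link set "$TAP" master "$BRIDGE"

sleep 2
```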

Thanks.

@zaltysz commented Jul 29, 2019

Peter,
I have a similar problem: https://bugs.launchpad.net/qemu/+bug/1811533 . Look at "observations" in the bug report. Do they apply to your system too?

@peter-held (Author)

I don't have the log messages.
Yes, only machine types older than pc-i440fx-3.1 (pc-i440fx-3.0 and earlier) work for me.
I will try without vhost and hv_stimer.

Thanks.

@peter-held (Author)

Yes, it seems to be the same problem.

Tested with qemu 4.0:
vhost=off & hv-stimer - working
vhost=on & no hv-stimer - working
vhost=on & hv-stimer - not working

@ybendito (Collaborator) commented Oct 8, 2019

@peter-held Is it possible to check with addition of 'hv-stimer-direct'?

@peter-held (Author)

Sure, which qemu version?

@ybendito (Collaborator) commented Oct 8, 2019

If your qemu version does not contain 'hv-stimer-direct', let's postpone this check for now.
Instead, let's try removing ioeventfd=on from virtio-net-pci, or setting it to off.

@peter-held (Author)

At the moment I'm using the following config (very stable):

```sh
#!/bin/sh

VM=$(basename $0)

KVM_ROOT='/srv/vms/deps'
STORAGE_DIR="/dev/zvol/rpool/srv/vms/${VM}"

DEVS='01:00.0,01:00.1'

"$KVM_ROOT/scripts/unbind_devices" ${DEVS}

qemu-system-x86_64 \
  -daemonize \
  -pidfile "/run/qemu-${VM}.pid" \
  -nodefaults -no-user-config \
  -name "${VM}",process="${VM}",debug-threads=on \
  -accel kvm \
  -boot menu=on \
  -rtc base=localtime,clock=vm,driftfix=slew \
  -global kvm-pit.lost_tick_policy=discard -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 \
  -global pcie-root-port.speed=8 \
  -global pcie-root-port.width=16 \
  -no-hpet \
  -machine pc-i440fx-3.0,accel=kvm,usb=off,vmport=off \
  -m 8192 -mem-path /dev/hugepages -mem-prealloc \
  -realtime mlock=off \
  -cpu host,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,kvm=off,hv-vendor-id=none,hv-crash,hv-reset,hv-vpindex,hv-runtime,hv-synic,hv-stimer \
  -smp sockets=1,cores=2,threads=2 \
  -drive if=pflash,file="OVMF_CODE-pure-efi.fd",format=raw,readonly \
  -drive if=pflash,file="OVMF_VARS-pure-efi.fd",format=raw \
  -debugcon file:ovmf.log -global isa-debugcon.iobase=0x402 \
  -vga none \
  -nographic \
  -monitor unix:"/run/qemu-${VM}.monitor",server,nowait \
  -serial none \
  -parallel none \
  -chardev socket,id=qga0,path="/run/qemu-${VM}.agent",server,nowait \
  -device virtio-serial \
  -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0 \
  -device vfio-pci,host=01:00.0,bus=pci.0,addr=0x7,multifunction=on,x-no-kvm-intx=on \
  -device vfio-pci,host=01:00.1,bus=pci.0,addr=0x7.1,x-no-kvm-intx=on \
  -usb \
  -device usb-host,id='usb_keyboard',vendorid=0x0111,productid=0x1111 \
  -device usb-host,id='usb_mouse',vendorid=0x0111,productid=0x2222 \
  -netdev tap,id=brlan,ifname="${VM}",vhost=on,script="${KVM_ROOT}/scripts/vm_ifup_brlan" \
  -device virtio-net-pci,netdev=brlan,mac=52:54:11:00:00:11,ioeventfd=on \
  -object iothread,id=iothread1 \
  -device virtio-scsi-pci,id=scsi0,iothread=iothread1,num_queues=4 \
  -drive id=drive0,if=none,file="${STORAGE_DIR}/system",format=raw,cache=none,aio=native,discard=unmap,detect-zeroes=on \
  -device scsi-hd,drive=drive0,scsi-id=0,bootindex=2 \
  -drive id=drive1,if=none,file="${STORAGE_DIR}/data",format=raw,cache=none,aio=native,discard=unmap,detect-zeroes=on \
  -device scsi-hd,drive=drive1,scsi-id=1 \
  ${OPTS}
```

If I change ioeventfd to off in the current config, it still works.

@ybendito (Collaborator) commented Oct 8, 2019

@peter-held Can you please clarify "I'm using the following config (very stable):"?
Does it mean the problem does not happen here despite 'hv-stimer'?

@peter-held (Author) commented Oct 8, 2019

I tested some time ago and wrote in a previous post:

Tested with qemu 4.0:
vhost=off & hv-stimer - working
vhost=on & no hv-stimer - working
vhost=on & hv-stimer - not working

@ybendito (Collaborator)

@peter-held Can you please clarify: vhost=on & hv-stimer & ioeventfd=off - working or not?

@peter-held (Author)

Hi,

ioeventfd makes no difference:
vhost=on & hv-stimer & ioeventfd=off - not working
vhost=on & hv-stimer & ioeventfd=on - not working

Thanks.

@ybendito (Collaborator)

@peter-held Did you try to remove "-mem-path /dev/hugepages -mem-prealloc"?

@peter-held (Author)

No, why would I want that? It will slow down the VM.

Using -machine pc-i440fx-3.0, it works well with all the options enabled (vhost, hv-stimer, ioeventfd, hugepages).

If you want me to test some scenario, just tell me the options and I will test.

Thanks.

@ybendito (Collaborator)

Please clarify again: according to your previous message:
pc-i440fx-3.0, vhost + hv-stimer + huge pages = does not work.
pc-i440fx-3.0, vhost + hv-stimer - huge pages = does?

@peter-held (Author)

pc-i440fx-3.0, vhost + hv-stimer - huge pages

I have not tested this. I don't want to run the vm without huge pages.
Still, if it is useful for you to fix this bug, I will test.

@ybendito (Collaborator)

@peter-held I think this test is very important for making progress.

@peter-held (Author)

If I remove hugepages, then it works fine.

So, the combination that is not working is:
vhost + hv-stimer + huge pages

@ybendito (Collaborator)

Indeed, so this is more or less similar to https://bugs.launchpad.net/qemu/+bug/1811533 (but in our case it happens with pc-i440fx-3.0, and their problem happens with pc-q35-3.1). We will continue investigating the problem.

@peter-held (Author)

No, the problem is the same, starting with 3.1 (please see my first post).

3.0 is stable.

@ybendito (Collaborator)

You're right, so I suggest trying "x-hv-synic-kvm-only" with vhost + hv-stimer + huge pages.

@peter-held (Author)

Which qemu version implements this flag?
Can you please give me a short description of it?

@zaltysz commented Oct 17, 2019

x-hv-synic-kvm-only should be added to qemu's -cpu parameter, e.g.: -cpu host,+x-hv-synic-kvm-only,...

It has been available since qemu 3.1 and switches SynIC back to the qemu 3.0 behavior. So far, this is the cheapest way to work around this problem.
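For illustration, applying the workaround to the -cpu line from the config posted earlier in this thread would look like the fragment below (the hv-* flag list is copied from that config; only x-hv-synic-kvm-only is added, and the line is split with continuations for readability). This is a command-line fragment, not a complete invocation:

```sh
# -cpu fragment with the SynIC workaround: hv-synic and hv-stimer stay
# enabled, and x-hv-synic-kvm-only restores the qemu 3.0 SynIC behavior.
-cpu host,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,kvm=off,\
hv-vendor-id=none,hv-crash,hv-reset,hv-vpindex,hv-runtime,\
hv-synic,hv-stimer,x-hv-synic-kvm-only
```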

@peter-held (Author)

Hi,

yes, if I add 'x-hv-synic-kvm-only', the problem does not manifest.

Is this a bug in qemu or in the Windows netkvm drivers?

Thanks.

@ybendito (Collaborator) commented Oct 17, 2019

We do not believe it is related to the Windows drivers; it seems to be a complicated conflict involving qemu and the host kernel. We'll do our best to contact the relevant people.

@peter-held (Author)

Thank you.

@ybendito (Collaborator) commented Jan 3, 2020

Closing the issue; please reopen if needed.
