[Qemu-discuss] vhost_net: VM looses network when using vhost over time
From: Bernd Naumann
Subject: [Qemu-discuss] vhost_net: VM looses network when using vhost over time
Date: Wed, 20 Sep 2017 14:44:54 +0000 (UTC)
Hi all,
We have encountered a bug that occurs repeatedly, but we do not know how to
trigger it reliably or how to start debugging it in the first place.
# Background
In our setup we have a Ganeti cluster (KVM) with currently ~60 nodes running
~500 VMs. We use tap interfaces on L2 bridges, L3 routed tap interfaces, and
tap interfaces on a bridge with a VTEP attached to it. (For the VXLAN setup we
have a home-grown daemon to maintain the FDB.)
# The issue
On some VMs we lose network connectivity under unknown circumstances.
"Losing" means that the VM is not reachable and therefore cannot reach any
other host in the network.
However, with `tcpdump` on the host (physical NIC + bridge) we can see the
traffic going in, while with `tcpdump` on the VM we only see ARP coming in and
nothing going out. Manually setting the ARP entry, or toggling ARP with
`ip link set $DEV arp off; ip link set $DEV arp on`, does not help at all or
helps only for a moment. The only way we have found to "fix" it is to reboot
the VM or to run `modprobe -r virtio_net; modprobe virtio_net`, but this does
not seem to be a good workaround either, and the problem can return shortly
afterwards. It is also difficult to determine when the issue kicks in.
Counting 'FAILED' neighbors is an indicator, but nothing to rely on.
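The symptom described above (packets still arrive in the guest, but nothing leaves) could in principle be watched for from inside the VM by sampling the interface counters. The following is only a minimal sketch, not something we run in production: the function names and the `min_rx` threshold are assumptions made for illustration, and the counters are read from the standard sysfs files `/sys/class/net/<dev>/statistics/rx_packets` and `tx_packets`.

```python
from pathlib import Path


def read_counters(dev, root="/sys/class/net"):
    """Return (rx_packets, tx_packets) for one interface, read from sysfs."""
    stats = Path(root) / dev / "statistics"
    rx = int((stats / "rx_packets").read_text())
    tx = int((stats / "tx_packets").read_text())
    return rx, tx


def looks_stalled(before, after, min_rx=10):
    """True if at least min_rx packets arrived but none were sent.

    before/after are (rx_packets, tx_packets) tuples taken some seconds
    apart. min_rx is an arbitrary illustrative threshold (assumption).
    """
    rx_delta = after[0] - before[0]
    tx_delta = after[1] - before[1]
    return rx_delta >= min_rx and tx_delta == 0
```

A caller would take two `read_counters("eth0")` samples a few seconds apart and pass them to `looks_stalled`. An idle interface can also show zero transmitted packets, so a positive result is a hint, not proof of the bug.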
The frequency of the issue ranges from once every few days to multiple times
per day, or even a few minutes after boot. We see the most impact on VMs with
higher network traffic, like our gateway VMs (multiple NICs in different
networks, IPsec, iptables, ...) and HAProxy VMs (similar to our gateways), but
also (with reduced frequency) on /normal/ application VMs.
From what we have found so far, it looks similar to:
* https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978 -- Bug #997978
“KVM images lose connectivity with bridged network” : Bugs : qemu-kvm package :
Ubuntu
* https://bugs.centos.org/view.php?id=5526 -- 0005526: KVM Guest with virtio
network loses network connectivity - CentOS Bug Tracker
Via `rtmon` we can observe that it starts with some "FAILED" neighbor entries
and that their number increases over time. Since this is only a consequence of
ARP replies not being sent to the requester, or of our own ARP requests going
unanswered (because the packets do not leave the VM), the increasing count of
'FAILED' neighbors is /normal/. BUT: this can start on any interface, whether
bridged tap interface for WAN, bridged tap in VXLAN, or routed tap. It does not
appear to be linked to the "kind" of interface.
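To make the per-interface count of 'FAILED' neighbors easier to track, the output of `ip neigh show` can be tallied by device. This is a rough sketch under the assumption that each line has the usual shape `<address> dev <ifname> [lladdr <mac>] [router] <STATE>`; the function name and the sample addresses are made up for illustration.

```python
from collections import Counter


def count_failed_neighbors(ip_neigh_output):
    """Count FAILED neighbor entries per device in `ip neigh show` output."""
    failed = Counter()
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Assumed shape: <address> dev <ifname> [lladdr <mac>] [router] <STATE>
        if len(fields) >= 4 and fields[1] == "dev" and fields[-1] == "FAILED":
            failed[fields[2]] += 1
    return dict(failed)


# Made-up sample input, not captured from an affected VM.
SAMPLE = """\
10.0.0.1 dev eth0 lladdr 52:54:00:12:34:56 REACHABLE
10.0.0.7 dev eth0  FAILED
10.0.1.9 dev eth1  FAILED
10.0.1.3 dev eth1 lladdr 52:54:00:ab:cd:ef STALE
10.0.0.8 dev eth0  FAILED
"""

print(count_failed_neighbors(SAMPLE))  # {'eth0': 2, 'eth1': 1}
```

Running this periodically and comparing the counts shows on which interface the entries start to pile up, although, as noted, the count is a consequence of the problem and not a reliable trigger.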
# General overview of the setup
* Ganeti cluster with ~60 nodes
* each node has 2 x 50G (mlnx5 dual-port) connected to 2 x MLNX SN2700 switches
* each node runs `bird` with OSPF and ECMP (and OSPF with ECMP on SN2700 too)
* each VM has one or more vNICs in a bridged or routed network
* networks: bridged tap in WAN; bridged tap with attached VTEP; routed tap
* host OS: Ubuntu 16.04.3 with Ubuntu kernel 4.12.13; first tested with
qemu-kvm 1:2.5+dfsg-5ubuntu10.15, later upgraded to qemu-kvm
2.10~rc3+dfsg-0ubuntu1, same issue; guest OS: Ubuntu 14.04, Ubuntu 16.04, and
Ubuntu 16.04 with the latest Ubuntu mainline kernel PPA
# So far we can "verify" it is 'vhost'
Without "vhost=on" for the kvm process we cannot observe this issue. With
"vhost=on", an affected VM can be "fixed" by reloading the `virtio_net` module
(`rmmod`, then `insmod`), but a reboot seems to provide a "fix" for a "longer"
period. (But as you may know, virtio without vhost does not deliver the
performance we expect.)
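When comparing affected and unaffected VMs, it helps to confirm which kvm processes were actually started with "vhost=on". The sketch below parses a QEMU argument list for `-netdev` options and reports, per netdev id, whether the literal `vhost=on` is present. The function names are illustrative assumptions, the older `-net tap,...` syntax is not handled, and only the exact spelling `vhost=on` is recognized.

```python
def netdevs_with_vhost(argv):
    """Map each -netdev id to True/False depending on whether vhost=on is set."""
    result = {}
    for flag, value in zip(argv, argv[1:]):
        if flag != "-netdev":
            continue
        opts = {}
        for part in value.split(","):
            key, sep, val = part.partition("=")
            # A bare token such as "tap" is treated as the backend type.
            if sep:
                opts[key] = val
            else:
                opts["type"] = key
        result[opts.get("id", "?")] = opts.get("vhost") == "on"
    return result


def argv_from_proc_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    return [arg.decode() for arg in raw.split(b"\0") if arg]
```

On a host, the raw bytes would come from reading `/proc/<pid>/cmdline` of the kvm process in binary mode and passing them through `argv_from_proc_cmdline` first.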
So we have some questions:
* How can we debug the main issue to provide a meaningful bug report? We could
enable debug flags in the kernel, but where would we attach gdb? Sadly we are
no kernel hackers :/, but we can compile our own kernel and qemu-kvm to test
release candidates and/or apply patches.
* Has anyone else seen this? Can anyone provide a better workaround, a patch,
or anything else?
* Where should we file/reopen this issue? qemu, netdev?
* Is qemu-kvm even the right place to look for answers?
We are happy to provide more information or collect debug information if
someone wants to investigate.
Thanks for your time!
Best,
Bernd Naumann
Spreadshirt
Bernd Naumann
Systems Engineer, Networking & Operations
address@hidden
http://www.spreadshirt.com
sprd.net AG
Gießerstraße 27
D-04229 Leipzig
Fon: +49 341 594 00 - 5900
Fax: +49 341 594 00 - 5149
Vorstand / executive board: Philip Rooke (CEO/Vorsitzender) · Tobias Schaugg
Aufsichtsratsvorsitzender / chairman of the supervisory board: Lukasz Gadowski
Handelsregister / trade register: Amtsgericht Leipzig, HRB 22478
Umsatzsteuer-IdentNummer / VAT-ID: DE 8138 7149 4