[Qemu-devel] [PATCH 6/6] RFH: We lost "connect" events
From: Juan Quintela
Subject: [Qemu-devel] [PATCH 6/6] RFH: We lost "connect" events
Date: Wed, 14 Aug 2019 04:02:18 +0200
When we have lots of channels, sometimes multifd migration fails
with the following error:
(qemu) migrate -d tcp:0:4444
(qemu) qemu-system-x86_64: multifd_send_pages: channel 17 has already quit!
qemu-system-x86_64: multifd_send_pages: channel 17 has already quit!
qemu-system-x86_64: multifd_send_sync_main: multifd_send_pages fail
qemu-system-x86_64: Unable to write to socket: Connection reset by peer
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off
compress: off events: off postcopy-ram: off x-colo: off release-ram: off block:
off return-path: off pause-before-switchover: off multifd: on dirty-bitmaps:
off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off
Migration status: failed (Unable to write to socket: Connection reset by peer)
total time: 0 milliseconds
In this particular example I am using 100 channels. The bigger the
number of channels, the easier it is to reproduce. That doesn't
mean that it is a good idea to use so many channels.
With the previous patches in this series, I can run "reliably" on my
hardware with up to 10 channels. Most of the time. Until it fails.
With 100 channels, it fails almost always.
I thought that the problem was on the send side, so I tried to debug
there. As you can see from the delay added by this patch, the problem
is very timing sensitive: if you add any printf()/error_report()/trace,
the error can go away. With a delay of 10000 microseconds, it only
works sometimes.
What have I discovered so far:
- send side calls qemu_socket() on all the channels. So it appears
  that they get created correctly.
- on the destination side, it appears that "somehow" some of the
  connections are lost by the listener. This error happens when the
  destination side socket hasn't been "accepted", and so it is never
  properly created. As far as I can see, we have several options:
  1- I don't know how to use qio asynchronously properly
     (this is one big possibility).
  2- glib has a bug in this case? Or the way the qio listener is
     implemented on top of glib does. I put lots of printf() and other
     instrumentation in, and it appears that the listener io_func is
     not called at all for the connections that are missing.
  3- it is always possible that we are missing some g_main_loop_run()
     somewhere. Notice how test/test-io-channel-socket.c calls it
     "creatively" (see the sketch right after this list).
  4- it is entirely possible that I should be using the sockets as
     blocking instead of non-blocking. But I am not sure about that
     one yet.
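
For reference, here is a minimal sketch (mine, not the actual test
code) of the kind of main-context pumping that
test/test-io-channel-socket.c does by hand. If the listener watch is
attached to the default GLib context, its "incoming connection"
callback only runs when somebody actually iterates that context:

/* Sketch: dispatch every source that is currently ready on the
 * default GLib main context, without blocking.  If nobody ever does
 * this (or runs g_main_loop_run()), a listener watch attached to that
 * context never gets its io_func invoked, which would look exactly
 * like "lost" connect events.
 */
#include <glib.h>

static void pump_default_context(void)
{
    while (g_main_context_iteration(g_main_context_default(), FALSE)) {
        /* keep dispatching until nothing is pending */
    }
}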
- on the sending side, what happens is:
  it eventually calls socket_connect() after all the async dance with
  thread creation, etc. The source side creates all the channels; it
  is the destination side which is missing some of them.
  The sending side sends the first packet over that channel, and it
  "succeeds" without giving any error (see the sketch below for why
  that can happen even without an accept() on the other end).
  After some time, the sending side sends another packet through that
  channel, and that is when we get the above error.
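
For what it is worth, that first-packet behaviour is consistent with
plain TCP semantics: connect() completes as soon as the kernel
finishes the handshake and parks the connection on the listen
backlog, even if user space never calls accept(), and a small first
write() simply lands in the socket send buffer. A standalone sketch
(plain POSIX sockets, nothing QEMU-specific, error checking omitted)
showing both succeeding with no accept() at all:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(4444),          /* port from the example above */
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
    };

    /* Listener: bind and listen, but never accept(). */
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
    listen(lsock, 1);

    /* Client: both connect() and the first write() succeed anyway. */
    int csock = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(csock, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        ssize_t n = write(csock, "hello", 5);
        printf("connect ok, first write returned %zd\n", n);
    }

    close(csock);
    close(lsock);
    return 0;
}

So the error only surfacing on a later packet would match the
destination tearing down a connection that it never accepted.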
Any good ideas?
Later, Juan.
PS: The command line used is attached.
Important bits:
- multifd is set
- multifd_channels is set to 100
/scratch/qemu/fail/x64/x86_64-softmmu/qemu-system-x86_64 -M
pc-i440fx-3.1,accel=kvm,usb=off,vmport=off,nvdimm -L
/mnt/code/qemu/check/pc-bios/ -smp 2 -name t1,debug-threads=on -m 3G
-uuid 113100f9-6c99-4a7a-9b78-eb1c088d1087 -monitor stdio -boot
strict=on -drive
file=/mnt/images/test.img,format=qcow2,if=none,id=disk0 -device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=disk0,id=virtio-disk0,bootindex=1
-netdev tap,id=hostnet0,script=/etc/kvm-ifup,downscript= -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:9d:10:51,bus=pci.0,addr=0x3
-serial pty -parallel none -usb -device usb-tablet -k es -vga cirrus
--global migration.x-multifd=on --global
migration.multifd-channels=100 -trace events=/home/quintela/tmp/events
CC: Daniel P. Berrangé <address@hidden>
Signed-off-by: Juan Quintela <address@hidden>
---
migration/ram.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/migration/ram.c b/migration/ram.c
index 25a211c3fb..50586304a0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1248,6 +1248,7 @@ int multifd_save_setup(void)
         p->packet = g_malloc0(p->packet_len);
         p->name = g_strdup_printf("multifdsend_%d", i);
         socket_send_channel_create(multifd_new_send_channel_async, p);
+        usleep(100000); /* XXX debug aid: stagger channel creation */
     }
     return 0;
 }
--
2.21.0