Re: recent flakiness (intermittent hangs) of migration-test
From: Peter Xu
Subject: Re: recent flakiness (intermittent hangs) of migration-test
Date: Fri, 30 Oct 2020 09:53:50 -0400
On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > Peter, is it possible that you enable QTEST_LOG=1 in your future
> > migration-test testcase and try to capture the stderr? With the help of
> > commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> > 2020-10-26), the test should be able to dump quite some helpful
> > information to further identify the issue.
>
> Here's the result of running just the migration test with
> QTEST_LOG=1:
> https://people.linaro.org/~peter.maydell/migration.log
> It's 300MB because when the test hangs one of the processes
> is apparently in a polling state and continues to send status
> queries.
>
> My impression is that the test is OK on an unloaded machine but
> more likely to fail if the box is doing other things at the
> same time. Alternatively it might be a 'parallel make check' bug.
Thanks for collecting that, Peter.
I'm copy-pasting the important information out here (with some lines moved
and indented to make things clearer):
...
{"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
{"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}, "id": "recover-cmd"}
{"execute": "query-migrate"}
...
{"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
qemu-system-x86_64: ram_save_queue_pages no previous block
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
{"return": {}}
{"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}}
...
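For anyone else who wants to dig through a log of that size, here is a rough
sketch of how the repeated status polling could be filtered out so that lines
like the ones above are easier to spot. It assumes one QMP message per line in
the captured stderr; the function name interesting_lines is made up for
illustration and is not part of the test suite. Non-JSON lines (such as the
qemu-system-x86_64 messages) are kept as they are.

```python
import json


def interesting_lines(lines):
    """Yield log lines, dropping the repeated 'query-migrate' commands.

    Assumption: each QMP message sits on a single line. Anything that
    does not parse as JSON is treated as a plain stderr message and kept.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        try:
            msg = json.loads(line)
        except ValueError:
            # Not JSON (e.g. a QEMU error message): keep it unless blank.
            if line.strip():
                yield line
            continue
        # Skip only the polling command itself; keep everything else.
        if isinstance(msg, dict) and msg.get("execute") == "query-migrate":
            continue
        yield line
```

Passing an open file object for a local copy of the log to
interesting_lines() and printing what it yields should be enough to get a
much shorter view. Replies to the polling command are not filtered here,
since their exact shape in the log is not shown above.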
The problem is probably a misuse of last_rb on the destination node. While
looking at it, I also found a race, so I guess I should fix both...
Peter, would it be easy for you to try applying the two attached patches to
see whether the test hang is resolved? Dave, feel free to give early comments
on the two fixes too, before I post them on the list.
Thanks!
--
Peter Xu
0001-migration-Unify-reset-of-last_rb-on-destination-node.patch
Description: Text document
0002-migration-Postpone-the-kick-of-the-fault-thread-afte.patch
Description: Text document