Re: recent flakiness (intermittent hangs) of migration-test


From: Peter Xu
Subject: Re: recent flakiness (intermittent hangs) of migration-test
Date: Fri, 30 Oct 2020 09:53:50 -0400

On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > Peter, is it possible that you enable QTEST_LOG=1 in your future
> > migration-test testcase and try to capture the stderr?  With the help of
> > commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> > 2020-10-26), the test should be able to dump quite some helpful
> > information to further identify the issue.
> 
> Here's the result of running just the migration test with
> QTEST_LOG=1:
> https://people.linaro.org/~peter.maydell/migration.log
> It's 300MB because when the test hangs one of the processes
> is apparently in a polling state and continues to send status
> queries.
> 
> My impression is that the test is OK on an unloaded machine but
> more likely to fail if the box is doing other things at the
> same time. Alternatively it might be a 'parallel make check' bug.

Thanks for collecting that, Peter.

I'm copy-pasting the important information out here (with some moves and
indents to make things even clearer):

...
{"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
{"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}}
{"return": {}, "id": "recover-cmd"}
{"execute": "query-migrate"}
...
{"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
qemu-system-x86_64: ram_save_queue_pages no previous block
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
{"return": {}}
{"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}}
...
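
For anyone else digging through a log of that size, here is a minimal sketch
of how the status-polling noise could be filtered out.  It is not part of the
test suite.  It assumes that each QMP message sits on a single line in the
raw log and that any reply carrying a "status" key answers a query-migrate
command; the function name interesting_lines is made up for illustration.

```python
import json


def interesting_lines(lines):
    """Yield log lines that are not part of query-migrate status polling."""
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # skip blank lines
        try:
            msg = json.loads(text)
        except ValueError:
            # Not JSON, so presumably plain stderr output from QEMU: keep it.
            yield text
            continue
        if not isinstance(msg, dict):
            yield text
            continue
        # Drop the polling command itself.
        if msg.get("execute") == "query-migrate":
            continue
        # Assumption: a reply with a "status" key is a query-migrate reply.
        ret = msg.get("return")
        if isinstance(ret, dict) and "status" in ret:
            continue
        yield text
```

Passing an open file object for the downloaded log to interesting_lines() and
printing each result should leave the commands, events and error messages,
without the repeated status queries.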

The problem is probably a misuse of last_rb on the destination node.  While
looking at it, I also found a race.  So I guess I should fix both...

Peter, would it be easy to try applying the two patches I attached to see
whether the test hang is resolved?  Dave, feel free to give early comments on
the two fixes too, before I post them on the list.

Thanks!

-- 
Peter Xu

Attachment: 0001-migration-Unify-reset-of-last_rb-on-destination-node.patch
Description: Text document

Attachment: 0002-migration-Postpone-the-kick-of-the-fault-thread-afte.patch
Description: Text document

