From: Claudio Fontana
Subject: Re: migration: broken snapshot saves appear on s390 when small fields in migration stream removed
Date: Mon, 20 Jul 2020 20:24:11 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1
I have now been able to reproduce this on x86 as well.
It happens much more rarely there, about once every 10 runs.
I will sort out the data and try to make it even more reproducible, then post
my findings in detail.

Overall I proceeded as follows:

1) hooked the savevm code to skip all fields except "s390-skeys", so that only
   s390-skeys are actually saved (rough sketch of the hook below);

2) reimplemented "s390-skeys" as a common implementation in cpus.c, used on
   both x86 and s390, modeling the behaviour of save/load from hw/s390;

3) ran ./check -qcow2 267 on both x86 and s390.

On s390, the failure seems reproducible 100% of the time.
On x86, as mentioned, it fails about 10% of the time.
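
For reference, the hook in 1) is conceptually just a filter on the section
name in the savevm loop. The sketch below is illustrative only: the helper
name is made up, and the loop/field names follow migration/savevm.c as I read
it, so details may differ from the actual patch.

    /* illustrative helper (made-up name): keep only the s390-skeys section
     * while walking the registered savevm handlers */
    static bool snapshot_debug_keep_section(const SaveStateEntry *se)
    {
        return strcmp(se->idstr, "s390-skeys") == 0;
    }

    /* ...then, in the loop that writes the device sections in
     * migration/savevm.c, skip everything else: */
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (!snapshot_debug_keep_section(se)) {
            continue;   /* drop this section from the migration stream */
        }
        /* ... existing save_section_header()/vmstate_save() calls ... */
    }
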
Ciao,
Claudio
On 7/16/20 2:58 PM, Claudio Fontana wrote:
> Small update on this,
>
> On 7/15/20 1:10 PM, Claudio Fontana wrote:
>> Hi Thomas,
>>
>> On 7/14/20 4:35 PM, Thomas Huth wrote:
>>> On 14/07/2020 16.29, Claudio Fontana wrote:
>>>> Hello,
>>>>
>>>> I have made some small progress in narrowing down this issue; it may be a
>>>> qcow2 issue (still unclear), so I am involving Kevin Wolf and Max Reitz.
>>>>
>>>>
>>>> The reproducer again:
>>>>
>>>>> --------------------------------------------cut-------------------------------------------
>>>>> diff --git a/cpus.c b/cpus.c
>>>>> index 41d1c5099f..443b88697a 100644
>>>>> --- a/cpus.c
>>>>> +++ b/cpus.c
>>>>> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>>>>>
>>>>> static bool icount_state_needed(void *opaque)
>>>>> {
>>>>> - return use_icount;
>>>>> + return 0;
>>>>> }
>>>>>
>>>>> static bool warp_timer_state_needed(void *opaque)
>>>>> --------------------------------------------cut-------------------------------------------
>>>>
>>>> This issue for now appears on s390 only:
>>>>
>>>> On s390 hardware, test 267 fails (both kvm and tcg) in the qcow2 backing
>>>> file part, with broken migration stream data in the s390-skeys vmsave (old
>>>> style).
>>> [...]
>>>> If someone has a good idea let me know - first attempts to reproduce on
>>>> x86 failed, but maybe more work could lead to it.
>>>
>>
>> small update: in the GOOD case (enough padding added) a qcow_merge() is
>> triggered for the last write of 16202 bytes.
>> In the BAD case (not enough padding added) a qcow_merge() is not triggered
>> for the last write of 16201 bytes.
>>
>> Note: manually flushing with qemu_fflush in s390-skeys vmsave also works
>> (maybe got lost in the noise).
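>>
>> For reference, by "manually flushing" I mean a debug hack along these lines
>> at the end of the skeys save routine in hw/s390x/s390-skeys.c (function name
>> from my reading of the file; this is not a proposed fix):
>>
>>     static void s390_storage_keys_save(QEMUFile *f, void *opaque)
>>     {
>>         /* ... existing key serialization ... */
>>
>>         /* debug hack: flush the QEMUFile buffer here, so the skeys data
>>          * goes out on its own before the snapshot code continues */
>>         qemu_fflush(f);
>>     }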
>>
>>
>>> Two questions:
>>>
>>> 1) Can you also reproduce the issue manually, without running iotest
>>> 267? ... I tried, but so far I failed.
>>
>> Thanks for the suggestion, will try.
>
> Currently I am trying to manually set up an environment similar to that of
> the test; at the moment I am not able to reproduce the issue by hand.
>
> I am not very familiar with s390; I've been running with
>
> export QEMU=/home/cfontana/qemu-build/s390x-softmmu/qemu-system-s390x
>
> $QEMU -nographic -monitor stdio -nodefaults -no-shutdown FILENAME
>
> where FILENAME is the qcow2 produced by the test.
>
> Let me know if you have a suggestion on how to set up something simple
> properly.
>
>
>>
>>>
>>> 2) Since all the information so far sounds like the problem could be
>>> elsewhere in the code, and the skeys just catch it by accident ... have
>>> you tried running with valgrind? Maybe it catches something useful?
>>
>> Nothing yet, but will fiddle with the options a bit more.
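>>
>> Probably something along these lines, with the coroutine stack-switch
>> warnings silenced via --max-stackframe (paths and values below are just
>> placeholders):
>>
>>     valgrind --track-origins=yes --leak-check=summary \
>>         --max-stackframe=200000000000 \
>>         ./s390x-softmmu/qemu-system-s390x -nographic -monitor stdio \
>>         -nodefaults -no-shutdown test.qcow2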
>
> Only thing I have seen so far:
>
>
> +==33321==
> +==33321== Warning: client switching stacks? SP change: 0x1ffeffe5e8 --> 0x5d9cf60
> +==33321==          to suppress, use: --max-stackframe=137324009096 or greater
> +==33321== Warning: client switching stacks? SP change: 0x5d9cd18 --> 0x1ffeffe5e8
> +==33321==          to suppress, use: --max-stackframe=137324009680 or greater
> +==33321== Warning: client switching stacks? SP change: 0x1ffeffe8b8 --> 0x5d9ce58
> +==33321==          to suppress, use: --max-stackframe=137324010080 or greater
> +==33321== further instances of this message will not be shown.
> +==33321== Thread 4:
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321== at 0x3AEC70: process_queued_cpu_work (cpus-common.c:331)
> +==33321== by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321== by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321== by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321== by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321== by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321== by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321==
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321== at 0x3AEC74: process_queued_cpu_work (cpus-common.c:331)
> +==33321== by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321== by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321== by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321== by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321== by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321== by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321==
> +==33321==
> +==33321== HEAP SUMMARY:
> +==33321== in use at exit: 2,138,442 bytes in 13,935 blocks
> +==33321== total heap usage: 19,089 allocs, 5,154 frees, 5,187,670 bytes allocated
> +==33321==
> +==33321== LEAK SUMMARY:
> +==33321== definitely lost: 0 bytes in 0 blocks
> +==33321== indirectly lost: 0 bytes in 0 blocks
> +==33321== possibly lost: 7,150 bytes in 111 blocks
> +==33321== still reachable: 2,131,292 bytes in 13,824 blocks
> +==33321== suppressed: 0 bytes in 0 blocks
> +==33321== Rerun with --leak-check=full to see details of leaked memory
>
>
>>
>>>
>>> Thomas
>>>
>>
>> Ciao,
>>
>> Claudio
>>
>>
>
> A more interesting update, I think, is the following.
>
> I was able to "fix" the problem shown by the reproducer:
>
> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>
>  static bool icount_state_needed(void *opaque)
>  {
> -    return use_icount;
> +    return 0;
>  }
>
> by just slowing down qcow2_co_pwritev_task_entry with some tight loops,
> without changing any fields between runs (other than the reproducer icount
> field removal).
>
> I tried to insert the same slowdown just in savevm.c at the end of
> save_snapshot, but that does not work; it needs to be in the coroutine.
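>
> Concretely, the "slowdown" is nothing clever, just a busy loop at the top of
> the coroutine, roughly like this (debug hack only; the iteration count is
> simply whatever makes the problem disappear here):
>
>     static coroutine_fn int qcow2_co_pwritev_task_entry(AioTask *task)
>     {
>         /* debug hack: burn some time before the write is actually issued */
>         for (volatile unsigned long i = 0; i < 100000000UL; i++) {
>             /* nothing */
>         }
>
>         /* ... existing code ... */
>     }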
>
> Thanks,
>
> Claudio
>
>