qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: migration: broken snapshot saves appear on s390 when small fields in


From: Claudio Fontana
Subject: Re: migration: broken snapshot saves appear on s390 when small fields in migration stream removed
Date: Mon, 20 Jul 2020 20:24:11 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1

I have now been able to reproduce this on X86 as well.

It happens much more rarely, about once every 10 times.

I will sort out the data and try to make it even more reproducible, then post 
my findings in detail.

Overall I proceeded as follows:

1) hooked the savevm code to skip all fields with the exception of 
"s390-skeys". So only s390-skeys are actually saved.

2) reimplemented "s390-skeys" in a common implementation in cpus.c, used on 
both x86 and s390, modeling the behaviour of save/load from hw/s390

3) ran ./check -qcow2 267 on both x86 and s390.

In the case of s390, failure seems to be reproducible 100% of the times.
On X86, it is as mentioned failing about 10% of the times.

Ciao,

Claudio


On 7/16/20 2:58 PM, Claudio Fontana wrote:
> Small update on this,
> 
> On 7/15/20 1:10 PM, Claudio Fontana wrote:
>> Hi Thomas,
>>
>> On 7/14/20 4:35 PM, Thomas Huth wrote:
>>> On 14/07/2020 16.29, Claudio Fontana wrote:
>>>> Hello,
>>>>
>>>> I have some tiny progress in narrowing down this issue, possibly a qcow2 
>>>> issue, still unclear,
>>>> but involving Kevin Wolf and Max Reitz.
>>>>
>>>>
>>>> The reproducer again:
>>>>
>>>>> --------------------------------------------cut-------------------------------------------
>>>>> diff --git a/cpus.c b/cpus.c
>>>>> index 41d1c5099f..443b88697a 100644
>>>>> --- a/cpus.c
>>>>> +++ b/cpus.c
>>>>> @@ -643,7 +643,7 @@ static void qemu_account_warp_timer(void)
>>>>>  
>>>>>  static bool icount_state_needed(void *opaque)
>>>>>  {
>>>>> -    return use_icount;
>>>>> +    return 0;
>>>>>  }
>>>>>  
>>>>>  static bool warp_timer_state_needed(void *opaque)
>>>>> --------------------------------------------cut-------------------------------------------
>>>>
>>>> This issue for now appears on s390 only:
>>>>
>>>> On s390 hardware, test 267 fails (both kvm and tcg) in the qcow2 backing 
>>>> file part, with broken migration stream data in the s390-skeys vmsave (old 
>>>> style).
>>> [...]
>>>> If someone has a good idea let me know - first attempts to reproduce on 
>>>> x86 failed, but maybe more work could lead to it.
>>>
>>
>> small update: in the GOOD case (enough padding added) a qcow_merge() is 
>> triggered for the last write of 16202 bytes.
>> In the BAD case (not enough padding added) a qcow_merge() is not triggered 
>> for the last write of 16201 bytes.
>>
>> Note: manually flushing with qemu_fflush in s390-skeys vmsave also works 
>> (maybe got lost in the noise).
>>
>>
>>> Two questions:
>>>
>>> 1) Can you also reproduce the issue manually, without running iotest
>>> 267? ... I tried, but so far I failed.
>>
>> Thanks for the suggestion, will try.
> 
> Currently trying to reproduce manually an environment similar to that of the 
> test,
> at the moment I am not able to reproduce the issue manually.
> 
> Not very familiar with s390,
> I've been running with 
> 
> export QEMU=/home/cfontana/qemu-build/s390x-softmmu/qemu-system-s390x
> 
> $QEMU -nographic -monitor stdio -nodefaults -no-shutdown FILENAME
> 
> where FILENAME is the qcow2 produced by the test.
> 
> let me know if you have a suggestion on how to setup up something simple 
> properly.
> 
> 
>>
>>>
>>> 2) Since all the information so far sounds like the problem could be
>>> elsewhere in the code, and the skeys just catch it by accident ... have
>>> you tried running with valgrind? Maybe it catches something useful?
>>
>> Nothing yet, but will fiddle with the options a bit more.
> 
> Only thing I have seen so far:
> 
> 
> +==33321== 
> +==33321== Warning: client switching stacks?  SP change: 0x1ffeffe5e8 --> 
> 0x5d9cf60
> +==33321==          to suppress, use: --max-stackframe=137324009096 or greater
> +==33321== Warning: client switching stacks?  SP change: 0x5d9cd18 --> 
> 0x1ffeffe5e8
> +==33321==          to suppress, use: --max-stackframe=137324009680 or greater
> +==33321== Warning: client switching stacks?  SP change: 0x1ffeffe8b8 --> 
> 0x5d9ce58
> +==33321==          to suppress, use: --max-stackframe=137324010080 or greater
> +==33321==          further instances of this message will not be shown.
> +==33321== Thread 4:
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321==    at 0x3AEC70: process_queued_cpu_work (cpus-common.c:331)
> +==33321==    by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321==    by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321==    by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321==    by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321==    by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321==    by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321== 
> +==33321== Conditional jump or move depends on uninitialised value(s)
> +==33321==    at 0x3AEC74: process_queued_cpu_work (cpus-common.c:331)
> +==33321==    by 0x2753E1: qemu_wait_io_event_common (cpus.c:1213)
> +==33321==    by 0x2755CD: qemu_wait_io_event (cpus.c:1253)
> +==33321==    by 0x27596D: qemu_dummy_cpu_thread_fn (cpus.c:1337)
> +==33321==    by 0x725C87: qemu_thread_start (qemu-thread-posix.c:521)
> +==33321==    by 0x4D504E9: start_thread (in /lib64/libpthread-2.26.so)
> +==33321==    by 0x4E72BBD: ??? (in /lib64/libc-2.26.so)
> +==33321== 
> +==33321== 
> +==33321== HEAP SUMMARY:
> +==33321==     in use at exit: 2,138,442 bytes in 13,935 blocks
> +==33321==   total heap usage: 19,089 allocs, 5,154 frees, 5,187,670 bytes 
> allocated
> +==33321== 
> +==33321== LEAK SUMMARY:
> +==33321==    definitely lost: 0 bytes in 0 blocks
> +==33321==    indirectly lost: 0 bytes in 0 blocks
> +==33321==      possibly lost: 7,150 bytes in 111 blocks
> +==33321==    still reachable: 2,131,292 bytes in 13,824 blocks
> +==33321==         suppressed: 0 bytes in 0 blocks
> +==33321== Rerun with --leak-check=full to see details of leaked memory
> 
> 
>>
>>>
>>>  Thomas
>>>
>>
>> Ciao,
>>
>> Claudio
>>
>>
> 
> A more interesting update is what follows I think.
> 
> I was able to "fix" the problem shown by the reproducer:
> 
> @@ -643,7 +643,7 @@ static void qemu_account_warp_tim@@ -643,7 +643,7 @@ 
> static void qemu_account_warp_timer(void)
>  
>  static bool icount_state_needed(void *opaque)
>  {
> -    return use_icount;
> +    return 0;
>  }
>  
> by just slowing down qcow2_co_pwritev_task_entry with some tight loops,
> without changing any fields between runs (other than the reproducer icount 
> field removal).
> 
> I tried to insert the same slowdown just in savevm.c at the end of 
> save_snapshot, but that does not work, needs to be in the coroutine.
> 
> Thanks,
> 
> Claudio
> 
> 




reply via email to

[Prev in Thread] Current Thread [Next in Thread]