qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: migration: broken snapshot saves appear on s390 when small fields in


From: Bruce Rogers
Subject: Re: migration: broken snapshot saves appear on s390 when small fields in migration stream removed
Date: Tue, 28 Jul 2020 07:23:05 -0600
User-agent: Evolution 3.36.4

On Tue, 2020-07-28 at 11:43 +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.07.2020 02:09, Bruce Rogers wrote:
> > On Tue, 2020-07-21 at 10:22 +0200, Claudio Fontana wrote:
> > > On 7/20/20 8:24 PM, Claudio Fontana wrote:
> > > > I have now been able to reproduce this on X86 as well.
> > > > 
> > > > It happens much more rarely, about once every 10 times.
> > > > 
> > > > I will sort out the data and try to make it even more
> > > > reproducible,
> > > > then post my findings in detail.
> > > > 
> > > > Overall I proceeded as follows:
> > > > 
> > > > 1) hooked the savevm code to skip all fields with the exception
> > > > of
> > > > "s390-skeys". So only s390-skeys are actually saved.
> > > > 
> > > > 2) reimplemented "s390-skeys" in a common implementation in
> > > > cpus.c,
> > > > used on both x86 and s390, modeling the behaviour of save/load
> > > > from
> > > > hw/s390
> > > > 
> > > > 3) ran ./check -qcow2 267 on both x86 and s390.
> > > > 
> > > > In the case of s390, failure seems to be reproducible 100% of
> > > > the
> > > > times.
> > > > On X86, it is as mentioned failing about 10% of the times.
> > > > 
> > > > Ciao,
> > > > 
> > > > Claudio
> > > 
> > > And here is a small series of two patches that can be used to
> > > reproduce the problem.
> > > 
> > > Clearly, this is not directly related to s390 or to skeys or to
> > > icount in particular, it is just an issue that happened to be
> > > more
> > > visible there.
> > > 
> > > If you could help with this, please apply the attached patches.
> > > 
> > > Patch 1 just adds a new "300" iotest. It is way easier to extract
> > > the
> > > relevant part out of test 267, which does a bit too much in the
> > > same
> > > file.
> > > Also this allows easier use of valgrind, since it does not
> > > "require"
> > > anything.
> > > 
> > > Patch 2 hooks the savevm code to skip all fields during the
> > > snapshot
> > > with the exception of "s390-skeys", a new artificial field
> > > implemented to
> > > model what the real s390-skeys is doing.
> > > 
> > > After applying patch 1 and patch 2, you can test (also on X86),
> > > with:
> > > 
> > > ./check -qcow2 300
> > > 
> > > On X86 many runs will be successful, but a certain % of them will
> > > instead fail like this:
> > > 
> > > 
> > > claudio@linux-ch70:~/git/qemu-pristine/qemu-build/tests/qemu-
> > > iotests>
> > > ./check -qcow2 300
> > > QEMU          -- "/home/claudio/git/qemu-pristine/qemu-
> > > build/tests/qemu-iotests/../../x86_64-softmmu/qemu-system-x86_64"
> > > -nodefaults -display none -accel qtest
> > > QEMU_IMG      -- "/home/claudio/git/qemu-pristine/qemu-
> > > build/tests/qemu-iotests/../../qemu-img"
> > > QEMU_IO       -- "/home/claudio/git/qemu-pristine/qemu-
> > > build/tests/qemu-iotests/../../qemu-io"  --cache writeback --aio
> > > threads -f qcow2
> > > QEMU_NBD      -- "/home/claudio/git/qemu-pristine/qemu-
> > > build/tests/qemu-iotests/../../qemu-nbd"
> > > IMGFMT        -- qcow2 (compat=1.1)
> > > IMGPROTO      -- file
> > > PLATFORM      -- Linux/x86_64 linux-ch70 4.12.14-lp151.28.36-
> > > default
> > > TEST_DIR      -- /home/claudio/git/qemu-pristine/qemu-
> > > build/tests/qemu-iotests/scratch
> > > SOCK_DIR      -- /tmp/tmp.gdcUu3l0SM
> > > SOCKET_SCM_HELPER -- /home/claudio/git/qemu-pristine/qemu-
> > > build/tests/qemu-iotests/socket_scm_helper
> > > 
> > > 300      fail       [10:14:05] [10:14:06]      (last:
> > > 0s)    output
> > > mismatch (see 300.out.bad)
> > > --- /home/claudio/git/qemu-pristine/qemu/tests/qemu-
> > > iotests/300.out     2020-07-21 10:03:54.468104764 +0200
> > > +++ /home/claudio/git/qemu-pristine/qemu-build/tests/qemu-
> > > iotests/300.out.bad   2020-07-21 10:14:06.098090543 +0200
> > > @@ -12,6 +12,9 @@
> > >   ID        TAG                 VM
> > > SIZE                DATE       VM
> > > CLOCK
> > >   --        snap0                  SIZE yyyy-mm-dd
> > > hh:mm:ss   00:00:00.000
> > >   (qemu) loadvm snap0
> > > +Unexpected storage key data: 0
> > > +error while loading state for instance 0x0 of device 's390-
> > > skeys'
> > > +Error: Error -22 while loading VM state
> > >   (qemu) quit
> > >   
> > >   *** done
> > > Failures: 300
> > > Failed 1 of 1 iotests
> > > 
> > > 
> > > At this point somebody more knowledgeable about QCOW2, coroutines
> > > and
> > > backing files could chime in?
> > > 
> > <trim>
> > 
> > I used the reproducer you provide here to do a git bisect as I
> > assume
> > whatever is now broken wasn't always broken, and it pointed to the
> > following commit:
> > 
> > commit df893d25ceea3c0dcbe6d6b425309317fab6b22e (refs/bisect/bad)
> > Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > Date:   Tue Jun 4 19:15:13 2019 +0300
> > 
> >      block/qcow2: implement .bdrv_co_preadv_part
> > 
> > Indeed, I am currently able to reliable reproduce the issue with
> > this
> > commit applied, and not reproduce it without it.
> 
> Do you have a reproducer on top of df893d25ceea? I tried to apply the
> patches on it, there are a lot of conflicts and after solving them,
> the test not work for me.
> 
> What do you mean by not reproduce without df893d25ceea3c0d? I also
> tried to do "git rebase df893d25ceea3c0d --onto df893d25ceea3c0d^",
> after applying the reproducing patches on master, but again, there
> are conflicts..
> 
> How did you applied
> 
> commit 88f468e54696cd9ffc28ecf96f18f04125402541
> Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Date:   Mon Sep 16 20:53:22 2019 +0300
> 
>      block/qcow2: refactor qcow2_co_preadv_part
> 
> and, which is more important further
> 
> commit d710cf575ad5fb3ab329204620de45bfe50caa53
> Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Date:   Mon Sep 16 20:53:24 2019 +0300
> 
>      block/qcow2: introduce parallel subrequest handling in read and
> write
> 
> without df893d25ceea3c0d ?
> 
> 
> I can reproduce the bug on master with Claudio's two patches, but how
> to reproduce the relation with df893d25ceea3c0d?
> 
> 

Yeah, I didn't provide full reproducer info for the point where the
patch I mention got applied. No matter, it looks like a fix has been
identified in the mean time.

Thanks for the quick response, though.

Bruce




reply via email to

[Prev in Thread] Current Thread [Next in Thread]