Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection


From: Peter Lieven
Subject: Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection
Date: Fri, 17 Dec 2021 15:43:12 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

On 04.12.21 at 00:04, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2021 14:17, Peter Lieven wrote:
>> On 19.05.21 at 18:48, Kevin Wolf wrote:
>>> On 19.05.2021 at 15:24, Peter Lieven wrote:
>>>> On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 20.04.2021 18:04, Kevin Wolf wrote:
>>>>>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>>>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed
>>>>>>>> areas like non-zero data if the end of the checked area isn't
>>>>>>>> aligned. This can improve the efficiency of the conversion and was
>>>>>>>> introduced in commit 8dcd3c9b91a.
>>>>>>>>
>>>>>>>> However, it comes with a correctness problem: qemu-img convert is
>>>>>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>>>>>> any more. It turns out that this even happens when not only the
>>>>>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>>>>>> the bug report, conversion of a fragmented 10G image containing only
>>>>>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>>>>>> size is only 4 KiB.
>>>>>>>>
>>>>>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>>>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>>>>>> keep them as such, even if it may mean additional RMW cycles.
>>>>>>>>
>>>>>>> Hmm... If I understand correctly, we are going to do an unaligned
>>>>>>> write-zero. And that helps.
>>>>>> This can happen (mostly raw images on block devices, I think?), but
>>>>>> usually it just means skipping the write because we know that the target
>>>>>> image is already zeroed.
>>>>>>
>>>>>> What it does mean is that if the next part is data, we'll have an
>>>>>> unaligned data write.
>>>>>>
>>>>>>> Doesn't that mean that alignment is wrongly detected?
>>>>>> The problem is that you can have bdrv_block_status_above() return the
>>>>>> same allocation status multiple times in a row, but *pnum can be
>>>>>> unaligned for the conversion.
>>>>>>
>>>>>> We only look at a single range returned by it when detecting the
>>>>>> alignment, so it could be that we have zero buffers for both 0-11 and
>>>>>> 12-16 and detect two misaligned ranges, when both together are a
>>>>>> perfectly aligned zeroed range.
>>>>>>
>>>>>> In theory we could try to do some lookahead and merge ranges where
>>>>>> possible, which should give us the perfect result, but it would make the
>>>>>> code considerably more complicated. (Whether we want to merge them
>>>>>> doesn't only depend on the block status, but possibly also on the
>>>>>> content of a DATA range.)
>>>>>>
>>>>>> Kevin
>>>>>>
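To make this concrete, here is a tiny stand-alone illustration. The sector
numbers and the alignment value are invented, and the helper only mirrors the
pre-patch tail rule discussed here, not the real qemu code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Invented illustration, not qemu code: the pre-patch tail rule from
 * is_allocated_sectors(), applied to chunks that are entirely zero.
 */
static bool written_as_data(int64_t sector_num, int i, int alignment)
{
    bool is_zero = true;                           /* chunk is all zeros */
    int tail = (sector_num + i) & (alignment - 1); /* unaligned remainder */

    if (tail && i <= tail) {
        is_zero = false; /* a small zero tail is treated as data */
    }
    return !is_zero;
}

int main(void)
{
    int alignment = 8; /* sectors per cluster, e.g. 4 KiB / 512 bytes */

    /* block status splits one zero area into [8,11) and [11,16) */
    printf("[8,11):  data? %d\n", written_as_data(8, 3, alignment));  /* 1 */
    printf("[11,16): data? %d\n", written_as_data(11, 5, alignment)); /* 0 */

    /* the merged range [8,16) ends on a cluster boundary, stays sparse */
    printf("[8,16):  data? %d\n", written_as_data(8, 8, alignment));  /* 0 */
    return 0;
}

Looked at one range at a time, the first zero chunk is turned into a data
write; only the merged view would keep everything sparse.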
>>>>> Oh, I understand the problem now, thanks for the explanation.
>>>>>
>>>>> Hmm, yes, that means that if the whole buf is zero,
>>>>> is_allocated_sectors must not align it down, so that it can possibly
>>>>> be "merged" with the next chunk if that is zero too.
>>>>>
>>>>> But it's still good to align zeroes down if data starts somewhere
>>>>> inside the buf, isn't it?
>>>>>
>>>>> What about something like this:
>>>>>
>>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>>> index babb5573ab..d1704584a0 100644
>>>>> --- a/qemu-img.c
>>>>> +++ b/qemu-img.c
>>>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>>>>>          }
>>>>>      }
>>>>> 
>>>>> +    if (i == n) {
>>>>> +        /*
>>>>> +         * The whole buf is the same.
>>>>> +         *
>>>>> +         * If it's data, just return it. That's the old behavior.
>>>>> +         *
>>>>> +         * If it's zero, just return too. It will work well if the target is
>>>>> +         * already zeroed. And if the next chunk is zero too, we'll have no
>>>>> +         * RMW and no reason to write data.
>>>>> +         */
>>>>> +        *pnum = i;
>>>>> +        return !is_zero;
>>>>> +    }
>>>>> +
>>>>>      tail = (sector_num + i) & (alignment - 1);
>>>>>      if (tail) {
>>>>>          if (is_zero && i <= tail) {
>>>>> -            /* treat unallocated areas which only consist
>>>>> -             * of a small tail as allocated. */
>>>>> +            /*
>>>>> +             * For sure the next sector after i is data, and it will rewrite
>>>>> +             * this tail anyway due to RMW. So, let's just write data now.
>>>>> +             */
>>>>>              is_zero = false;
>>>>>          }
>>>>>          if (!is_zero) {
>>>>> -            /* align up end offset of allocated areas. */
>>>>> +            /* If possible, align up end offset of allocated areas. */
>>>>>              i += alignment - tail;
>>>>>              i = MIN(i, n);
>>>>>          } else {
>>>>> -            /* align down end offset of zero areas. */
>>>>> +            /*
>>>>> +             * For sure the next sector after i is data, and it will rewrite
>>>>> +             * this tail anyway due to RMW. Better to avoid RMW and write
>>>>> +             * zeroes up to the aligned boundary.
>>>>> +             */
>>>>>              i -= tail;
>>>>>          }
>>>>>      }
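For reference, below is a condensed, self-contained model of
is_allocated_sectors() with this change applied. It is a sketch under
simplifying assumptions: the sector scan is naive, qemu's buffer_is_zero() is
replaced by a plain loop, and qemu-specific types are dropped, so it is not
the actual qemu-img.c code.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Stand-in for qemu's buffer_is_zero(); the real one is vectorized. */
static bool sector_is_zero(const uint8_t *buf)
{
    for (size_t k = 0; k < SECTOR_SIZE; k++) {
        if (buf[k]) {
            return false;
        }
    }
    return true;
}

/*
 * Condensed model of is_allocated_sectors() with the (i == n) early
 * return applied: scan n sectors, report in *pnum the length of the
 * leading run of same-status sectors, and return nonzero if it is data.
 */
static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
                                int64_t sector_num, int alignment)
{
    bool is_zero;
    int i, tail;

    if (n <= 0) {
        *pnum = 0;
        return 0;
    }
    is_zero = sector_is_zero(buf);
    for (i = 1; i < n; i++) {
        if (sector_is_zero(buf + (size_t)i * SECTOR_SIZE) != is_zero) {
            break;
        }
    }

    if (i == n) {
        /*
         * The whole buffer has one status: report it as-is, so an
         * all-zero buffer is never forced to be written as data and
         * adjacent zero chunks can stay sparse.
         */
        *pnum = i;
        return !is_zero;
    }

    tail = (sector_num + i) & (alignment - 1);
    if (tail) {
        if (is_zero && i <= tail) {
            is_zero = false;       /* tiny zero head: write it as data */
        }
        if (!is_zero) {
            i += alignment - tail; /* align the data run up */
            i = MIN(i, n);
        } else {
            i -= tail;             /* align the zero run down */
        }
    }
    *pnum = i;
    return !is_zero;
}

The early return is the whole trick: a uniformly zero buffer is reported as
zero at full length, so consecutive zero chunks are never broken up by the
tail alignment.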
>>>> I think we forgot to follow up on this. Has anyone tested this
>>>> suggestion?
>>>>
>>>> Otherwise, I would try to rerun the tests I did with my old and
>>>> Kevin's suggestions.
>>> I noticed earlier this week that these patches are still in my
>>> development branch, but I didn't actually pick them up again yet. So feel
>>> free to try it out.
>>
>>
>> It seems this time I forgot to follow up. Is this topic still open?
>>
>
> Most probably yes :) I have now checked that my proposed diff is still
> applicable to master and doesn't break compilation. So, if you have a test,
> you can check whether it works better with the change.
>
Hi Vladimir, hi Kevin,


I was able to reproduce this on CentOS 7 with qemu 5.1.0. The effect of the
fragmented file is not as severe as in the ticket, but Vladimir's proposed fix
solves the issue. I can also confirm that it does not introduce additional
overhead for my use cases. Vladimir, can you send a proper patch, please?

[root@VCORE-188 qemu]# ./qemu-img create -f raw -o extent_size_hint=1M /mnt/test.raw 10G
Formatting '/mnt/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
[root@VCORE-188 qemu]# ./qemu-img bench -f raw -t none -n -w /mnt/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 397.917 seconds.
[root@VCORE-188 qemu]# filefrag /mnt/test.raw
/mnt/test.raw: 1000000 extents found
[root@VCORE-188 qemu]# ./qemu-img bench -f raw -t none -n -w /mnt/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 200.162 seconds.
[root@VCORE-188 qemu]# filefrag /mnt/test.raw
/mnt/test.raw: 2000000 extents found
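(For context: the two bench runs write 4096-byte requests at an 8192-byte
step, first from offset 0 and then from offset 4096, so together they fill
the first ~8 GiB of the image as two million interleaved 4 KiB extents. If I
read qemu-img bench correctly, the default write pattern is all zeroes, so
this reproduces the fragmented all-zero image from the commit message.)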
[root@VCORE-188 qemu]# ./qemu-img convert -p -f raw -O raw /mnt/test.raw /mnt/convert.raw
    (100.00/100%)
[root@VCORE-188 qemu]# ./qemu-img info /mnt/convert.raw
image: /mnt/convert.raw
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 11.3 MiB


11.3 MiB is not as bad as the several GB from the ticket, but I still see a
non-zero disk size where there should be no usage at all.


[root@VCORE-188 qemu]# vi qemu-img.c
[root@VCORE-188 qemu]# make -f Makefile -j8 qemu-img
make[1]: Entering directory `/usr/src/qemu/slirp'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory `/usr/src/qemu/slirp'
  CC      qemu-img.o
  LINK    qemu-img

Applied Vladimir's fix with the handling of (i == n) here.


[root@VCORE-188 qemu]# ./qemu-img convert -p -f raw -O raw /mnt/test.raw /mnt/convert.raw
    (100.00/100%)
[root@VCORE-188 qemu]# ./qemu-img info /mnt/convert.raw
image: /mnt/convert.raw
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 4 KiB
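With the fix, the converted image is down to the 4 KiB expected size quoted
in the commit message: the fragmented all-zero image sparsifies completely
again.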


Best,

Peter