Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection
From: Peter Lieven
Subject: Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection
Date: Fri, 17 Dec 2021 15:43:12 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
On 04.12.21 at 00:04, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2021 14:17, Peter Lieven wrote:
>> On 19.05.21 at 18:48, Kevin Wolf wrote:
>>> On 19.05.2021 at 15:24, Peter Lieven wrote:
>>>> On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 20.04.2021 18:04, Kevin Wolf wrote:
>>>>>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>>>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed
>>>>>>>> areas
>>>>>>>> like non-zero data if the end of the checked area isn't aligned. This
>>>>>>>> can improve the efficiency of the conversion and was introduced in
>>>>>>>> commit 8dcd3c9b91a.
>>>>>>>>
>>>>>>>> However, it comes with a correctness problem: qemu-img convert is
>>>>>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>>>>>> any more. It turns out that this even happens when not only the
>>>>>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>>>>>> the bug report, conversion of a fragmented 10G image containing only
>>>>>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>>>>>> size is only 4 KiB.
>>>>>>>>
>>>>>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>>>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>>>>>> keep them as such, even if it may mean additional RMW cycles.
>>>>>>>>
>>>>>>> Hmm.. If I understand correctly, we are going to do unaligned
>>>>>>> write-zero. And that helps.
>>>>>> This can happen (mostly raw images on block devices, I think?), but
>>>>>> usually it just means skipping the write because we know that the target
>>>>>> image is already zeroed.
>>>>>>
>>>>>> What it does mean is that if the next part is data, we'll have an
>>>>>> unaligned data write.
>>>>>>
>>>>>>> Doesn't that mean that alignment is wrongly detected?
>>>>>> The problem is that you can have bdrv_block_status_above() return the
>>>>>> same allocation status multiple times in a row, but *pnum can be
>>>>>> unaligned for the conversion.
>>>>>>
>>>>>> We only look at a single range returned by it when detecting the
>>>>>> alignment, so it could be that we have zero buffers for both 0-11 and
>>>>>> 12-16 and detect two misaligned ranges, when both together are a
>>>>>> perfectly aligned zeroed range.
>>>>>>
>>>>>> In theory we could try to do some lookahead and merge ranges where
>>>>>> possible, which should give us the perfect result, but it would make the
>>>>>> code considerably more complicated. (Whether we want to merge them
>>>>>> doesn't only depend on the block status, but possibly also on the
>>>>>> content of a DATA range.)
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>> Oh, I understand now the problem, thanks for explanation.
>>>>>
>>>>> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors
>>>>> must not align it down, to be possibly "merged" with next chunk if it is
>>>>> zero too.
>>>>>
>>>>> But it's still good to align zeroes down, if data starts somewhere inside
>>>>> the buf, isn't it?
>>>>>
>>>>> what about something like this:
>>>>>
>>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>>> index babb5573ab..d1704584a0 100644
>>>>> --- a/qemu-img.c
>>>>> +++ b/qemu-img.c
>>>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>>>>>          }
>>>>>      }
>>>>> +    if (i == n) {
>>>>> +        /*
>>>>> +         * The whole buf is the same.
>>>>> +         *
>>>>> +         * If it's data, just return it. That's the old behavior.
>>>>> +         *
>>>>> +         * If it's zero, just return too. It works well if the target is
>>>>> +         * already zeroed. And if the next chunk is zero too, we'll have
>>>>> +         * no RMW and no reason to write data.
>>>>> +         */
>>>>> +        *pnum = i;
>>>>> +        return !is_zero;
>>>>> +    }
>>>>> +
>>>>>      tail = (sector_num + i) & (alignment - 1);
>>>>>      if (tail) {
>>>>>          if (is_zero && i <= tail) {
>>>>> -            /* treat unallocated areas which only consist
>>>>> -             * of a small tail as allocated. */
>>>>> +            /*
>>>>> +             * For sure the next sector after i is data, and it will
>>>>> +             * rewrite this tail anyway due to RMW. So, let's just write
>>>>> +             * data now.
>>>>> +             */
>>>>>              is_zero = false;
>>>>>          }
>>>>>          if (!is_zero) {
>>>>> -            /* align up end offset of allocated areas. */
>>>>> +            /* If possible, align up the end offset of allocated areas. */
>>>>>              i += alignment - tail;
>>>>>              i = MIN(i, n);
>>>>>          } else {
>>>>> -            /* align down end offset of zero areas. */
>>>>> +            /*
>>>>> +             * For sure the next sector after i is data, and it will
>>>>> +             * rewrite this tail anyway due to RMW. Better to avoid RMW
>>>>> +             * and write zeroes up to the aligned bound.
>>>>> +             */
>>>>>              i -= tail;
>>>>>          }
>>>>>      }
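The proposed logic can also be modeled as a standalone sketch. This is a simplification, not the real is_allocated_sectors(): one byte stands in for one sector's zero/data classification (the real code checks sector-sized chunks with buffer_is_zero()), and the name model_is_allocated() is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for QEMU's MIN() macro. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Simplified model of is_allocated_sectors() with the proposed fix.
 * Returns true if the chunk should be written as data; *pnum receives
 * the number of "sectors" (bytes here) covered by the answer.
 */
static bool model_is_allocated(const uint8_t *buf, int n, int *pnum,
                               int64_t sector_num, int alignment)
{
    bool is_zero = (buf[0] == 0);
    int i, tail;

    /* Scan until the zero/non-zero classification changes. */
    for (i = 1; i < n; i++) {
        if ((buf[i] == 0) != is_zero) {
            break;
        }
    }

    if (i == n) {
        /* Whole buffer is uniform: report it as-is, no alignment games. */
        *pnum = i;
        return !is_zero;
    }

    tail = (sector_num + i) & (alignment - 1);
    if (tail) {
        if (is_zero && i <= tail) {
            /* Tiny zero run before data: treat it as data. */
            is_zero = false;
        }
        if (!is_zero) {
            i += alignment - tail;   /* align data range up */
            i = MIN(i, n);
        } else {
            i -= tail;               /* align zero range down */
        }
    }
    *pnum = i;
    return !is_zero;
}
```

With an all-zero 16-sector buffer the early return fires: the function reports the whole chunk as zero (*pnum == 16) instead of rounding it, so adjacent zero chunks stay mergeable and the image stays sparse.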
>>>> I think we forgot to follow up on this. Has anyone tested this
>>>> suggestion?
>>>>
>>>> Otherwise, I would try to rerun the tests I did with my old and
>>>> Kevin's suggestions.
>>> I noticed earlier this week that these patches are still in my
>>> development branch, but didn't actually pick it up again yet. So feel
>>> free to try it out.
>>
>>
>> It seems this time I forgot to follow up. Is this topic still open?
>>
>
> Most probably yes :) I now checked that my proposed diff still applies
> to master and doesn't break compilation. So, if you have some test, you can
> check whether it works better with the change.
>
Hi Vladimir, hi Kevin,
I was able to reproduce on CentOS 7 with qemu 5.1.0. The effect on the bad file
is not as severe as in the ticket, but Vladimir's proposed fix solves the issue.
I can also confirm that it does not introduce additional overhead in my use
cases. Vladimir, can you send a proper patch please?
[root@VCORE-188 qemu]# ./qemu-img create -f raw -o extent_size_hint=1M /mnt/test.raw 10G
Formatting '/mnt/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
[root@VCORE-188 qemu]# ./qemu-img bench -f raw -t none -n -w /mnt/test.raw -c 1000000 -S 8192 -o 0
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
Run completed in 397.917 seconds.
[root@VCORE-188 qemu]# filefrag /mnt/test.raw
/mnt/test.raw: 1000000 extents found
[root@VCORE-188 qemu]# ./qemu-img bench -f raw -t none -n -w /mnt/test.raw -c 1000000 -S 8192 -o 4096
Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
Run completed in 200.162 seconds.
[root@VCORE-188 qemu]# filefrag /mnt/test.raw
/mnt/test.raw: 2000000 extents found
[root@VCORE-188 qemu]# ./qemu-img convert -p -f raw -O raw /mnt/test.raw /mnt/convert.raw
(100.00/100%)
[root@VCORE-188 qemu]# ./qemu-img info /mnt/convert.raw
image: /mnt/convert.raw
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 11.3 MiB
11.3 MiB is not as severe as the several GB from the ticket, but I still see a
non-zero disk size where there should be no usage at all.
[root@VCORE-188 qemu]# vi qemu-img.c
[root@VCORE-188 qemu]# make -f Makefile -j8 qemu-img
make[1]: Entering directory `/usr/src/qemu/slirp'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory `/usr/src/qemu/slirp'
CC qemu-img.o
LINK qemu-img
Applied Vladimir's fix with the handling of (i == n) here.
[root@VCORE-188 qemu]# ./qemu-img convert -p -f raw -O raw /mnt/test.raw /mnt/convert.raw
(100.00/100%)
[root@VCORE-188 qemu]# ./qemu-img info /mnt/convert.raw
image: /mnt/convert.raw
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 4 KiB
Best,
Peter