Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection


From: Peter Lieven
Subject: Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection
Date: Sat, 4 Dec 2021 10:53:02 +0100


> On 04.12.2021 at 00:04, Vladimir Sementsov-Ogievskiy 
> <vsementsov@virtuozzo.com> wrote:
> 
> 03.12.2021 14:17, Peter Lieven wrote:
>>> On 19.05.21 at 18:48, Kevin Wolf wrote:
>>> On 19.05.2021 at 15:24, Peter Lieven wrote:
>>>> On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 20.04.2021 18:04, Kevin Wolf wrote:
>>>>>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>>>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed 
>>>>>>>> areas
>>>>>>>> like non-zero data if the end of the checked area isn't aligned. This
>>>>>>>> can improve the efficiency of the conversion and was introduced in
>>>>>>>> commit 8dcd3c9b91a.
>>>>>>>> 
>>>>>>>> However, it comes with a correctness problem: qemu-img convert is
>>>>>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>>>>>> any more. It turns out that this even happens when not only the
>>>>>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>>>>>> the bug report, conversion of a fragmented 10G image containing only
>>>>>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>>>>>> size is only 4 KiB.
>>>>>>>> 
>>>>>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>>>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>>>>>> keep them as such, even if it may mean additional RMW cycles.
>>>>>>>> 
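As a minimal illustration of the RMW cost being traded off above, the
following sketch shows why a write that does not start or end on a cluster
boundary forces a read-modify-write cycle; the 4 KiB cluster size and the
offsets are assumed values for the example, not taken from the bug report:

#include <stdio.h>

int main(void)
{
    const long cluster = 4096;          /* assumed target cluster size    */
    const long off = 6144, len = 2048;  /* a data write after a zero tail */

    long head = off % cluster;          /* unaligned bytes at the start   */
    long tail = (off + len) % cluster;  /* unaligned bytes at the end     */

    /* any partially covered cluster must be read back, merged with the
     * new bytes and rewritten in full */
    printf("write [%ld,%ld): head %ld, tail %ld -> %s\n",
           off, off + len, head, tail,
           (head || tail) ? "RMW needed" : "no RMW");
    return 0;
}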
>>>>>>> Hmm... If I understand correctly, we are going to do an unaligned
>>>>>>> write-zero. And that helps.
>>>>>> This can happen (mostly raw images on block devices, I think?), but
>>>>>> usually it just means skipping the write because we know that the target
>>>>>> image is already zeroed.
>>>>>> 
>>>>>> What it does mean is that if the next part is data, we'll have an
>>>>>> unaligned data write.
>>>>>> 
>>>>>>> Doesn't that mean that alignment is wrongly detected?
>>>>>> The problem is that you can have bdrv_block_status_above() return the
>>>>>> same allocation status multiple times in a row, but *pnum can be
>>>>>> unaligned for the conversion.
>>>>>> 
>>>>>> We only look at a single range returned by it when detecting the
>>>>>> alignment, so it could be that we have zero buffers for both 0-11 and
>>>>>> 12-16 and detect two misaligned ranges, when both together are a
>>>>>> perfectly aligned zeroed range.
>>>>>> 
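To make this concrete, here is a small sketch of the example above; the
8-sector alignment is an assumption for illustration. Each piece has an
unaligned edge at sector 12, yet the merged range is perfectly aligned:

#include <stdio.h>

static void check(int start, int end, int alignment)
{
    printf("range [%2d,%2d): start %saligned, end %saligned\n",
           start, end,
           start % alignment ? "mis" : "", end % alignment ? "mis" : "");
}

int main(void)
{
    const int alignment = 8;  /* assumed cluster size in sectors */

    check(0, 12, alignment);  /* first zero piece: end is misaligned    */
    check(12, 16, alignment); /* second zero piece: start is misaligned */
    check(0, 16, alignment);  /* merged: one perfectly aligned zero run */
    return 0;
}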
>>>>>> In theory we could try to do some lookahead and merge ranges where
>>>>>> possible, which should give us the perfect result, but it would make the
>>>>>> code considerably more complicated. (Whether we want to merge them
>>>>>> doesn't only depend on the block status, but possibly also on the
>>>>>> content of a DATA range.)
>>>>>> 
>>>>>> Kevin
>>>>>> 
>>>>> Oh, now I understand the problem, thanks for the explanation.
>>>>> 
>>>>> Hmm, yes, that means that if the whole buf is zero, is_allocated_sectors 
>>>>> must not align it down, so that it can possibly be "merged" with the next 
>>>>> chunk if that is zero too.
>>>>> 
>>>>> But it's still good to align zeroes down if data starts somewhere inside 
>>>>> the buf, isn't it?
>>>>> 
>>>>> What about something like this:
>>>>> 
>>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>>> index babb5573ab..d1704584a0 100644
>>>>> --- a/qemu-img.c
>>>>> +++ b/qemu-img.c
>>>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>>>>>          }
>>>>>      }
>>>>>  
>>>>> +    if (i == n) {
>>>>> +        /*
>>>>> +         * The whole buf is the same.
>>>>> +         *
>>>>> +         * If it's data, just return it. This is the old behavior.
>>>>> +         *
>>>>> +         * If it's zero, just return too. It will work well if the target is
>>>>> +         * already zeroed. And if the next chunk is zero too, we'll have no
>>>>> +         * RMW and no reason to write data.
>>>>> +         */
>>>>> +        *pnum = i;
>>>>> +        return !is_zero;
>>>>> +    }
>>>>> +
>>>>>      tail = (sector_num + i) & (alignment - 1);
>>>>>      if (tail) {
>>>>>          if (is_zero && i <= tail) {
>>>>> -            /* treat unallocated areas which only consist
>>>>> -             * of a small tail as allocated. */
>>>>> +            /*
>>>>> +             * For sure the next sector after i is data, and it will rewrite
>>>>> +             * this tail anyway due to RMW. So, let's just write data now.
>>>>> +             */
>>>>>              is_zero = false;
>>>>>          }
>>>>>          if (!is_zero) {
>>>>> -            /* align up end offset of allocated areas. */
>>>>> +            /* If possible, align up the end offset of allocated areas. */
>>>>>              i += alignment - tail;
>>>>>              i = MIN(i, n);
>>>>>          } else {
>>>>> -            /* align down end offset of zero areas. */
>>>>> +            /*
>>>>> +             * For sure the next sector after i is data, and it will rewrite
>>>>> +             * this tail anyway due to RMW. Better to avoid RMW and write
>>>>> +             * zeroes up to the aligned boundary.
>>>>> +             */
>>>>>              i -= tail;
>>>>>          }
>>>>>      }
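For experimenting with the proposed behavior outside of qemu-img, here is a
standalone sketch of the patched decision logic. The 512-byte sector size and
the per-sector zero scan are simplifications (the real function scans with
buffer_is_zero()), so treat this only as an approximation:

#include <stdbool.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* simplified zero check for one 512-byte sector */
static bool sector_is_zero(const uint8_t *buf)
{
    for (int b = 0; b < 512; b++) {
        if (buf[b]) {
            return false;
        }
    }
    return true;
}

/* Returns true if the first *pnum sectors of buf should be written as data,
 * false if they can be written as zeroes (or skipped entirely). */
static bool sketch_is_allocated(const uint8_t *buf, int64_t sector_num,
                                int n, int *pnum, int alignment)
{
    bool is_zero = sector_is_zero(buf);
    int i;

    /* find where the zero/data status first changes */
    for (i = 1; i < n; i++) {
        if (sector_is_zero(buf + i * 512) != is_zero) {
            break;
        }
    }

    if (i == n) {
        /* whole buffer is uniform: report it unmodified so adjacent zero
         * chunks can still line up into one aligned hole */
        *pnum = i;
        return !is_zero;
    }

    int tail = (sector_num + i) & (alignment - 1);
    if (tail) {
        if (is_zero && i <= tail) {
            is_zero = false;                  /* tiny zero head: write as data */
        }
        if (!is_zero) {
            i = MIN(i + alignment - tail, n); /* align data up */
        } else {
            i -= tail;                        /* align zeroes down */
        }
    }
    *pnum = i;
    return !is_zero;
}

With an all-zero buffer this now returns false with *pnum == n regardless of
alignment, which is exactly the behavioral change under discussion.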
>>>> I think we forgot to follow up on this. Has anyone tested this
>>>> suggestion?
>>>> 
>>>> Otherwise, I would try to rerun the tests I did with my old suggestion
>>>> and Kevin's.
>>> I noticed earlier this week that these patches are still in my
>>> development branch, but I haven't actually picked them up again yet. So
>>> feel free to try them out.
>> It seems this time I forgot to follow up. Is this topic still open?
> 
> Most probably yes :) I have now checked that my proposed diff still applies 
> to master and doesn't break compilation. So, if you have a test, you can 
> check whether it works better with the change.

Unfortunately, the reproducer in the original ticket does not work for me. It 
might also depend on the XFS version used.
What I can say is that your patch has no negative impact on my use cases.
If I find the time, I will try the reproducer on CentOS 7.

Best,
Peter
