qemu-stable
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-stable] [PATCH v2 5/9] block: Pass unaligned discard requests


From: Kevin Wolf
Subject: Re: [Qemu-stable] [PATCH v2 5/9] block: Pass unaligned discard requests to drivers
Date: Tue, 22 Nov 2016 15:03:43 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong from the data that
> the guest sees.  But at least the Dell Equallogic iSCSI SANs
> has an interesting property that its advertised discard
> alignment is 15M, yet documents that discarding a sequence
> of 1M slices will eventually result in the 15M page being
> marked as discarded, and it is possible to observe which
> pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding it to alignment boundaries early.
> 
> Rework the block layer discard algorithm to mirror the write
> zero algorithm: always peel off any unaligned head or tail
> and manage that in isolation, then do the bulk of the request
> on an aligned boundary.  The fallback when the driver returns
> -ENOTSUP for an unaligned request is to silently ignore that
> portion of the discard request; but for devices that can pass
> the partial request all the way down to hardware, this can
> result in the hardware coalescing requests and discarding
> aligned pages after all.
> 
> Reported by: Peter Lieven <address@hidden>
> CC: address@hidden
> Signed-off-by: Eric Blake <address@hidden>
> 
> ---
> v2: Split at more boundaries.
> ---
>  block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 085ac34..4f00562 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, 
> int64_t offset,
>  {
>      BdrvTrackedRequest req;
>      int max_pdiscard, ret;
> -    int head, align;
> +    int head, tail, align;
> 
>      if (!bs->drv) {
>          return -ENOMEDIUM;
> @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState 
> *bs, int64_t offset,
>          return 0;
>      }
> 
> -    /* Discard is advisory, so ignore any unaligned head or tail */
> +    /* Discard is advisory, but some devices track and coalesce
> +     * unaligned requests, so we must pass everything down rather than
> +     * round here.  Still, most devices will just silently ignore
> +     * unaligned requests (by returning -ENOTSUP), so we must fragment
> +     * the request accordingly.  */
>      align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
>      assert(align % bs->bl.request_alignment == 0);
>      head = offset % align;
> -    if (head) {
> -        head = MIN(count, align - head);
> -        count -= head;
> -        offset += head;
> -    }
> -    count = QEMU_ALIGN_DOWN(count, align);
> -    if (!count) {
> -        return 0;
> -    }
> +    tail = (offset + count) % align;
> 
>      bdrv_inc_in_flight(bs);
>      tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
> @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState 
> *bs, int64_t offset,
> 
>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, 
> INT_MAX),
>                                     align);
> -    assert(max_pdiscard);
> +    assert(max_pdiscard >= bs->bl.request_alignment);
> 
>      while (count > 0) {
>          int ret;
> -        int num = MIN(count, max_pdiscard);
> +        int num = count;
> +
> +        if (head) {
> +            /* Make small requests to get to alignment boundaries. */
> +            num = MIN(count, align - head);
> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
> +                num %= bs->bl.request_alignment;
> +            }
> +            head = (head + num) % align;
> +            assert(num < max_pdiscard);
> +        } else if (tail) {
> +            if (num > align) {
> +                /* Shorten the request to the last aligned cluster.  */
> +                num -= tail;
> +            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
> +                       tail > bs->bl.request_alignment) {
> +                tail %= bs->bl.request_alignment;
> +                num -= tail;
> +            }
> +        }

Hm, from the commit message I expected splitting requests in three
(head, bulk, tail), but actually we can end up splitting it in five?

Just to check whether I got this right, let me try an example: Let's
assume request alignment 512, pdiscard alignment 64k, and we get a
discard request with offset 510, length 130k. This algorithm makes the
following calls to the driver:

1. pdiscard offset=510, len=2               | new count = 130k - 2
2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
3. pdiscard offset=64k, len=64k             | new count = 2558
4. pdiscard offset=128k, len=2048           | new count = 510
5. pdiscard offset=130k, len=510            | new count = 0

Correct?

If so, is this really necessary or even helpful? I see that the iscsi
driver throws away requests 1 and 5 and needs the split because
otherwise it would disregard the areas covered by 2 and 4, too. But why
can't or shouldn't the iscsi driver do this rounding itself when it gets
combined 1+2 and 4+5 requests?

Kevin



reply via email to

[Prev in Thread] Current Thread [Next in Thread]