From: Kevin Wolf
Subject: Re: [Qemu-block] [PATCH 5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment
Date: Wed, 3 Apr 2019 15:03:54 +0200
User-agent: Mutt/1.11.3 (2019-02-01)
On 03.04.2019 at 05:05, Eric Blake wrote:
> Previous patches mentioned how the blkdebug filter driver demonstrates
> a bug present in our NBD server; the bug is also present with the raw
> format driver when probing occurs. Basically, if we specify a
> particular alignment > 1, but defer the actual block status to the
> underlying file, and the underlying file has a smaller alignment, then
> the use of BDRV_BLOCK_RAW to defer to the underlying file can end up
> with status split at an alignment unacceptable to our layer. Many
> callers don't care, but NBD does - it is a violation of the NBD
> protocol to send status or read results split on an unaligned boundary
> (we've taught our client to be tolerant of such violations because of
> qemu 3.1; but we still need to fix our server for the sake of other
> stricter clients).
>
> This patch lays the groundwork - it adjusts bdrv_block_status to
> ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
> the deferral is either truncated down to an aligned boundary, or
> multiple sub-aligned requests are coalesced into a single
> representative answer (using an implicit hole beyond EOF as
> needed). Iotest 241 exposes the effect (when format probing occurred,
> we don't want block status to subdivide the initial sector, and thus
> not any other sector either). Similarly, iotest 221 is a good
> candidate to amend to specifically track the effects; a change to a
> hole at EOF is not visible if an upper layer does not want alignment
> smaller than 512. However, this patch alone is not a complete fix - it
> is still possible to have an active layer with large alignment
> constraints backed by another layer with smaller constraints; so the
> next patch will complete the task.
>
> Signed-off-by: Eric Blake <address@hidden>
> ---
>  block/io.c                 | 108 +++++++++++++++++++++++++++++++++++--
>  tests/qemu-iotests/221     |  10 ++++
>  tests/qemu-iotests/221.out |   6 +++
>  tests/qemu-iotests/241.out |   3 +-
>  4 files changed, 122 insertions(+), 5 deletions(-)
>
> diff --git a/block/io.c b/block/io.c
> index dfc153b8d82..936877d3745 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2021,6 +2021,97 @@ int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
>      return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
>  }
>
> +static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> +                                             bool want_zero,
> +                                             int64_t offset, int64_t bytes,
> +                                             int64_t *pnum, int64_t *map,
> +                                             BlockDriverState **file);
> +
> +/*
> + * Returns an aligned allocation status of the specified sectors.
> + * Wrapper around bdrv_co_block_status() which requires the initial
> + * offset and count to be aligned, and guarantees the result will also
> + * be aligned (even if it requires piecing together multiple
> + * sub-aligned queries into an appropriate coalesced answer).
> + */
I think this comment should specify the result of the operation when the
aligned region consists of multiple subregions with different status.
Probably something like this:
- BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
- BDRV_BLOCK_ZERO is set if the flag is set for all subregions
- BDRV_BLOCK_OFFSET_VALID is set if the flag is set for all subregions,
the provided offsets are contiguous and file is the same for all
subregions.
- BDRV_BLOCK_ALLOCATED is never set here (the caller sets it)
- BDRV_BLOCK_EOF is set if the last subregion sets it; assert that it's
not set for any other subregion
- BDRV_BLOCK_RAW is never set
- *map is only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
the offset of the first subregion then.
- *file is also only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
the *file of the subregions, which must be the same for all of them
(otherwise, we wouldn't have set BDRV_BLOCK_OFFSET_VALID).
- *pnum: The sum of all subregions
> +static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
> +                                                     uint32_t align,
> +                                                     bool want_zero,
> +                                                     int64_t offset,
> +                                                     int64_t bytes,
> +                                                     int64_t *pnum,
> +                                                     int64_t *map,
> +                                                     BlockDriverState **file)
> +{
> +    int ret;
> +
> +    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
> +    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map,
> +                               file);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    if (!*pnum) {
> +        assert(!bytes || ret & BDRV_BLOCK_EOF);
> +        return ret;
> +    }
> +    if (*pnum >= align) {
> +        if (!QEMU_IS_ALIGNED(*pnum, align)) {
> +            ret &= ~BDRV_BLOCK_EOF;
> +            *pnum = QEMU_ALIGN_DOWN(*pnum, align);
> +        }
> +        return ret;
> +    }
So we decide to just shorten the result if we can return at least some
*pnum > 0. This means that in most cases, this function will have to be
called twice: once for the aligned part, and again for the misaligned
rest.
We do save a little if the caller isn't actually interested in mapping
the whole image, but just in a specific range before the misaligned
part.
So it depends on the use case whether this is an optimisation or a
pessimisation.
> +    /* Merge in status for the rest of align */
> +    while (*pnum < align) {
Okay, in any case, I can see it's a simplification. :-)
> +        int ret2;
> +        int64_t pnum2;
> +        int64_t map2;
> +        BlockDriverState *file2;
> +
> +        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
> +                                    align - *pnum, &pnum2,
> +                                    map ? &map2 : NULL,
> +                                    file ? &file2 : NULL);
> +        if (ret2 < 0) {
> +            return ret2;
> +        }
> +        if (ret2 & BDRV_BLOCK_EOF) {
> +            ret |= BDRV_BLOCK_EOF;
> +            if (!pnum2) {
> +                pnum2 = align - *pnum;
> +                ret2 |= BDRV_BLOCK_ZERO;
> +            }
> +        } else {
> +            assert(pnum2);
> +        }
> +        if (ret2 & BDRV_BLOCK_DATA) {
> +            ret |= BDRV_BLOCK_DATA;
> +        }
> +        if (!(ret2 & BDRV_BLOCK_ZERO)) {
> +            ret &= ~BDRV_BLOCK_ZERO;
> +        }
> +        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
> +            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
> +             (map && *map + *pnum != map2) || (file && *file != file2))) {
> +            ret &= ~BDRV_BLOCK_OFFSET_VALID;
> +            if (map) {
> +                *map = 0;
> +            }
> +            if (file) {
> +                *file = NULL;
> +            }
> +        }
> +        if (ret2 & BDRV_BLOCK_ALLOCATED) {
> +            ret |= BDRV_BLOCK_ALLOCATED;
> +        }
Hm, suppose we have a region that consists of two subregions, where one
is unallocated and the other is a zero cluster. The result will be
DATA == false, ZERO == false, ALLOCATED == true. This looks a bit odd.
Did you check this case and come to the conclusion that it's okay? If
so, I think a comment would be good.
> +        if (ret2 & BDRV_BLOCK_RAW) {
> +            assert(ret & BDRV_BLOCK_RAW);
> +            assert(ret & BDRV_BLOCK_OFFSET_VALID);
> +        }
Doesn't RAW mean that we need to recurse rather than returning the flag?
Or actually, bdrv_co_block_status() shouldn't ever return the flag, so
can't we assert that it's clear?
> +        *pnum += pnum2;
> +    }
> +    return ret;
> +}
Kevin