[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter drive
From: |
Vladimir Sementsov-Ogievskiy |
Subject: |
Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver |
Date: |
Fri, 14 Jun 2019 09:04:22 +0000 |
13.06.2019 18:57, Max Reitz wrote:
> On 29.05.19 17:46, Vladimir Sementsov-Ogievskiy wrote:
>> Backup-top filter does copy-before-write operation. It should be
>> inserted above active disk and has a target node for CBW, like the
>> following:
>>
>> +-------+
>> | Guest |
>> +-------+
>> |r,w
>> v
>> +------------+ target +---------------+
>> | backup_top |---------->| target(qcow2) |
>> +------------+ CBW +---------------+
>> |
>> backing |r,w
>> v
>> +-------------+
>> | Active disk |
>> +-------------+
>>
>> The driver will be used in backup instead of write-notifiers.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>> ---
>> block/backup-top.h | 64 +++++++++
>> block/backup-top.c | 322 ++++++++++++++++++++++++++++++++++++++++++++
>> block/Makefile.objs | 2 +
>> 3 files changed, 388 insertions(+)
>> create mode 100644 block/backup-top.h
>> create mode 100644 block/backup-top.c
>>
>> diff --git a/block/backup-top.h b/block/backup-top.h
>> new file mode 100644
>> index 0000000000..788e18c358
>> --- /dev/null
>> +++ b/block/backup-top.h
>> @@ -0,0 +1,64 @@
>> +/*
>> + * backup-top filter driver
>> + *
>> + * The driver performs Copy-Before-Write (CBW) operation: it is injected
>> above
>> + * some node, and before each write it copies _old_ data to the target node.
>> + *
>> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
>> + *
>> + * Author:
>> + * Sementsov-Ogievskiy Vladimir <address@hidden>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#ifndef BACKUP_TOP_H
>> +#define BACKUP_TOP_H
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "block/block_int.h"
>> +
>> +typedef void (*BackupTopProgressCallback)(uint64_t done, void *opaque);
>> +typedef struct BDRVBackupTopState {
>> + HBitmap *copy_bitmap; /* what should be copied to @target on guest
>> write. */
>> + BdrvChild *target;
>> +
>> + BackupTopProgressCallback progress_cb;
>> + void *progress_opaque;
>> +} BDRVBackupTopState;
>> +
>> +/*
>> + * bdrv_backup_top_append
>> + *
>> + * Append backup_top filter node above @source node. @target node will
>> receive
>> + * the data backed up during CBE operations. New filter together with
>> @target
>> + * node are attached to @source aio context.
>> + *
>> + * The resulting filter node is implicit.
>
> Why? It’s just as easy for the caller to just make it implicit if it
> should be. (And usually the caller should decide that.)
Personally, I don't know what are the reasons for filters to bi implicit or not,
I just made it like other job-filters.. I can move making-implicit to the caller
or drop it at all (if it will work).
>
>> + *
>> + * @copy_bitmap selects regions which needs CBW. Furthermore, backup_top
>> will
>> + * use exactly this bitmap, so it may be used to control backup_top behavior
>> + * dynamically. Caller should not release @copy_bitmap during life-time of
>> + * backup_top. Progress is tracked by calling @progress_cb function.
>> + */
>> +BlockDriverState *bdrv_backup_top_append(
>> + BlockDriverState *source, BlockDriverState *target,
>> + HBitmap *copy_bitmap, Error **errp);
>> +void bdrv_backup_top_set_progress_callback(
>> + BlockDriverState *bs, BackupTopProgressCallback progress_cb,
>> + void *progress_opaque);
>> +void bdrv_backup_top_drop(BlockDriverState *bs);
>> +
>> +#endif /* BACKUP_TOP_H */
>> diff --git a/block/backup-top.c b/block/backup-top.c
>> new file mode 100644
>> index 0000000000..1daa02f539
>> --- /dev/null
>> +++ b/block/backup-top.c
>> @@ -0,0 +1,322 @@
>> +/*
>> + * backup-top filter driver
>> + *
>> + * The driver performs Copy-Before-Write (CBW) operation: it is injected
>> above
>> + * some node, and before each write it copies _old_ data to the target node.
>> + *
>> + * Copyright (c) 2018 Virtuozzo International GmbH. All rights reserved.
>> + *
>> + * Author:
>> + * Sementsov-Ogievskiy Vladimir <address@hidden>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "qemu/cutils.h"
>> +#include "qapi/error.h"
>> +#include "block/block_int.h"
>> +#include "block/qdict.h"
>> +
>> +#include "block/backup-top.h"
>> +
>> +static coroutine_fn int backup_top_co_preadv(
>> + BlockDriverState *bs, uint64_t offset, uint64_t bytes,
>> + QEMUIOVector *qiov, int flags)
>> +{
>> + /*
>> + * Features to be implemented:
>> + * F1. COR. save read data to fleecing target for fast access
>> + * (to reduce reads). This possibly may be done with use of
>> copy-on-read
>> + * filter, but we need an ability to make COR requests optional: for
>> + * example, if target is a ram-cache, and if it is full now, we
>> should
>> + * skip doing COR request, as it is actually not necessary.
>> + *
>> + * F2. Feature for guest: read from fleecing target if data is in
>> ram-cache
>> + * and is unchanged
>> + */
>> +
>> + return bdrv_co_preadv(bs->backing, offset, bytes, qiov, flags);
>> +}
>> +
>> +static coroutine_fn int backup_top_cbw(BlockDriverState *bs, uint64_t
>> offset,
>> + uint64_t bytes)
>> +{
>> + int ret = 0;
>> + BDRVBackupTopState *s = bs->opaque;
>> + uint64_t gran = 1UL << hbitmap_granularity(s->copy_bitmap);
>> + uint64_t end = QEMU_ALIGN_UP(offset + bytes, gran);
>> + uint64_t off = QEMU_ALIGN_DOWN(offset, gran), len;
>
> The “, len” is a bit weirdly placed there, why not define it on a
> separate line as "uint64_t len = end - off"?
>
>> + void *buf = qemu_blockalign(bs, end - off);
>> +
>> + /*
>> + * Features to be implemented:
>> + * F3. parallelize copying loop
>> + * F4. detect zeroes ? or, otherwise, drop detect zeroes from backup
>> code
>> + * and just enable zeroes detecting on target
>> + * F5. use block_status ?
>> + * F6. don't copy clusters which are already cached by COR [see F1]
>> + * F7. if target is ram-cache and it is full, there should be a
>> possibility
>> + * to drop not necessary data (cached by COR [see F1]) to handle CBW
>> + * fast.
>
> Also “drop BDRV_REQ_SERIALISING from writes to s->target unless necessary”?
hmm, yes. It may be a filter parameter, serialize writes or not.
>
>> + */
>> +
>> + len = end - off;
>> + while (hbitmap_next_dirty_area(s->copy_bitmap, &off, &len)) {
>> + hbitmap_reset(s->copy_bitmap, off, len);
>> +
>> + ret = bdrv_co_pread(bs->backing, off, len, buf,
>> + BDRV_REQ_NO_SERIALISING);
>> + if (ret < 0) {
>> + hbitmap_set(s->copy_bitmap, off, len);
>> + goto out;
>> + }
>> +
>> + ret = bdrv_co_pwrite(s->target, off, len, buf,
>> BDRV_REQ_SERIALISING);
>> + if (ret < 0) {
>> + hbitmap_set(s->copy_bitmap, off, len);
>> + goto out;
>> + }
>> +
>> + if (s->progress_cb) {
>> + s->progress_cb(len, s->progress_opaque);
>> + }
>> + off += len;
>> + if (off >= end) {
>> + break;
>> + }
>> + len = end - off;
>> + }
>> +
>> +out:
>> + qemu_vfree(buf);
>> +
>> + /*
>> + * F8. we fail guest request in case of error. We can alter it by
>> + * possibility to fail copying process instead, or retry several times,
>> or
>> + * may be guest pause, etc.
>> + */
>> + return ret;
>> +}
>> +
>> +static int coroutine_fn backup_top_co_pdiscard(BlockDriverState *bs,
>> + int64_t offset, int bytes)
>> +{
>> + int ret = backup_top_cbw(bs, offset, bytes);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + /*
>> + * Features to be implemented:
>> + * F9. possibility of lazy discard: just defer the discard after
>> fleecing
>> + * completion. If write (or new discard) occurs to the same area,
>> just
>> + * drop deferred discard.
>> + */
>> +
>> + return bdrv_co_pdiscard(bs->backing, offset, bytes);
>> +}
>> +
>> +static int coroutine_fn backup_top_co_pwrite_zeroes(BlockDriverState *bs,
>> + int64_t offset, int bytes, BdrvRequestFlags flags)
>> +{
>> + int ret = backup_top_cbw(bs, offset, bytes);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + return bdrv_co_pwrite_zeroes(bs->backing, offset, bytes, flags);
>> +}
>> +
>> +static coroutine_fn int backup_top_co_pwritev(BlockDriverState *bs,
>> + uint64_t offset,
>> + uint64_t bytes,
>> + QEMUIOVector *qiov, int flags)
>> +{
>> + if (!(flags & BDRV_REQ_WRITE_UNCHANGED)) {
>> + int ret = backup_top_cbw(bs, offset, bytes);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> + }
>> +
>> + return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
>> +}
>> +
>> +static int coroutine_fn backup_top_co_flush(BlockDriverState *bs)
>> +{
>> + if (!bs->backing) {
>> + return 0;
>> + }
>> +
>> + return bdrv_co_flush(bs->backing->bs);
>
> Should we flush the target, too?
Hm, you've asked it already, on previous version :) Backup don't do it,
so I just keep old behavior. And what is the reason to flush backup target
on any guest flush?
>
>> +}
>> +
>> +static void backup_top_refresh_filename(BlockDriverState *bs)
>> +{
>> + if (bs->backing == NULL) {
>> + /*
>> + * we can be here after failed bdrv_attach_child in
>> + * bdrv_set_backing_hd
>> + */
>> + return;
>> + }
>> + bdrv_refresh_filename(bs->backing->bs);
>
> bdrv_refresh_filename() should have done this already.
>
>> + pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
>> + bs->backing->bs->filename);
>> +}
>> +
>> +static void backup_top_child_perm(BlockDriverState *bs, BdrvChild *c,
>> + const BdrvChildRole *role,
>> + BlockReopenQueue *reopen_queue,
>> + uint64_t perm, uint64_t shared,
>> + uint64_t *nperm, uint64_t *nshared)
>> +{
>> + /*
>> + * We have HBitmap in the state, its size is fixed, so we never allow
>> + * resize.
>> + */
>> + uint64_t rw = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
>> + BLK_PERM_WRITE;
>> +
>> + bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
>> + nperm, nshared);
>> +
>> + *nperm = *nperm & rw;
>> + *nshared = *nshared & rw;
>
> Somehow, that bdrv_filter_default_perm() call doesn’t make this function
> easier for me.
>
> It seems like this is basically just “*nperm = perm & rw” and
> “*nshared = shared & rw”.
>
>> +
>> + if (role == &child_file) {
>> + /*
>> + * Target child
>> + *
>> + * Share write to target (child_file), to not interfere
>> + * with guest writes to its disk which may be in target backing
>> chain.
>> + */
>> + if (perm & BLK_PERM_WRITE) {
>> + *nshared = *nshared | BLK_PERM_WRITE;
>
> Why not always share WRITE on the target?
Hmm, it's a bad thing to share writes on target, so I'm trying to reduce number
of
cases when we have share it.
>
>> + }
>> + } else {
>> + /* Source child */
>> + if (perm & BLK_PERM_WRITE) {
>
> Or WRITE_UNCHANGED, no?
Why? We don't need doing CBW for unchanged write.
>
>> + *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
>> + }
>> + *nshared =
>> + *nshared & (BLK_PERM_CONSISTENT_READ |
>> BLK_PERM_WRITE_UNCHANGED);
>
> And here it isn’t “*nshared = shared & rw” but “rw & ~WRITE”.
>
> I feel like this function would be simpler if you just set *nperm and
> *nshared separately in the two branches of this condition, without
> setting a “default” first.
Ok, I'll try
>
>> + }
>> +}
>> +
>> +BlockDriver bdrv_backup_top_filter = {
>> + .format_name = "backup-top",
>> + .instance_size = sizeof(BDRVBackupTopState),
>> +
>> + .bdrv_co_preadv = backup_top_co_preadv,
>> + .bdrv_co_pwritev = backup_top_co_pwritev,
>> + .bdrv_co_pwrite_zeroes = backup_top_co_pwrite_zeroes,
>> + .bdrv_co_pdiscard = backup_top_co_pdiscard,
>> + .bdrv_co_flush = backup_top_co_flush,
>> +
>> + .bdrv_co_block_status = bdrv_co_block_status_from_backing,
>> +
>> + .bdrv_refresh_filename = backup_top_refresh_filename,
>> +
>> + .bdrv_child_perm = backup_top_child_perm,
>> +
>> + .is_filter = true,
>> +};
>> +
>> +BlockDriverState *bdrv_backup_top_append(BlockDriverState *source,
>> + BlockDriverState *target,
>> + HBitmap *copy_bitmap,
>> + Error **errp)
>> +{
>> + Error *local_err = NULL;
>> + BDRVBackupTopState *state;
>> + BlockDriverState *top = bdrv_new_open_driver(&bdrv_backup_top_filter,
>> + NULL, BDRV_O_RDWR, errp);
>> +
>> + if (!top) {
>> + return NULL;
>> + }
>> +
>> + top->implicit = true;
>
> As I said above, I think the caller should set this.
>
>> + top->total_sectors = source->total_sectors;
>> + top->bl.opt_mem_alignment = MAX(bdrv_opt_mem_align(source),
>> + bdrv_opt_mem_align(target));
>> + top->opaque = state = g_new0(BDRVBackupTopState, 1);
>> + state->copy_bitmap = copy_bitmap;
>> +
>> + bdrv_ref(target);
>> + state->target = bdrv_attach_child(top, target, "target", &child_file,
>> errp);
>> + if (!state->target) {
>> + bdrv_unref(target);
>> + bdrv_unref(top);
>> + return NULL;
>> + }
>> +
>> + bdrv_set_aio_context(top, bdrv_get_aio_context(source));
>> + bdrv_set_aio_context(target, bdrv_get_aio_context(source));
>
> I suppose these calls aren’t necessary anymore (compare d0ee0204f4009).
Hmm, see it, agree.
> (I’m not fully sure, though. In any case, they would need to be
> bdrv_try_set_aio_context() if they were still necessary.)
>
>> +
>> + bdrv_drained_begin(source);
>> +
>> + bdrv_ref(top);
>> + bdrv_append(top, source, &local_err);
>> + if (local_err) {
>> + error_prepend(&local_err, "Cannot append backup-top filter: ");
>> + }
>> +
>> + bdrv_drained_end(source);
>> +
>> + if (local_err) {
>> + bdrv_unref_child(top, state->target);
>> + bdrv_unref(top);
>> + error_propagate(errp, local_err);
>> + return NULL;
>> + }
>> +
>> + return top;
>> +}
>
> I guess it would be nice if it users could blockdev-add a backup-top
> node to basically get a backup with sync=none.
>
> (The bdrv_open implementation should then create a new bitmap and
> initialize it to be fully set.)
>
> But maybe it wouldn’t be so nice and I just have feverish dreams.
When series begun, I was trying to make exactly this - user-available filter,
which may be used in separate, but you was against)
Maybe, not totally against, but I decided not to argue long, and instead make
filter implicit and drop public api (like mirror and commit filters), but place
it
in a separate file (no one will argue against putting large enough and complete
new feature, represented by object into a separate file :). And this actually
makes it easy to publish this filter at some moment. And now I think it was
good decision anyway, as we postponed arguing around API and around shared dirty
bitmaps.
So publishing the filter is future step.
>
>> +void bdrv_backup_top_set_progress_callback(
>> + BlockDriverState *bs, BackupTopProgressCallback progress_cb,
>> + void *progress_opaque)
>> +{
>> + BDRVBackupTopState *s = bs->opaque;
>> +
>> + s->progress_cb = progress_cb;
>> + s->progress_opaque = progress_opaque;
>> +}
>> +
>> +void bdrv_backup_top_drop(BlockDriverState *bs)
>> +{
>> + BDRVBackupTopState *s = bs->opaque;
>> + AioContext *aio_context = bdrv_get_aio_context(bs);
>> +
>> + aio_context_acquire(aio_context);
>> +
>> + bdrv_drained_begin(bs);
>> +
>> + bdrv_child_try_set_perm(bs->backing, 0, BLK_PERM_ALL, &error_abort);
>
> Pre-existing in other jobs, but I think calling this function is
> dangerous. (Which is why I remove it in my “block: Ignore loosening
> perm restrictions failures” series.)
Hmm, good thing.. Should I rebase on it?
>
>> + bdrv_replace_node(bs, backing_bs(bs), &error_abort);
>> + bdrv_set_backing_hd(bs, NULL, &error_abort);
>
> I think some of this function should be in a .bdrv_close()
> implementation, for example this bdrv_set_backing_hd() call.
Why? We don't have .bdrv_open, so why to have .bdrv_close? I think, when
we publish this filter most of _add() and _drop() will be refactored to
open() and close(). Is there a real reason to implement .close() now?
>
>> + bdrv_drained_end(bs);
>> +
>> + if (s->target) {
>> + bdrv_unref_child(bs, s->target);
>
> And this. Well, neither needs to be in a .bdrv_close() implementation,
> actually, because bdrv_close() will just do it by itself.
>
> I suppose you’re trying to remove the children so the node is no longer
> usable after this point; but that isn’t quite right, then. The filter
> functions still refer to s->target and bs->backing, so you need to stop
> them from doing anything then.
>
> At this point, you might as well add a variable to BDRVBackupTopState
> that says whether the filter is still supposed to be usable (like the
> “stop” variable I added to mirror in “block/mirror: Fix child
> permissions”). If so, the permission function should signal 0/ALL for
> the permissions on backing (and probably target), and all filter
> functions always immediately return an error.
>
> So I don’t think backing and target need to be removed here. We can
> wait for that until bdrv_close(), but we should ensure that the filter
> really is unusable after this function has been called.
Ok, thank you for reviewing!
>
> Max
>
>> + }
>> + bdrv_unref(bs);
>> +
>> + aio_context_release(aio_context);
>> +}
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index ae11605c9f..dfbdfe6ab4 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -40,6 +40,8 @@ block-obj-y += throttle.o copy-on-read.o
>>
>> block-obj-y += crypto.o
>>
>> +block-obj-y += backup-top.o
>> +
>> common-obj-y += stream.o
>>
>> nfs.o-libs := $(LIBNFS_LIBS)
>>
>
>
--
Best regards,
Vladimir
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Max Reitz, 2019/06/13
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver,
Vladimir Sementsov-Ogievskiy <=
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Max Reitz, 2019/06/14
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Vladimir Sementsov-Ogievskiy, 2019/06/14
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Max Reitz, 2019/06/14
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Vladimir Sementsov-Ogievskiy, 2019/06/17
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Max Reitz, 2019/06/17
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Kevin Wolf, 2019/06/17
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Max Reitz, 2019/06/17
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Kevin Wolf, 2019/06/17
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Vladimir Sementsov-Ogievskiy, 2019/06/18
- Re: [Qemu-block] [PATCH v8 4/7] block: introduce backup-top filter driver, Kevin Wolf, 2019/06/18