qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Dropping posix_fallocate for -o preallocation falloc


From: Kevin Wolf
Subject: Re: Dropping posix_fallocate for -o preallocation falloc
Date: Tue, 25 Aug 2020 13:27:40 +0200

Am 23.08.2020 um 16:46 hat Nir Soffer geschrieben:
> Using -o preallocation falloc works great on NFS 4.2 and local file system,
> when fallocate() is supported, but when it is not, posix_fallocate falls back
> to very inefficient way:
> https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
> 
> This will read the last byte for every 4k block, and if the byte is
> null, write one null byte.
> 
> This minimizes the amount of data sent over the wire, but is very slow.
> 
> In file-posix we optimize this flow by not truncating the file to the
> final size, so this will only write one null byte for every 4k block,
> but this is still very slow.
> 
> Except the poor performance, we have a bug showing that for some reason,
> this does not work well with OFD locking:
> https://bugzilla.redhat.com/1851097
> 
> In oVirt 4.4.2 we avoid the issue by not using -o preallocation
> falloc. Instead we use our own fallocate helper:
> https://github.com/oVirt/vdsm/blob/master/helpers/fallocate
> 
> (We got feedback that the name of this helper is confusing since it does
> destructive operation when fallocate() is not supported. We will
> change the name)
> 
> This helper is similar to posix_fallocate, but instead of falling back
> to writing one byte per 4k block, it falls back to writing zeros in
> large blocks.
> 
> Testing shows that this improves fallocation time by 385% for one disk, and
> 468% for 10 concurrent disk preallocation:
> https://bugzilla.redhat.com/1850267#c25
> 
> I think the next step is to move this change into qemu, so all users can
> benefit from this change.
> 
> I think the way to do this is to replace posix_fallocate() with fallocate(),
> and fallback to "full" preallocation if fallocate is not supported.

Yes, this makes sense to me.

However, what full preallocation uses, is not exactly a large buffer
either (64k). Especially with your desire to use this with O_DIRECT,
we'd better use something like a 2 MB buffer.

Actually, looking at the code, it's completely broken for O_DIRECT
because it g_malloc()s the (then misaligned) buffer. This needs to
become qemu_try_blockalign0(). It's currently failing:

$ ./qemu-img create -f raw ~/tmp/test.raw 64M
Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=67108864
$ ./qemu-img resize --image-opts --preallocation=full 
driver=file,filename=/home/kwolf/tmp/test.raw,cache.direct=on 72M
qemu-img: Could not write zeros for preallocation: Invalid argument

> However with current code, in qemu-img create, we don't have a way to
> force O_DIRECT for the preallocation, and in qemu-img convert the
> preallocation step does not respect the -t none flag. Not using
> O_DIRECT in oVirt is very bad, and likely to cause timeouts in sanlock
> when the kernel flushes the page cache.

If we're talking about regular files, the easy way without changing QEMU
would be to use the same way as in my example above: Create a 0 byte
image file and then resize it with full preallocation.

But I can see that passing a cache mode to image creation is desirable.
So are probably some other options for image creation processes where
the image is actually opened with its block driver.

Possible options I see in the order from an ad-hoc hack to a fully
generic interface:

1. Just add a cache mode create option to file-posix. qemu-img create
   would mostly ignore this option for qcow2 (it would use O_DIRECT for
   creating the raw image file, but not for opening when writing the
   qcow2 data, including preallocation)

   QMP blockde-create would allow you to get the desired behaviour
   because you open the raw layer explicitly there. It just wouldn't be
   available in qemu-img create.

2. Add a cache mode/flags parameter to .bdrv_co_create_opts (the legacy
   callback used for qemu-img create, but not QMP blockdev-create).
   Format drivers would then use that for their bdrv_open() call on the
   protocol layer; file-posix would directly use it for opening the
   file.

3. Add a full BlockdevOptions parameter to .bdrv_co_create_opts() that
   is used for the bdrv_open() of the protocol layer. I guess this would
   bring in features that are currently not accessible from qemu-img
   create because they can't be encoded in a URL but require options
   (such as multiple servers for rbd or gluster).

4. Have some kind of full blockdev-create support in qemu-img create.
   This wouldn't require any changes to the block drivers, but finding a
   good syntax for qemu-img create might not be trivial.

Given that 1-3 are all hacks to the legacy interface, maybe 4 is what we
should be doing to expose the full set of QEMU features in qemu-img
create, even though it poses interface questions that we wouldn't have
to answer with the other options.

> So needed changes are:
> 
> 1. Add a way to control cache in qemu-img create (-t none? -o cache=none?)
> 2. Respect -t none in qemu-img convert -o preallocation falloc

-o cache=none would be what results from my option 1. above.

-t none is what 2. might look like.

For 3. we might get something like --protocol-options cache.direct=on

And finally 4. is not so easy. Maybe something like this:

    --create file,filename=foo.qcow2,open.node-name=img0,open.cache.direct=on
    --create qcow2,file=img0,preallocation=full,size=1G

Actually, on second thoughts, 4. doesn't really work for raw images
without further changes because the preallocation isn't done in a format
driver, but directly during the creation of the protocol driver. :-/

> 3. Replace posix_falloate to fallocate
>     
> https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1868
> 4. Fall back to full zeroing if fallocate is not supported
>     
> https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1891
> 5. Probably use larger zero buffer, 64k is not efficient
>     
> https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1907

I agree with these. Maybe it's an easy first step that can be done while
we're still figuring out how to get O_DIRECT used.

Kevin




reply via email to

[Prev in Thread] Current Thread [Next in Thread]