qemu-block

Dropping posix_fallocate for -o preallocation falloc


From: Nir Soffer
Subject: Dropping posix_fallocate for -o preallocation falloc
Date: Sun, 23 Aug 2020 17:46:49 +0300

Using -o preallocation falloc works great on NFS 4.2 and on local file
systems, where fallocate() is supported. When it is not supported,
posix_fallocate() falls back to a very inefficient method:
https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96

The fallback reads the last byte of every 4k block and, if the byte is
null, writes one null byte.

This minimizes the amount of data sent over the wire, but is very slow.
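
To make the cost concrete, here is a rough Python sketch of the behavior
described above. It is not the glibc code; the function name and the fixed
4096-byte block size are assumptions made for illustration. It counts how
many one-byte reads and writes the approach needs.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size; the text above speaks of 4k blocks


def emulate_fallback(fd, offset, length, block_size=BLOCK_SIZE):
    """Touch the last byte of every block in [offset, offset + length).

    Hypothetical helper that mimics the described fallback.
    Returns the number of (reads, writes) issued.
    """
    reads = writes = 0
    end = offset + length
    pos = offset
    while pos < end:
        # Last byte of the current block, clamped to the end of the range.
        last = min(pos + block_size, end) - 1
        data = os.pread(fd, 1, last)
        reads += 1
        # Past EOF pread returns b""; a hole reads back as a null byte.
        # Non-null data is left alone, so existing content is preserved.
        if data in (b"", b"\0"):
            os.pwrite(fd, b"\0", last)
            writes += 1
        pos += block_size
    return reads, writes


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        size = 1024 * 1024
        reads, writes = emulate_fallback(f.fileno(), 0, size)
        print(f"{reads} reads and {writes} writes for {size} bytes")
```

Each block costs up to two tiny I/O requests, so the number of round trips
grows with the file size divided by 4k, which is what makes this slow on
remote storage.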

In file-posix we optimize this flow by not truncating the file to the
final size, so this will only write one null byte for every 4k block, but
this is still very slow.

Besides the poor performance, we have a bug showing that, for some reason,
this fallback does not work well with OFD locking:
https://bugzilla.redhat.com/1851097

In oVirt 4.4.2 we avoid the issue by not using -o preallocation falloc.
Instead we use our own fallocate helper:
https://github.com/oVirt/vdsm/blob/master/helpers/fallocate

(We got feedback that the name of this helper is confusing, since it
performs a destructive operation when fallocate() is not supported. We
will change the name.)

This helper is similar to posix_fallocate(), but instead of falling back
to writing one byte per 4k block, it falls back to writing zeros in large
blocks.
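
A minimal sketch of the "zeros in large blocks" idea follows. It is not the
vdsm helper itself; the function name and the 1 MiB default block size are
assumptions (the text above only says "large blocks"). It returns the
number of write calls so the effect of the block size is easy to see.

```python
import os
import tempfile


def write_zeros(fd, size, block_size=1024 * 1024):
    """Fill [0, size) with zeros using large writes.

    Hypothetical helper. Destructive: any existing data in the range is
    overwritten. Returns the number of write calls issued.
    """
    buf = memoryview(bytes(block_size))  # one reusable zero buffer
    writes = 0
    pos = 0
    while pos < size:
        chunk = min(block_size, size - pos)
        # pwrite may write less than requested; advance by what was written.
        pos += os.pwrite(fd, buf[:chunk], pos)
        writes += 1
    os.fsync(fd)  # make sure the zeros reach storage before reporting success
    return writes


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        size = 64 * 1024 * 1024
        print("1 MiB blocks:", write_zeros(f.fileno(), size), "writes")
```

With 1 MiB blocks a 64 MiB file takes 64 writes, where touching one byte
per 4k block would take 16384. The trade-off is that all the zeros are sent
to storage, and existing data is overwritten rather than preserved.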

Testing shows that this improves preallocation time by 385% for one disk,
and by 468% for 10 concurrent disk preallocations:
https://bugzilla.redhat.com/1850267#c25

I think the next step is to move this change into qemu, so all users can
benefit from it.

I think the way to do this is to replace posix_fallocate() with fallocate(),
and fall back to "full" preallocation if fallocate() is not supported.
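
The control flow could look roughly like the Python sketch below. This is
only an illustration of the idea, not qemu code: the function names are
made up, fallocate(2) is reached through ctypes on the assumption of Linux
with glibc and a 64-bit off_t, and the choice of EOPNOTSUPP and ENOSYS as
the "not supported" errors is also an assumption.

```python
import ctypes
import errno
import os


def linux_fallocate(fd, offset, length):
    """Call fallocate(2) with mode 0 via ctypes (Linux/glibc assumed)."""
    libc = ctypes.CDLL(None, use_errno=True)
    # off_t is assumed to match C long, which holds on 64-bit Linux.
    libc.fallocate.argtypes = [
        ctypes.c_int, ctypes.c_int, ctypes.c_long, ctypes.c_long]
    libc.fallocate.restype = ctypes.c_int
    if libc.fallocate(fd, 0, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def write_zeros(fd, offset, length, block_size=1024 * 1024):
    """'Full' preallocation: write real zeros over the whole range."""
    buf = memoryview(bytes(block_size))
    end = offset + length
    while offset < end:
        chunk = min(block_size, end - offset)
        offset += os.pwrite(fd, buf[:chunk], offset)


def preallocate(fd, offset, length, fallocate=linux_fallocate):
    """Try fallocate(); fall back to writing zeros only if unsupported.

    Returns "falloc" or "full" to report which path was taken. Other
    errors (for example ENOSPC) are real failures and are re-raised.
    """
    try:
        fallocate(fd, offset, length)
        return "falloc"
    except OSError as e:
        if e.errno not in (errno.EOPNOTSUPP, errno.ENOSYS):
            raise
    write_zeros(fd, offset, length)
    return "full"
```

For a new, empty file, calling preallocate(fd, 0, size) would return
"falloc" on a file system that supports fallocate() and "full" otherwise.
The fallocate parameter exists only so the fallback path can be exercised
without an NFS 3 mount.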

However, with the current code we don't have a way to force O_DIRECT for
the preallocation in qemu-img create, and in qemu-img convert the
preallocation step does not respect the -t none flag. Not using O_DIRECT
in oVirt is very bad, and is likely to cause timeouts in sanlock when the
kernel flushes the page cache.

So the needed changes are:

1. Add a way to control cache in qemu-img create (-t none? -o cache=none?)
2. Respect -t none in qemu-img convert -o preallocation falloc
3. Replace posix_fallocate() with fallocate():
   https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1868
4. Fall back to full zeroing if fallocate() is not supported:
   https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1891
5. Probably use a larger zero buffer; 64k is not efficient:
   https://github.com/qemu/qemu/blob/152be6de9100e58b5d896272e951d4c910bd735a/block/file-posix.c#L1907

What do you think?

Nir



