[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block
From: |
Nir Soffer |
Subject: |
Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block |
Date: |
Thu, 22 Aug 2019 19:39:47 +0300 |
On Thu, Aug 22, 2019 at 5:28 PM Max Reitz <address@hidden> wrote:
> On 16.08.19 23:21, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fallback to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests. Allocating the first block avoids the fallback.
> >
> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> > $ ./qemu-img create -f raw test.raw 1g
> > Formatting 'test.raw', fmt=raw size=1073741824
> >
> > $ ls -lhs test.raw
> > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to new raw target image
> >
> > I installed Fedora 29 server on raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s) After(s) Diff(%)
> > -------------------------------
> > 356 389 +8.4
> >
> > I ran this only once, so we cannot tell much from these results.
>
> So you’d expect it to be fast but it was slower? Well, you only ran it
> once and it isn’t really a precise benchmark...
>
> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> > for i in $(seq 10); do
> > rm -f dst.raw
> > sleep 10
> > time ./qemu-img convert -f raw -O raw -t none -T none src.raw
> dst.raw
> > done
> >
> > Here is a table comparing the total time spent:
> >
> > Type Before(s) After(s) Diff(%)
> > ---------------------------------------
> > real 530.028 469.123 -11.4
> > user 17.204 10.768 -37.4
> > sys 17.881 7.011 -60.7
> >
> > Here we see very clear improvement in CPU usage.
> >
> > Signed-off-by: Nir Soffer <address@hidden>
> > ---
> > block/file-posix.c | 25 +++++++++++++++++++++++++
> > tests/qemu-iotests/150.out | 1 +
> > tests/qemu-iotests/160 | 4 ++++
> > tests/qemu-iotests/175 | 19 +++++++++++++------
> > tests/qemu-iotests/175.out | 8 ++++----
> > tests/qemu-iotests/221.out | 12 ++++++++----
> > tests/qemu-iotests/253.out | 12 ++++++++----
> > 7 files changed, 63 insertions(+), 18 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index b9c33c8f6c..3964dd2021 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1755,6 +1755,27 @@ static int handle_aiocb_discard(void *opaque)
> > return ret;
> > }
> >
> > +/*
> > + * Help alignment detection by allocating the first block.
> > + *
> > + * When reading with direct I/O from unallocated area on Gluster backed
> by XFS,
> > + * reading succeeds regardless of request length. In this case we
> fallback to
> > + * safe aligment which is not optimal. Allocating the first block
> avoids this
> > + * fallback.
> > + *
> > + * Returns: 0 on success, -errno on failure.
> > + */
> > +static int allocate_first_block(int fd)
> > +{
> > + ssize_t n;
> > +
> > + do {
> > + n = pwrite(fd, "\0", 1, 0);
>
> This breaks when fd has been opened with O_DIRECT.
>
It seems that we always open images without O_DIRECT when creating an image
in qemu-img create, or when creating a target image in qemu-img convert.
Here is a trace of qemu-img create:
$ strace -f -tt -o /tmp/create.trace ./qemu-img create -f raw -o
preallocation=falloc /tmp/gv1/src.raw 1g
Formatting '/tmp/gv1/src.raw', fmt=raw size=1073741824 preallocation=falloc
1. open #1
17686 18:58:23.681921 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.683753 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17686 18:58:23.683829 close(9) = 0
2. open #2
17686 18:58:23.684146 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 9
17686 18:58:23.684227 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17686 18:58:23.684256 close(9) = 0
3. open #3
17686 18:58:23.684339 openat(AT_FDCWD, "/tmp/gv1/src.raw",
O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 9
...
17688 18:58:23.690178 fstat(9, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
17688 18:58:23.690217 ftruncate(9, 0 <unfinished ...>
...
17688 18:58:23.700266 <... ftruncate resumed>) = 0
...
17688 18:58:23.700595 fstat(9, <unfinished ...>
...
17688 18:58:23.700619 <... fstat resumed>{st_mode=S_IFREG|0600, st_size=0,
...}) = 0
...
17688 18:58:23.700651 fallocate(9, 0, 0, 1073741824 <unfinished ...>
...
17688 18:58:23.728141 <... fallocate resumed>) = 0
...
17688 18:58:23.728196 pwrite64(9, "\0", 1, 0) = 1
...
17686 18:58:23.738391 close(9) = 0
Here is convert trace:
$ strace -f -tt -o /tmp/convert.trace ./qemu-img convert -f raw -O raw -t
none -T none /tmp/gv1/src.raw /tmp/gv1/dst.raw
1. open #1
18175 19:07:23.364417 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365282 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18175 19:07:23.365323 close(10) = 0
2. open #2
18175 19:07:23.365660 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.365717 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18175 19:07:23.365742 close(10) = 0
3. open #3
18175 19:07:23.365839 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 10
...
18177 19:07:23.372112 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18177 19:07:23.372138 ftruncate(10, 0) = 0
...
18177 19:07:23.375760 fstat(10, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
18177 19:07:23.375788 ftruncate(10, 1073741824) = 0
18177 19:07:23.376383 pwrite64(10, "\0", 1, 0) = 1
...
18175 19:07:23.390989 close(10) = 0
4. open #4
18175 19:07:23.391429 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 10
18175 19:07:23.392433 fstat(10, {st_mode=S_IFREG|0600, st_size=1073741824,
...}) = 0
18175 19:07:23.392483 close(10) = 0
5. open #5
18175 19:07:23.392743 openat(AT_FDCWD, "/tmp/gv1/dst.raw",
O_RDWR|O_DIRECT|O_CLOEXEC) = 10
...
18175 19:07:23.393731 ioctl(10, BLKSSZGET, 0x7ffe75ead334) = -1 ENOSYS
(Function not implemented)
18175 19:07:23.393784 pread64(10, 0x558796451000, 1, 0) = -1 EINVAL
(Invalid argument)
18175 19:07:23.395362 pread64(10,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512,
0) = 512
18175 19:07:23.395905 pread64(10,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
4096, 0) = 4096
...
(Which happens when you open some file with cache.direct=on, and then
> use e.g. QMP’s block_resize.)
>
What would be a command triggering this? I can add a test.
It isn’t that bad because eventually you simply ignore the error. But
> it still makes me wonder whether we shouldn’t write like the biggest
> power of two that does not exceed the new file length or MAX_BLOCKSIZE.
>
It makes sense if there is a way to cause qemu-img to use O_DIRECT when
creating an image.
> + } while (n == -1 && errno == EINTR);
> > +
> > + return (n == -1) ? -errno : 0;
> > +}
> > +
> > static int handle_aiocb_truncate(void *opaque)
> > {
> > RawPosixAIOData *aiocb = opaque;
> > @@ -1794,6 +1815,8 @@ static int handle_aiocb_truncate(void *opaque)
> > /* posix_fallocate() doesn't set errno. */
> > error_setg_errno(errp, -result,
> > "Could not preallocate new data");
> > + } else if (current_length == 0) {
> > + allocate_first_block(fd);
>
> Should posix_fallocate() not take care of precisely this?
>
Only if the filesystem does not support fallocate() (e.g. NFS < 4.2).
In this case posix_fallocate() is doing:
for (offset += (len - 1) % increment; len > 0; offset += increment)
{
len -= increment;
if (offset < st.st_size)
{
unsigned char c;
ssize_t rsize = __pread (fd, &c, 1, offset);
if (rsize < 0)
return errno;
/* If there is a non-zero byte, the block must have been
allocated already. */
else if (rsize == 1 && c != 0)
continue;
}
if (__pwrite (fd, "", 1, offset) != 1)
return errno;
}
https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
So opening a file with O_DIRECT will break preallocation=falloc on such
filesystems,
and writing one byte in allocate_first_block() is safe.
> > }
> > } else {
> > result = 0;
>
> [...]
>
> > diff --git a/tests/qemu-iotests/160 b/tests/qemu-iotests/160
> > index df89d3864b..ad2d054a47 100755
> > --- a/tests/qemu-iotests/160
> > +++ b/tests/qemu-iotests/160
> > @@ -57,6 +57,10 @@ for skip in $TEST_SKIP_BLOCKS; do
> > $QEMU_IMG dd if="$TEST_IMG" of="$TEST_IMG.out" skip="$skip" -O
> "$IMGFMT" \
> > 2> /dev/null
> > TEST_IMG="$TEST_IMG.out" _check_test_img
> > +
> > + # We always write the first byte of an image.
> > + printf "\0" > "$TEST_IMG.out.dd"
> > +
> > dd if="$TEST_IMG" of="$TEST_IMG.out.dd" skip="$skip" status=none
>
> Won’t this dd completely overwrite $TEST_IMG.out.dd (especially given
> the lack of conv=notrunc)?
>
There is an issue only if dd open the file with O_TRUNC. I will test this
again.
>
> > echo
> > diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
> > index 51e62c8276..c6a3a7bb1e 100755
> > --- a/tests/qemu-iotests/175
> > +++ b/tests/qemu-iotests/175
> > @@ -37,14 +37,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
> > # the file size. This function hides the resulting difference in the
> > # stat -c '%b' output.
> > # Parameter 1: Number of blocks an empty file occupies
> > -# Parameter 2: Image size in bytes
> > +# Parameter 2: Minimal number of blocks in an image
> > +# Parameter 3: Image size in bytes
> > _filter_blocks()
> > {
> > extra_blocks=$1
> > - img_size=$2
> > + min_blocks=$2
> > + img_size=$3
> >
> > - sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/"
> \
> > - -e "s/blocks=$((extra_blocks + img_size /
> 512))\\(\$\\|[^0-9]\\)/everything allocated/"
> > + sed -e "s/blocks=$((extra_blocks +
> min_blocks))\\(\$\\|[^0-9]\\)/min allocation/" \
>
> I don’t think adding extra_blocks to min_blocks makes sense. Just
> min_blocks alone should be what we want here.
>
We had failing tests (in vdsm) showing that filesystem may return more
blocs than
expected even for non-empty files, so this may be needed. I did not test
it yet with a filesystem that show this issue, but I found your instructions
how to create it.
Thanks for reviewing.
Nir
- [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Nir Soffer, 2019/08/16
- Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block, John Snow, 2019/08/16
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Nir Soffer, 2019/08/22
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Max Reitz, 2019/08/22
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block,
Nir Soffer <=
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Max Reitz, 2019/08/22
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Nir Soffer, 2019/08/22
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Max Reitz, 2019/08/23
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Nir Soffer, 2019/08/23
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Max Reitz, 2019/08/23
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Nir Soffer, 2019/08/23
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Max Reitz, 2019/08/23
- Re: [Qemu-devel] [PATCH] block: posix: Always allocate the first block, Nir Soffer, 2019/08/24
Re: [Qemu-devel] [Qemu-block] [PATCH] block: posix: Always allocate the first block, Maxim Levitsky, 2019/08/25