Re: [PATCH v4 00/10] Optimize buffer_is_zero
From: Alexander Monakov
Subject: Re: [PATCH v4 00/10] Optimize buffer_is_zero
Date: Fri, 16 Feb 2024 23:20:09 +0300 (MSK)
On Thu, 15 Feb 2024, Richard Henderson wrote:
> On 2/15/24 13:37, Alexander Monakov wrote:
> > Ah, I guess you might be running at low perf_event_paranoid setting that
> > allows unprivileged sampling of kernel events? In our submissions the
> > percentage was for perf_event_paranoid=2, i.e. relative to Qemu only,
> > excluding kernel time under syscalls.
>
> Ok. Eliminating kernel samples makes things easier to see.
> But I still do not see a 40% reduction in runtime.
I suspect Mikhail's image was less sparse, so the impact from inlining
was greater.
> With this, I see virtually all of the runtime in libz.so.
> Therefore I converted this to raw first, to focus on the issue.
Ah, apologies for that. I built with --disable-default-features and
did not notice that my qemu-img lacked vmdk support and treated the file
as a raw image instead. I had assumed the image was similar to the one
Mikhail used, but because of the compression it clearly is not.
> For avoidance of doubt:
>
> $ ls -lsh test.raw && sha256sum test.raw
> 12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw
> 3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80 test.raw
>
> Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
> Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
>
> master:
> 57.48% qemu-img-m [.] buffer_zero_avx2
> 3.60% qemu-img-m [.] is_allocated_sectors.part.0
> 2.61% qemu-img-m [.] buffer_is_zero
> 63.69% -- total
>
> v3:
> 48.86% qemu-img-v3 [.] is_allocated_sectors.part.0
> 3.79% qemu-img-v3 [.] buffer_zero_avx2
> 52.65% -- total
> -17% -- reduction from master
>
> v4:
> 54.60% qemu-img-v4 [.] buffer_is_zero_ge256
> 3.30% qemu-img-v4 [.] buffer_zero_avx2
> 3.17% qemu-img-v4 [.] is_allocated_sectors.part.0
> 61.07% -- total
> -4% -- reduction from master
>
> v4+:
> 46.65% qemu-img [.] is_allocated_sectors.part.0
> 3.49% qemu-img [.] buffer_zero_avx2
> 0.05% qemu-img [.] buffer_is_zero_ge256
> 50.19% -- total
> -21% -- reduction from master
Any ideas where the -21% vs v3's -17% difference comes from?
FWIW, in situations like these I always recommend running perf with a fixed
sampling period, i.e. 'perf record -e cycles:P -c 100000' or 'perf record -e
cycles/period=100000/P', so that sample counts from runs of different
duration are directly comparable (displayed with 'perf report -n').
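To spell out the arithmetic: with a fixed period, every sample stands for the same number of cycles, so raw sample counts can be compared across binaries. A minimal sketch in C; the helper names are made up for illustration and are not part of perf or QEMU.

```c
#include <assert.h>
#include <stdint.h>

/*
 * With 'perf record -c PERIOD', one sample is taken every PERIOD events,
 * so the event count is approximately samples * PERIOD.
 * (Hypothetical helper, for illustration only.)
 */
static uint64_t samples_to_cycles(uint64_t samples, uint64_t period)
{
    assert(period > 0);
    return samples * period;
}

/*
 * Relative change of a symbol's cost between two runs recorded with the
 * same period, in percent. Negative values mean a reduction.
 */
static double relative_change_pct(uint64_t base_samples, uint64_t new_samples)
{
    assert(base_samples > 0);
    return 100.0 * ((double)new_samples - (double)base_samples)
                 / (double)base_samples;
}
```

For example, with made-up counts (not taken from any run in this thread), going from 1000 samples to 800 samples at the same period gives relative_change_pct(1000, 800) == -20.0, regardless of how long each run took overall.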
> The v4+ puts the 3 byte test back inline, like in your v3.
>
> Importantly, it must be as 3 short-circuiting tests, where my v4 "simplified"
> this to (s | m | e) != 0, on the assumption that the reduced number of
> branches would help.
Yes, we also noticed that when preparing our patch. We also tried mixed
variants like (s | e) != 0 || m != 0, but they did not turn out faster.
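For readers following along, the two shapes of the inline pre-check might look roughly like this. This is only a sketch, not the code from either patch: it assumes s, m and e stand for the first, middle and last byte of the buffer, and full_scan_is_zero() is a hypothetical stand-in for the out-of-line implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the out-of-line scan (a plain byte loop). */
static bool full_scan_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Three separate tests: returns as soon as one sampled byte is nonzero. */
static inline bool is_zero_short_circuit(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    if (len == 0) {
        return true;
    }
    if (p[0] != 0) {            /* s: first byte (assumed meaning) */
        return false;
    }
    if (p[len - 1] != 0) {      /* e: last byte (assumed meaning) */
        return false;
    }
    if (p[len / 2] != 0) {      /* m: middle byte (assumed meaning) */
        return false;
    }
    return full_scan_is_zero(p, len);
}

/* Combined test: loads all three bytes, then takes a single branch. */
static inline bool is_zero_combined(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    if (len == 0) {
        return true;
    }
    if ((p[0] | p[len / 2] | p[len - 1]) != 0) {
        return false;
    }
    return full_scan_is_zero(p, len);
}
```

Both functions return the same answer for every input. The difference is only in how the three sampled bytes are tested: as written, the first variant can return after looking at a single byte, while the second always loads all three before branching.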
> With that settled, I guess we need to talk about how much the out-of-line
> implementation matters at all. I'm thinking about writing a
> test/bench/bufferiszero, with all-zero buffers of various sizes and
> alignments. With that it would be easier to talk about whether any given
> implementation is an improvement for that final 4% not eliminated by the
> three bytes.
Yeah, initially I suggested this task to Mikhail as a practice exercise
outside of QEMU, and we had a benchmark that measures buffer_is_zero via
perf_event_open. This shows exactly how close the implementation
runs to the performance ceiling set by the maximum L1 fetch rate (two loads
per cycle on x86).
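A measurement of that kind can be sketched as below. This is not the benchmark mentioned above, only an illustration of the approach under stated assumptions: Linux, GCC or Clang (for the inline asm barrier), and a placeholder candidate_is_zero() that is a plain byte loop rather than any of the implementations from the patches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Placeholder function under test (hypothetical, not QEMU's code). */
static bool candidate_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Open a user-space-only cycle counter for the calling thread, or -1. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;          /* start stopped, enable around the loop */
    attr.exclude_kernel = 1;    /* count user-space cycles only */
    attr.exclude_hv = 1;
    /* glibc has no wrapper for perf_event_open, so use syscall(). */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Returns measured cycles per byte, or a negative value on failure. */
static double measure_cycles_per_byte(const void *buf, size_t len, int reps)
{
    volatile bool sink = false;
    uint64_t cycles = 0;
    ssize_t n;
    int fd;

    assert(len > 0 && reps > 0);
    fd = open_cycle_counter();
    if (fd < 0) {
        return -1.0;
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < reps; i++) {
        /* Barrier so the compiler cannot hoist the call out of the loop. */
        __asm__ volatile("" : : "r"(buf) : "memory");
        sink ^= candidate_is_zero(buf, len);
    }
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    n = read(fd, &cycles, sizeof(cycles));
    close(fd);
    if (n != (ssize_t)sizeof(cycles)) {
        return -1.0;
    }
    return (double)cycles / ((double)len * (double)reps);
}
```

Calling measure_cycles_per_byte() on an all-zero buffer small enough to stay in L1 gives a figure to hold against the ceiling. As a rough guide, if each load were 32 bytes wide (an assumption, it depends on the vector width in use), two loads per cycle would correspond to 64 bytes per cycle. A negative return value means the counter could not be opened or read, for example under a restrictive perf_event_paranoid setting.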
Alexander