[PATCH v2 00/33] accel/tcg + target/arm: pc-relative translation
From: Richard Henderson
Subject: [PATCH v2 00/33] accel/tcg + target/arm: pc-relative translation
Date: Tue, 16 Aug 2022 15:33:27 -0500
Supersedes: 20220812180806.2128593-1-richard.henderson@linaro.org
("accel/tcg: minimize tlb lookups during translate + user-only PROT_EXEC fixes")
A few changes to the PROT_EXEC work that I posted last week, and
then continuing to the main event.
My initial goal was to reduce the overhead of TB flushing, which
Alex Bennee identified as a significant issue with respect to
booting AArch64 kernels under avocado. Our initial guess was that
we need a more efficient data structure for walking TBs associated
with a physical page.
While I was looking at some of those numbers, I noted that we were
seeing up to 16000 TBs attached to a single page, which is far more
than I expected to see, and means that a new data structure isn't
going to help as much as simply reducing the number of translations.
It turns out the retranslation is due to the guest kernel's userland
address space randomization. Each process gets e.g. libc mapped to
a different virtual address, which causes a new translation.
This series, then, introduces some infrastructure for writing
"pc-relative" translation blocks, in which the guest pc is treated as
a variable just like any other guest cpu register. The hashing for
these TBs is adjusted to compare the physical address. The target/arm
backend is adjusted to use the new feature.
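As a rough illustration of why this helps (a toy sketch, not the code
in this series): ToyTB, toy_tb_matches and toy_count_translations are
hypothetical names, and a real lookup also compares further state such
as the TB flags. The sketch counts how many TBs get created when the
same physical code is entered at several virtual addresses, first with
the virtual pc baked into each TB and then with pc-relative TBs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TBS 16

/* Hypothetical, much-simplified stand-in for a translation block. */
typedef struct ToyTB {
    uint64_t phys_pc;  /* guest physical address of the first insn */
    uint64_t virt_pc;  /* guest virtual address seen at translation time */
} ToyTB;

/* Does an existing TB satisfy a lookup for (virt_pc, phys_pc)? */
static bool toy_tb_matches(const ToyTB *tb, uint64_t virt_pc,
                           uint64_t phys_pc, bool pcrel)
{
    if (tb->phys_pc != phys_pc) {
        return false;
    }
    /*
     * An absolute TB has virt_pc baked into its generated code, so it
     * is only valid at that address; a pc-relative TB is not.
     */
    return pcrel || tb->virt_pc == virt_pc;
}

/*
 * Count the TBs created when the code at phys_pc is entered once at
 * each of the n virtual addresses in virt_pcs[].
 */
static size_t toy_count_translations(const uint64_t *virt_pcs, size_t n,
                                     uint64_t phys_pc, bool pcrel)
{
    ToyTB cache[TOY_MAX_TBS];
    size_t used = 0;

    assert(n <= TOY_MAX_TBS);
    for (size_t i = 0; i < n; i++) {
        bool hit = false;

        for (size_t j = 0; j < used && !hit; j++) {
            hit = toy_tb_matches(&cache[j], virt_pcs[i], phys_pc, pcrel);
        }
        if (!hit) {
            /* Miss: "translate" and remember the new TB. */
            cache[used].phys_pc = phys_pc;
            cache[used].virt_pc = virt_pcs[i];
            used++;
        }
    }
    return used;
}
```

With three processes mapping the same page at three different virtual
addresses, the absolute scheme creates three TBs and the pc-relative
scheme creates one.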
This does result in a significant reduction in translation. From the
BootLinuxAarch64.test_virt_tcg_gicv2 test, at the login prompt:
Before:
  gen code size       160684739/1073736704
  TB count            289808
  TB flush count      1
  TB invalidate count 235143
After:
  gen code size       277992547/1073736704
  TB count            503882
  TB flush count      0
  TB invalidate count 69282
Before TARGET_TB_PCREL, we generate approximately 1.1GB of TBs
(overflow 1GB, flush, and fill 153MB again). Afterward, we only
generate 265MB of TBs.
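For reference, the arithmetic behind those figures can be sketched as
below. The helper names are hypothetical, and the estimate assumes
that each flush follows exactly one complete fill of the code buffer,
which is only approximately true.

```c
#include <stdint.h>

/* Convert a byte count to whole MiB, rounding down. */
static uint64_t toy_bytes_to_mib(uint64_t bytes)
{
    return bytes >> 20;
}

/*
 * Estimate the total bytes of generated code: assume every flush
 * discarded one full buffer, then add what is in the buffer now.
 */
static uint64_t toy_total_generated(uint64_t buf_size, uint64_t flushes,
                                    uint64_t cur_size)
{
    return flushes * buf_size + cur_size;
}
```

Plugging in the "gen code size" and "TB flush count" values above
gives about 1177 MiB before (one flush plus 153 MiB) and 265 MiB
after (no flush).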
Surprisingly, this does not affect wall-clock times nearly as
much as I would have expected:
                                        before   after   change
BootLinuxAarch64.test_virt_tcg_gicv2:    97.35   85.11     -12%
BootLinuxAarch64.test_virt_tcg_gicv3:   102.75   96.87      -5%
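The change column is the usual relative difference. A minimal sketch,
with a hypothetical helper name: the exact values come out to roughly
-12.6% and -5.7%, so the figures in the table correspond to dropping
the fractional part.

```c
/* Relative change from 'before' to 'after', in percent. */
static double toy_pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```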
Change in profile, top 10 entries before, matched up with after:
  before                                                          after
   9.01%  qemu-system-aar  [.] helper_lookup_tb_ptr              10.67%
   4.92%  qemu-system-aar  [.] qht_lookup_custom                  5.06%
   4.79%  qemu-system-aar  [.] get_phys_addr_lpae                 5.24%
   2.57%  qemu-system-aar  [.] address_space_ldq_le               2.77%
   2.33%  qemu-system-aar  [.] liveness_pass_1                    0.60%
   2.24%  qemu-system-aar  [.] cpu_get_tb_cpu_state               2.58%
   1.76%  qemu-system-aar  [.] address_space_translate_internal   1.75%
   1.71%  qemu-system-aar  [.] tb_lookup_cmp                      1.92%
   1.65%  qemu-system-aar  [.] tcg_gen_code                       0.44%
   1.64%  qemu-system-aar  [.] do_tb_phys_invalidate              0.09%
r~
Ilya Leoshkevich (1):
accel/tcg: Introduce is_same_page()
Richard Henderson (32):
linux-user/arm: Mark the commpage executable
linux-user/hppa: Allocate page zero as a commpage
linux-user/x86_64: Allocate vsyscall page as a commpage
linux-user: Honor PT_GNU_STACK
tests/tcg/i386: Move smc_code2 to an executable section
accel/tcg: Remove PageDesc code_bitmap
accel/tcg: Use bool for page_find_alloc
accel/tcg: Make tb_htable_lookup static
accel/tcg: Move qemu_ram_addr_from_host_nofail to physmem.c
accel/tcg: Properly implement get_page_addr_code for user-only
accel/tcg: Use probe_access_internal for softmmu get_page_addr_code_hostp
accel/tcg: Add nofault parameter to get_page_addr_code_hostp
accel/tcg: Unlock mmap_lock after longjmp
accel/tcg: Raise PROT_EXEC exception early
accel/tcg: Remove translator_ldsw
accel/tcg: Add pc and host_pc params to gen_intermediate_code
accel/tcg: Add fast path for translator_ld*
accel/tcg: Use DisasContextBase in plugin_gen_tb_start
accel/tcg: Do not align tb->page_addr[0]
include/hw/core: Create struct CPUJumpCache
accel/tcg: Introduce tb_pc and tb_pc_log
accel/tcg: Introduce TARGET_TB_PCREL
accel/tcg: Split log_cpu_exec into inline and slow path
target/arm: Introduce curr_insn_len
target/arm: Change gen_goto_tb to work on displacements
target/arm: Change gen_*set_pc_im to gen_*update_pc
target/arm: Change gen_exception_insn* to work on displacements
target/arm: Change gen_exception_internal to work on displacements
target/arm: Change gen_jmp* to work on displacements
target/arm: Introduce gen_pc_plus_diff for aarch64
target/arm: Introduce gen_pc_plus_diff for aarch32
target/arm: Enable TARGET_TB_PCREL
include/elf.h | 1 +
include/exec/cpu-common.h | 1 +
include/exec/cpu-defs.h | 3 +
include/exec/exec-all.h | 138 +++++++-------
include/exec/plugin-gen.h | 7 +-
include/exec/translator.h | 85 +++++++--
include/hw/core/cpu.h | 9 +-
linux-user/arm/target_cpu.h | 4 +-
linux-user/qemu.h | 1 +
target/arm/cpu-param.h | 2 +
target/arm/translate-a32.h | 2 +-
target/arm/translate.h | 21 ++-
accel/tcg/cpu-exec.c | 222 +++++++++++++---------
accel/tcg/cputlb.c | 98 +++-------
accel/tcg/plugin-gen.c | 23 +--
accel/tcg/translate-all.c | 197 +++++++-------------
accel/tcg/translator.c | 122 +++++++++---
accel/tcg/user-exec.c | 15 ++
linux-user/elfload.c | 81 +++++++-
softmmu/physmem.c | 12 ++
target/alpha/translate.c | 5 +-
target/arm/cpu.c | 23 +--
target/arm/translate-a64.c | 174 ++++++++++-------
target/arm/translate-m-nocp.c | 6 +-
target/arm/translate-mve.c | 2 +-
target/arm/translate-vfp.c | 10 +-
target/arm/translate.c | 237 +++++++++++++++---------
target/avr/cpu.c | 2 +-
target/avr/translate.c | 5 +-
target/cris/translate.c | 5 +-
target/hexagon/cpu.c | 2 +-
target/hexagon/translate.c | 6 +-
target/hppa/cpu.c | 4 +-
target/hppa/translate.c | 5 +-
target/i386/tcg/tcg-cpu.c | 2 +-
target/i386/tcg/translate.c | 7 +-
target/loongarch/cpu.c | 2 +-
target/loongarch/translate.c | 6 +-
target/m68k/translate.c | 5 +-
target/microblaze/cpu.c | 2 +-
target/microblaze/translate.c | 5 +-
target/mips/tcg/exception.c | 2 +-
target/mips/tcg/sysemu/special_helper.c | 2 +-
target/mips/tcg/translate.c | 5 +-
target/nios2/translate.c | 5 +-
target/openrisc/cpu.c | 2 +-
target/openrisc/translate.c | 6 +-
target/ppc/translate.c | 5 +-
target/riscv/cpu.c | 4 +-
target/riscv/translate.c | 5 +-
target/rx/cpu.c | 2 +-
target/rx/translate.c | 5 +-
target/s390x/tcg/translate.c | 5 +-
target/sh4/cpu.c | 4 +-
target/sh4/translate.c | 5 +-
target/sparc/cpu.c | 2 +-
target/sparc/translate.c | 5 +-
target/tricore/cpu.c | 2 +-
target/tricore/translate.c | 6 +-
target/xtensa/translate.c | 6 +-
tcg/tcg.c | 6 +-
tests/tcg/i386/test-i386.c | 2 +-
62 files changed, 979 insertions(+), 666 deletions(-)
--
2.34.1