qemu-arm
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

OSv guest encountering EC - "Unknown Reason" sync exception (ESR = 0x200


From: Waldek Kozaczuk
Subject: OSv guest encountering EC - "Unknown Reason" sync exception (ESR = 0x2000000) on Raspberry PI 4B host with KVM on
Date: Wed, 18 Nov 2020 12:45:39 -0500

Hi,

I am one of the OSv unikernel developers and I have been stuck trying to figure out the problem described here - https://github.com/cloudius-systems/osv/issues/1100 - for a long time now to no avail. And I am looking for any more suggestions on how to debug it further.

In essence, occasionally (20-30% of the time) OSv guest encounters "Unknown Reason" (ESR_EL1 = 0x2000000) sync exception when running the same test application (does not seem to be specific to any given application). It seems that the longer the application runs the more frequently this exception happens. Here is the state of the registers from the guest gdb debug session captured when exception encountered and after saving exception frame on the stack (sorted for convenience):

(gdb) info reg

ACTLR_EL1      0x0                 0

ACTLR_EL2      0x0                 0

ACTLR_EL3      0x0                 0

AFSR0_EL1      0x0                 0

AFSR1_EL1      0x0                 0

AIDR           0x0                 0

AMAIR0         0x0                 0

CLIDR          0x0                 0

CONTEXTIDR_EL1 0x0                 0

CPACR          0x300000            3145728

CSSELR         0x2                 2

CTR_EL0        0x0                 0

DACR32_EL2     0x1de7ec7edbadc0de  2154950976315703518

DBGBCR         0x0                 0

DBGBVR         0x0                 0

DBGWCR         0x0                 0

DBGWVR         0x0                 0

ELR_EL1        0x1000000311c0      17592186245568

ESR_EL1        0x2000000           33554432

FAR_EL1        0x1000000321b8      17592186249656

FPCR           0x0                 0

FPEXC32_EL2    0x700               1792

FPSR           0x14                20

ID_AA64AFR0_EL1 0x0                0

ID_AA64AFR1_EL1 0x0                0

ID_AA64AFR2_EL1_RESERVED 0x0       0

ID_AA64AFR3_EL1_RESERVED 0x0       0

ID_AA64DFR0_EL1 0x0                0

ID_AA64DFR1_EL1 0x0                0

ID_AA64DFR2_EL1_RESERVED 0x0       0

ID_AA64DFR3_EL1_RESERVED 0x0       0

ID_AA64ISAR0_EL1 0x10000           65536

ID_AA64ISAR1_EL1 0x0               0

ID_AA64ISAR2_EL1_RESERVED 0x0      0

ID_AA64ISAR3_EL1_RESERVED 0x0      0

ID_AA64ISAR4_EL1_RESERVED 0x0      0

ID_AA64ISAR5_EL1_RESERVED 0x0      0

ID_AA64ISAR6_EL1_RESERVED 0x0      0

ID_AA64ISAR7_EL1_RESERVED 0x0      0

ID_AA64MMFR0_EL1 0x1124            4388

ID_AA64MMFR1_EL1 0x0               0

ID_AA64MMFR2_EL1_RESERVED 0x0      0

ID_AA64MMFR3_EL1_RESERVED 0x0      0

ID_AA64MMFR4_EL1_RESERVED 0x0      0

ID_AA64MMFR5_EL1_RESERVED 0x0      0

ID_AA64MMFR6_EL1_RESERVED 0x0      0

ID_AA64MMFR7_EL1_RESERVED 0x0      0

ID_AA64PFR1_EL1 0x0                0

ID_AA64PFR2_EL1_RESERVED 0x0       0

ID_AA64PFR3_EL1_RESERVED 0x0       0

ID_AA64PFR5_EL1_RESERVED 0x0       0

ID_AA64PFR6_EL1_RESERVED 0x0       0

ID_AA64PFR7_EL1_RESERVED 0x0       0

ID_AA64ZFR0_EL1 0x0                0

ID_AFR0        0x0                 0

ID_DFR0        0x0                 0

ID_ISAR0       0x2101110           34607376

ID_ISAR1       0x13112111          319889681

ID_ISAR2       0x21232042          555950146

ID_ISAR3       0x1112131           17899825

ID_ISAR4       0x11142             69954

ID_ISAR5       0x10001             65537

ID_ISAR6       0x0                 0

ID_MMFR0       0x0                 0

ID_MMFR1       0x0                 0

ID_MMFR2       0x0                 0

ID_MMFR3       0x0                 0

ID_MMFR4       0x0                 0

ID_PFR0        0x0                 0

IFSR32_EL2     0x1de7ec7edbadc0de  2154950976315703518

MAIR_EL1       0xff440c0400        1096358298624

MAIR_EL3       0x0                 0

MDCCSR_EL0     0x0                 0

MDCR_EL3       0x0                 0

MDRAR_EL1      0x0                 0

MDSCR_EL1      0x0                 0

MVFR0_EL1      0x10110222          269550114

MVFR1_EL1      0x12111111          303108369

MVFR2_EL1      0x43                67

MVFR3_EL1_RESERVED 0x0             0

MVFR4_EL1_RESERVED 0x0             0

MVFR5_EL1_RESERVED 0x0             0

MVFR6_EL1_RESERVED 0x0             0

MVFR7_EL1_RESERVED 0x0             0

OSLSR_EL1      0xa                 10

PAR_EL1        0x1d000c7edbadc0de  2089683966090920158

PMCCFILTR_EL0  0x0                 0

PMCCNTR_EL0    0x1de7ec7edbadc0de  2154950976315703518

PMCEID0_EL0    0x20001             131073

PMCEID1_EL0    0x0                 0

PMCNTENCLR_EL0 0x5badc0c0          1538113728

PMCNTENSET_EL0 0x5badc0c0          1538113728

PMCR_EL0       0x4102302c          1090662444

PMEVCNTR0_EL0  0x1de7ec7edbadc0de  2154950976315703518

PMEVCNTR1_EL0  0x1de7ec7edbadc0de  2154950976315703518

PMEVCNTR2_EL0  0x1de7ec7edbadc0de  2154950976315703518

PMEVCNTR3_EL0  0x1de7ec7edbadc0de  2154950976315703518

PMEVTYPER0_EL0 0x1de7ec7edbadc0de  2154950976315703518

PMEVTYPER1_EL0 0x1de7ec7edbadc0de  2154950976315703518

PMEVTYPER2_EL0 0x1de7ec7edbadc0de  2154950976315703518

PMEVTYPER3_EL0 0x1de7ec7edbadc0de  2154950976315703518

PMINTENCLR_EL1 0x5badc0c0          1538113728

PMINTENSET_EL1 0x5badc0c0          1538113728

PMOVSCLR_EL0   0x1de7ec7edbadc0de  2154950976315703518

PMOVSSET_EL0   0x1de7ec7edbadc0de  2154950976315703518

PMSELR_EL0     0x1de7ec7edbadc0de  2154950976315703518

PMUSERENR_EL0  0x0                 0

REVIDR_EL1     0x0                 0

RVBAR_EL1      0x0                 0

SCTLR          0x30d0199d          818944413

SPSR_ABT       0x0                 0

SPSR_EL1       0x80000345          2147484485

SPSR_FIQ       0x0                 0

SPSR_IRQ       0x0                 0

SPSR_UND       0x0                 0

SP_EL0         0x0                 0

SP_EL1         0x200000200d60      35184374189408

TCR_EL1        0x15b5103510        93232051472

TPIDRRO_EL0    0x1de7ec7edbadc0de  2154950976315703518

TPIDR_EL0      0xffff900040f78000  -123144212348928

TPIDR_EL1      0x1de7ec7edbadc0de  2154950976315703518

TTBR0_EL1      0x407bf000          1081864192

TTBR1_EL1      0x407b5000          1081823232

VBAR           0x40207000          1075867648

cpsr           0x800003c5          -2147482683

fpcr           0x0                 0

fpsr           0x14                20

pc             0x402077a0          0x402077a0 <entry_invalid_from_sync>

sp             0x200000200d60      0x200000200d60

x0             0x1                 1

x1             0x2000000           33554432

x10            0xffffa00040ca9e00  -105552029245952

x11            0xffffa00040e1a600  -105552027736576

x12            0x100000032048      17592186249288

x13            0x1                 1

x14            0x4100000000        279172874240

x15            0x0                 0

x16            0x100000050000      17592186372096

x17            0x402b0240          1076560448

x18            0x0                 0

x19            0x200000200ed0      35184374189776

x2             0x0                 0

x20            0xffff900040f78000  -123144212348928

x21            0x1                 1

x22            0x1                 1

x23            0xffff900040a2f090  -123144217890672

x24            0x12ca90abac        80707890092

x25            0x0                 0

x26            0xffffa00040f79200  -105552026299904

x27            0xffff8000407cd739  -140736406431943

x28            0xffffa00040f79278  -105552026299784

x29            0x200000200e80      35184374189696

x3             0x0                 0

x30            0x1000000311b8      17592186245560

x4             0x1000000321e2      17592186249698

x5             0x200000200ee2      35184374189794

x6             0x76                118

x7             0x34                52

x8             0x34                52

x9             0x1                 1

  
OSv, being unikernel runs its kernel and application in the same memory space at EL1. 

The most difficult issue is that I do not know why we occasionally encounter this exception. The ARMV8 documentation gives many reasons (https://developer.arm.com/docs/ddi0595/h/aarch64-system-registers/esr_el1#ISS_exceptionswithanunknownreason) but none of the ones I tried to investigate seems to match what I am seeing.

One of the reasons could be unallocated instruction or instruction using not-allowed system register (like MSR, MRS) but that does not seem to be the case. The same application seems to encounter the Uknown Reason exception (if it does) at various places in the code and there does not seem to be anything consistent about it, except it is always the app code and never the kernel part (see addresses above 0x0000100000000000). Here are some examples of instructions per ELR_EL1 (ESR_EL1 is always 0x2000000):

Address,                                   Instruction
---------------------------------------

0x0000100000031000  d63f0020        blr     x1

0x0000100000031140  f9404c01        ldr     x1, [x0, #152]

0x00001000000311c0  94000182        bl      17c8 <*ABS*@plt+0x7b8>

0x0000100000031600  b4000040        cbz     x0, 1608 <*ABS*@plt+0x5f8>

0x0000100000031640  f00000e0        adrp    x0, 20000 <*ABS*@plt+0x1eff0>

0x0000100000031200  2a1503e0        mov     w0, w21

  

As you can see some instructions involve memory access but some do not (last example) and they all seem to be valid instructions.

The existing OSv sync exception handler looks like that:

121 .global entry_invalid_from_sync

122 .type entry_invalid_from_sync, @function

123 entry_invalid_from_sync:

124         mrs     x20, elr_el1       // Exception Link Register -> X20

125         mrs     x21, spsr_el1      // Saved PSTATE -> X21

126         mrs     x22, esr_el1       // Exception Syndrome Register -> X22

127 

128         ubfm    x23, x22, #ESR_EC_BEG, #ESR_EC_END   // Exception Class -> X23

129         ubfm    x24, x22, #ESR_ISS_BEG, #ESR_ISS_END // Instruction-Specific Syndrome -> X24

130 

131 1:      wfi

132         b       1b

133 

134 .global entry_sync

135 .type entry_sync, @function

136 entry_sync:

137         push_state_to_exception_frame

138         mrs     x1, esr_el1

139         ubfm    x2, x1, #ESR_EC_BEG, #ESR_EC_END // Exception Class -> X2

140         ubfm    x3, x1, #ESR_FLT_BEG, #ESR_FLT_END // FLT -> X3

141         cmp     x2, #ESR_EC_DATA_ABORT

142         b.eq    handle_mem_abort

143         cmp     x2, #ESR_EC_INSN_ABORT

144         b.ne    entry_invalid_from_sync

145 handle_mem_abort:

146         cbz     x3, entry_invalid_from_mem_abort

147         cmp     x3, #3

148         b.hi    entry_invalid_from_mem_abort

149 

150         mov     x0, sp  // save exception_frame to x0

151         bl      page_fault

152         pop_state_from_exception_frame

153         eret


which makes OSv hang when encountering the Uknown Reason exception (see line 131,132).

Now I discovered that if I change the entry_invalid_from_sync routine to ignore the exception, pop the frame and let processing continue (like in the modified code below), the app and kernel continues and triggers the exact same exception again in the exact same address (same ELR_EL1 value) over and over again (~5,000 to ~20,000,000 times) UNTIL it eventually ceases to happen and the app successfully completes and OSv terminates. As I wrote at the very beginning of this email, the same app never encounters the EC=0 exception in 70-80% of the runs.

121 .global entry_invalid_from_sync

122 .type entry_invalid_from_sync, @function

123 entry_invalid_from_sync:

124         mov     x0, sp  // save exception_frame to x0

125         bl      entry_sync_invalid_handler

126         pop_state_from_exception_frame

127         eret


Any ideas on what could be wrong or on how to further debug this problem? I can not see anything interesting in the host kern.log nor in QEMU log (qemu.log). 

Please note that I have never been able to reproduce this issue with the same code (kernel and app) in the emulated mode (TCG) so it always seems to happen with KVM acceleration only.

Below is some information about specifics of the host, qemu, etc:

The host is Ubuntu 20.04.1 running Raspberry PI 4B with 4GB of RAM booting from SSD:

qemu-system-aarch64 --version

QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.8)

Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers


uname -a

Linux ubuntu 5.4.0-1022-raspi #25-Ubuntu SMP PREEMPT Thu Oct 15 13:31:49 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux


lscpu

Architecture:                    aarch64

CPU op-mode(s):                  32-bit, 64-bit

Byte Order:                      Little Endian

CPU(s):                          4

On-line CPU(s) list:             0-3

Thread(s) per core:              1

Core(s) per socket:              4

Socket(s):                       1

Vendor ID:                       ARM

Model:                           3

Model name:                      Cortex-A72

Stepping:                        r0p3

CPU max MHz:                     1500.0000

CPU min MHz:                     600.0000

BogoMIPS:                        108.00

Vulnerability Itlb multihit:     Not affected

Vulnerability L1tf:              Not affected

Vulnerability Mds:               Not affected

Vulnerability Meltdown:          Not affected

Vulnerability Spec store bypass: Vulnerable

Vulnerability Spectre v1:        Mitigation; __user pointer sanitization

Vulnerability Spectre v2:        Vulnerable

Vulnerability Srbds:             Not affected

Vulnerability Tsx async abort:   Not affected

Flags:                           fp asimd evtstrm crc32 cpuid


The qemu comman line:

qemu-system-aarch64 \

-m 128M \

-smp 1 \

--nographic \

-gdb tcp::1234,server,nowait \

-kernel /home/wkozaczuk/projects/osv/build/last/loader.img \

-append "--disable_rofs_cache /tests/tst-tls.so" \

-machine virt \

-device virtio-blk-pci,id=blk0,drive=hd0,scsi=off \

-drive file=/home/wkozaczuk/projects/osv/build/release/tests_disk.img,if=none,id=hd0,cache=none,aio=native \

-netdev user,id=un0,net=192.168.122.0/24,host=192.168.122.1 \

-device virtio-net-pci,netdev=un0 \

-device virtio-rng-pci \

-enable-kvm \

-cpu host \

-d all \

-D ./qemu.log


Here is also the OSv guest memory layout of the kernel, application, and app stacks just in case.

0x0000000040090000 0x00000000407ac000           [KERNEL]


0x0000100000000000 0x0000100000001000 [4.0 kB]         flags=fmF      perm=rx   offset=0x00000000 path=/libvdso.so

0x000010000001f000 0x0000100000020000 [4.0 kB]         flags=fmF      perm=r    offset=0x0000f000 path=/libvdso.so

0x0000100000020000 0x0000100000021000 [4.0 kB]         flags=fmF      perm=rw   offset=0x00010000 path=/libvdso.so

0x0000100000030000 0x0000100000033000 [12.0 kB]        flags=fmF      perm=rx   offset=0x00000000 path=/tests/tst-tls.so

0x000010000004f000 0x0000100000050000 [4.0 kB]         flags=fmF      perm=r    offset=0x0000f000 path=/tests/tst-tls.so

0x0000100000050000 0x0000100000051000 [4.0 kB]         flags=fmF      perm=rw   offset=0x00010000 path=/tests/tst-tls.so

0x0000100000060000 0x0000100000061000 [4.0 kB]         flags=fmF      perm=rx   offset=0x00000000 path=/tests/libtls.so

0x000010000007f000 0x0000100000080000 [4.0 kB]         flags=fmF      perm=r    offset=0x0000f000 path=/tests/libtls.so

0x0000100000080000 0x0000100000081000 [4.0 kB]         flags=fmF      perm=rw   offset=0x00010000 path=/tests/libtls.so

0x0000100000090000 0x00001000000a3000 [76.0 kB]        flags=fmF      perm=rx   offset=0x00000000 path=/usr/lib/libgcc_s.so.1

0x00001000000bf000 0x00001000000c0000 [4.0 kB]         flags=fmF      perm=r    offset=0x0001f000 path=/usr/lib/libgcc_s.so.1

0x00001000000c0000 0x00001000000c1000 [4.0 kB]         flags=fmF      perm=rw   offset=0x00020000 path=/usr/lib/libgcc_s.so.1


0x0000200000000000 0x0000200000001000 [4.0 kB]         flags=p        perm=none

0x0000200000001000 0x0000200000002000 [4.0 kB]         flags=p        perm=none

0x0000200000002000 0x0000200000101000 [1020.0 kB]      flags=p        perm=rw  

0x0000200000101000 0x0000200000102000 [4.0 kB]         flags=p        perm=none

0x0000200000102000 0x0000200000201000 [1020.0 kB]      flags=p        perm=rw  


My regards,
Waldemar Kozaczuk


reply via email to

[Prev in Thread] Current Thread [Next in Thread]