qemu-devel
From: Terry Bowman
Subject: Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
Date: Wed, 6 Mar 2024 11:12:11 -0600
User-agent: Mozilla Thunderbird

Hi Yuquan and Jon,

I added responses inline below.

On 3/6/24 07:23, Jonathan Cameron wrote:
> On Wed, 6 Mar 2024 19:27:07 +0800
> Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:
> 
>> Hello, Jonathan
>>
>> Recently I ran into some problems with CXL RAS tests.
>>
>> I tried to use the "cxl-inject-uncorrectable-errors" and
>> "cxl-inject-correctable-error" qmp commands to inject CXL errors; however,
>> the kernel in my qemu machine printed nothing. The qmp connection was also
>> unstable, and the machine always ended up "terminating on signal 2".
> 
> The qmp connection being unstable is odd - might be related to the CXL code,
> but I'm not sure how.
> 
>>
>> In addition, I successfully used the hmp "pcie_aer_inject_error" in the same
>> conditions. The kernel showed relevant print information.
> 
> IIRC the AER paths print under all circumstances whereas CXL errors do not;
> they simply trigger tracepoints - but you should have seen device resets.
>
> However, I spun up a test and I think the issue is more straightforward.
> The uncorrectable internal error and correctable internal error are masked
> on the device.
> I thought we changed the default on this in linux but maybe not :(
> 

The device AER UIE/CIE mask bits can be set while device AER errors are still
expected to be handled. The device reports AER UIE/CIE to the root port/RCEC on
behalf of device AER CRC, TLP, etc. errors.

In earlier changes we added logic to clear the RCEC UIE/CIE mask in order to
properly receive AER UIE/CIE notifications from devices and RCH dports.

"CXL Protocol and Link errors detected by components that are part of a CXL VH
are escalated and reported using standard PCIe error reporting mechanisms over
CXL.io as UIEs and/or CIEs. See PCIe Base Specification for details."[1]

[1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting

> Hack is to find the relevant device with lspci -tv and then use
> setpci -s 0d:00.0 0x208.l=0
> to clear all the mask bits for uncorrectable errors.
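
As a narrower variant of the hack above, here is a minimal sketch that computes
a mask value clearing only the Uncorrectable Internal Error bit instead of
writing 0. It assumes the standard PCIe AER layout (Uncorrectable Error Mask at
AER capability offset 0x08, UIE at bit 22) and uses the capability offset
(0x200) and mask readback (02400000) from the lspci/setpci output in this
message. The helper names are hypothetical and the script only prints a
command; it does not touch the device.

```python
# Sketch: derive a setpci write that unmasks only the Uncorrectable
# Internal Error (UIE) bit. Helper names are illustrative, not from any tool.

UNCOR_MASK_OFFSET = 0x08   # Uncorrectable Error Mask, relative to the AER capability
UIE_BIT = 1 << 22          # Uncorrectable Internal Error mask bit


def unmask_uie(aer_cap_offset, current_mask):
    """Return (config-space register offset, new 32-bit mask with UIE cleared)."""
    reg = aer_cap_offset + UNCOR_MASK_OFFSET
    new_mask = current_mask & ~UIE_BIT & 0xFFFFFFFF
    return reg, new_mask


def setpci_command(bdf, aer_cap_offset, current_mask):
    """Format the setpci write; setpci takes the value as hex digits."""
    reg, new_mask = unmask_uie(aer_cap_offset, current_mask)
    return f"setpci -s {bdf} {reg:#x}.l={new_mask:08x}"


# Values taken from the output in this message: AER capability at 0x200,
# uncorrectable mask currently reading 02400000.
print(setpci_command("0d:00.0", 0x200, 0x02400000))
```

The other mask bit that was set (bit 25) is left untouched this way.
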
> 
> Note I tested this on a convenient arm64 setup so always possible there is yet
> another problem on x86.
> 
> Robert / Terry, I tracked down the patch where you enabled this for RCHs and
> there was some discussion on walking out on VH as well to enable this, but it
> seems it never happened. Can you remember why?  Just kicked back for a future
> occasion?
> 
> Jonathan
> 
> 

I tested (qemu x86) using the aer-inject tool and found it to work. The output
below shows that the endpoint CIE is masked (0xe000 @ AER+0x14) and that the
injected error is properly handled, with root port logging and cxl_pci handler
trace logs.

    # lspci | grep -i cxl
    0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)

    # lspci -s 0d:00.0 -vvv | grep Advanced
    Capabilities: [200 v2] Advanced Error Reporting

    # setpci -s 0d:00.0 0x208.l
    02400000

    # setpci -s 0d:00.0 0x214.l
    0000e000

    # cat aer-input.txt
    # Inject a correctable bad TLP error into the device with header log
    # words 0 1 2 3.
    #
    # Either specify the PCI id on the command-line option or uncomment and edit
    # the PCI_ID line below using the correct PCI ID.
    #
    # Note that system firmware/BIOS may mask certain errors and/or not report
    # header log words.
    #
    AER
    #PCI_ID 0000:0C.00.0
    COR_STATUS BAD_TLP
    HEADER_LOG 0 1 2 3

    # ./aer-inject -s 0000:0d:00.0 aer-input.txt
    [   72.850686] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0d:00.0
    [   72.851784] pcieport 0000:0c:00.0: AER: Corrected error received: 0000:0d:00.0
    [   72.852594] cxl_pci 0000:0d:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
    [   72.853591] cxl_pci 0000:0d:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
    [   72.854277] cxl_pci 0000:0d:00.0:    [ 6] BadTLP
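
To make the register values above easier to read, here is a small sketch that
decodes AER mask/status words into bit names. The bit positions follow the
standard PCIe AER register layout as I understand it; the helper name and the
tables are only illustrative, so please check them against the PCIe Base
Specification before relying on them.

```python
# Sketch: decode PCIe AER correctable/uncorrectable mask or status words.
# Bit names follow the standard AER layout; decode_aer() is a hypothetical helper.

COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_aer(value, names):
    """Return the names of all bits set in a 32-bit AER register value."""
    return [
        names.get(bit, f"bit {bit} (not named in this table)")
        for bit in range(32)
        if (value >> bit) & 1
    ]


print("uncorrectable mask 0x02400000:", decode_aer(0x02400000, UNCOR_BITS))
print("correctable mask   0x0000e000:", decode_aer(0x0000E000, COR_BITS))
print("injected status    0x00000040:", decode_aer(0x00000040, COR_BITS))
```

Read this way, the correctable mask has the Corrected Internal Error bit set
but not the Bad TLP bit (bit 6), which matches the "status/mask=00000040/0000e000"
line and the "[ 6] BadTLP" entry in the log above.
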

I have not tried to use cxl-inject-uncorrectable-errors or
cxl-inject-correctable-error.

Regards,
Terry

>>
>> Questions:
>> 1) Are my CXL RAS test operations standard?
>> 2) Is the error injected by "pcie_aer_inject_error" a "protocol & link error"
>>    of cxl.io? And is the error injected by "cxl-inject-uncorrectable-errors"
>>    or "cxl-inject-correctable-error" a "protocol & link error" of
>>    cxl.cachemem?
>>
>> I hope I can get some help here; any help will be greatly appreciated.
>>
>>
>> My qemu command line:
>> qemu-system-x86_64 \
>> -M q35,nvdimm=on,cxl=on \
>> -m 4G \
>> -smp 4 \
>> -object memory-backend-ram,size=2G,id=mem0 \
>> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
>> -object memory-backend-ram,size=2G,id=mem1 \
>> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
>> -object memory-backend-ram,size=256M,id=cxl-mem0 \
>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
>> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
>> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
>> -hda ../disk/ubuntu_x86_test_new.qcow2 \
>> -nographic \
>> -qmp tcp:127.0.0.1:4444,server,nowait \
>>
>> Qemu version: 8.2.50, the latest commit of branch cxl-2024-03-05 in
>> "https://gitlab.com/jic23/qemu"
>> Kernel version: 6.8.0-rc6
>>
>> My steps in the Qemu qmp:
>> 1) telnet 127.0.0.1 4444
>>
>> result:
>> Trying 127.0.0.1...
>> Connected to 127.0.0.1.
>> Escape character is '^]'.
>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8}, "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
>>
>> 2) { "execute": "qmp_capabilities" }
>>
>> result:
>> {"return": {}}
>>
>> 3) If inject correctable error:
>> { "execute": "cxl-inject-correctable-error",
>>     "arguments": {
>>         "path": "/machine/peripheral/cxl-mem0",
>>         "type": "physical"
>>     } }
>>
>> result:
>> {"return": {}}
>>
>> 3) If inject uncorrectable error:
>> { "execute": "cxl-inject-uncorrectable-errors",
>>   "arguments": {
>>     "path": "/machine/peripheral/cxl-mem0",
>>     "errors": [
>>         {
>>             "type": "cache-address-parity",
>>             "header": [ 3, 4]
>>         },
>>         {
>>             "type": "cache-data-parity",
>>             "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>>         },
>>         {
>>             "type": "internal",
>>             "header": [ 1, 2, 4]
>>         }
>>         ]
>>   }}
>>
>> result:
>> {"return": {}}
>> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>>
>> Many thanks
>> Yuquan
>>
> 
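
For anyone wanting to script the QMP steps quoted above rather than typing them
into telnet, here is a minimal sketch. It assumes the QMP socket from the
quoted command line (tcp:127.0.0.1:4444), reuses the command names and
arguments from the quoted steps, and assumes QMP's usual one-JSON-object-per-line
framing. The function names are hypothetical, and I have not run this against
the CXL injection commands.

```python
# Sketch: drive the quoted QMP steps from a script instead of telnet.
# Function names are illustrative; command names/arguments come from the thread.
import json
import socket


def build_correctable_injection(path, error_type):
    """Build the cxl-inject-correctable-error command as a dict."""
    return {
        "execute": "cxl-inject-correctable-error",
        "arguments": {"path": path, "type": error_type},
    }


def build_uncorrectable_injection(path, errors):
    """Build cxl-inject-uncorrectable-errors; errors is a list of type/header dicts."""
    return {
        "execute": "cxl-inject-uncorrectable-errors",
        "arguments": {"path": path, "errors": errors},
    }


def qmp_session(host, port, commands, timeout=5.0):
    """Connect, negotiate capabilities, send each command, return the replies."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        json.loads(stream.readline())  # greeting banner ({"QMP": ...})
        for cmd in [{"execute": "qmp_capabilities"}] + list(commands):
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            while True:
                line = stream.readline()
                if not line:
                    raise ConnectionError("QMP connection closed")
                msg = json.loads(line)
                # Asynchronous events (e.g. SHUTDOWN) have no return/error key.
                if "return" in msg or "error" in msg:
                    replies.append(msg)
                    break
    return replies
```

Usage would be along the lines of
qmp_session("127.0.0.1", 4444, [build_correctable_injection("/machine/peripheral/cxl-mem0", "physical")]),
which returns the reply to qmp_capabilities followed by the reply to the
injection command.
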


