Re: [lmi] BERT: Error records from previous boot


From: Greg Chicares
Subject: Re: [lmi] BERT: Error records from previous boot
Date: Tue, 12 May 2020 20:35:57 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

On 2020-05-12 17:42, Vadim Zeitlin wrote:
> On Tue, 12 May 2020 15:42:09 +0000 Greg Chicares <address@hidden> wrote:
[...]
> GC> [    1.982866] BERT: Error records from previous boot:
> GC> [    1.985400] [Hardware Error]: event severity: fatal
> GC> [    1.985458] [Hardware Error]:  Error 0, type: fatal
> GC> [    1.985515] [Hardware Error]:   section_type: PCIe error
> GC> [    1.985572] [Hardware Error]:   port_type: 4, root port
[...]
>  I really have no idea, but, from a (very) high-level point of view, the
> PCIe error must be due to either the host/controller itself or one of the
> devices using it. If it's the host/controller, the only thing to do is to
> replace it, i.e. the motherboard, and you would probably be unwilling to do
> that until it just stops working in any case. If it's one of the devices,
> you could perhaps run stress tests on it. I don't know what kind of devices
> you have on this bus,

Two Xeon CPUs, four memory sticks, two 850 PRO SSDs, and a Radeon 5450.
Oh, and a DVD drive that's rarely used.

> some common candidates would be a graphics card or an
> SSD. If it's the former, it's not really a big deal either, as in the worst
> case you would just replace it too when/if it stops working.

The graphics card is passively cooled; it measures 42 degrees Celsius now.
I see no abnormality on the screen.

> If it's the
> latter, it's potentially more concerning, but if smartmontools don't show
> any errors/problems I wouldn't do anything about it yet either.

Selected output below. AFAICT, it says that neither of these SSDs has
had any errors, both have 86% of their chronological lifespan remaining,
and I've written six to eight tebibytes of data on each, versus the
manufacturer's specified endurance of 300 TiB (the same for the 512 GB
model as for the 1 TB one).

BTW, I've always booted with
  libata.force=noncqtrim
since acquiring these drives (they're blacklisted for queued trim now,
but may not have been in old kernels). I've never tried to "trim" manually;
IIRC, they're 7% overprovisioned by the manufacturer, I reserved an
extra 10% when I partitioned them, and no partition is ever more than
about 50% full (except when I make a mistake with 'rinse').
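
A minimal sketch, in Python, of how one might confirm that a boot
parameter such as libata.force=noncqtrim is actually in effect. The
function name kernel_param and the sample command line are hypothetical;
on a running Linux system the real string can be read from /proc/cmdline.

```python
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of `name` in a kernel command line.

    Returns "" for a bare flag with no "=value" part, and None if the
    parameter is absent.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else ""
    return None


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro libata.force=noncqtrim"
print(kernel_param(sample, "libata.force"))

# On a live system (assumption: Linux with /proc mounted):
# with open("/proc/cmdline") as f:
#     print(kernel_param(f.read(), "libata.force"))
```

The sketch compares whole tokens only; libata.force also accepts a
comma-separated list of settings, which this does not split.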

#smartctl -a /dev/sda
Device Model:     Samsung SSD 850 PRO 512GB
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       68074
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       98
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       37
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   077   067   000    Old_age   Always       -       23
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       3
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       61
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       16409734671

SMART Error Log Version: 1
No Errors Logged
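
A rough sketch of how the error-type counters in such a table could be
scanned for nonzero raw values. It assumes each attribute row sits on a
single line in the ten-column layout that smartctl prints; the function
name nonzero_raw is hypothetical, and the sample rows are copied from
the /dev/sda output above.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       3
"""


def nonzero_raw(table: str) -> dict:
    """Map attribute name to raw value for every row whose raw value is > 0."""
    result = {}
    for line in table.splitlines():
        fields = line.split()
        # Expect: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw. Skip headers and anything shorter.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit() and int(raw) > 0:
            result[fields[1]] = int(raw)
    return result


print(nonzero_raw(SAMPLE))
```

For these three rows the only nonzero counter is CRC_Error_Count (3).
That attribute is commonly attributed to the SATA link (cable or
connector) rather than to the flash itself, so it need not contradict
the empty error log.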

#smartctl -a /dev/sdb
Device Model:     Samsung SSD 850 PRO 1TB
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       68981
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       69
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       13
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   076   063   000    Old_age   Always       -       24
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       40
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       13011224498

SMART Error Log Version: 1
No Errors Logged
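
The "six to eight tebibytes" figure can be checked from attribute 241.
A minimal sketch, assuming Total_LBAs_Written counts 512-byte sectors
(an assumption, not something the output states) and taking the 300 TiB
endurance figure from the text above:

```python
SECTOR_BYTES = 512    # assumption: attribute 241 counts 512-byte LBAs
TIB = 2**40           # bytes per tebibyte
ENDURANCE_TIB = 300   # endurance figure quoted in the message

# Raw values of attribute 241 from the smartctl output above.
TOTAL_LBAS = {"sda": 16409734671, "sdb": 13011224498}


def tib_written(total_lbas: int) -> float:
    """Convert a Total_LBAs_Written raw value to tebibytes."""
    return total_lbas * SECTOR_BYTES / TIB


for dev, lbas in TOTAL_LBAS.items():
    tib = tib_written(lbas)
    pct = 100 * tib / ENDURANCE_TIB
    print(f"{dev}: {tib:.2f} TiB written, {pct:.1f}% of {ENDURANCE_TIB} TiB")
```

Under that assumption the drives come out at roughly 7.6 TiB and
6.1 TiB written, i.e. under 3% of the quoted endurance on each.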
