Re: [lmi] BERT: Error records from previous boot
From: Greg Chicares
Subject: Re: [lmi] BERT: Error records from previous boot
Date: Tue, 12 May 2020 20:35:57 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

On 2020-05-12 17:42, Vadim Zeitlin wrote:
> On Tue, 12 May 2020 15:42:09 +0000 Greg Chicares <address@hidden> wrote:
[...]
> GC> [ 1.982866] BERT: Error records from previous boot:
> GC> [ 1.985400] [Hardware Error]: event severity: fatal
> GC> [ 1.985458] [Hardware Error]: Error 0, type: fatal
> GC> [ 1.985515] [Hardware Error]: section_type: PCIe error
> GC> [ 1.985572] [Hardware Error]: port_type: 4, root port
[...]
> I really have no idea, but, from (very) high level point of view, the PCIe
> error must be due to either the host/controller itself or one of the
> devices using it. If it's the host/controller, the only thing to do is to
> replace it, i.e. the motherboard, and you would be probably unwilling to do
> it until it just stops working in any case. If it's one of the devices, you
> could perhaps run stress tests on it. I don't know what kind of devices do
> you have on this bus,
Two Xeon CPUs, four memory sticks, two 850 PRO SSDs, and a Radeon 5450.
Oh, and a DVD drive that's rarely used.
> some common candidates would be a graphics card or a
> SSD. If it's the former, it's not really a big deal neither as in the worst
> case you would just replace it too when/if it stops working.
The graphics card is passively cooled; it measures 42 °C now.
I see no abnormality on the screen.
> If it's the
> latter, it's potentially more concerning, but if smartmon tools don't show
> any errors/problems I wouldn't do anything about it yet neither.
Selected output below. AFAICT, it says that neither of these SSDs has
had any errors, both have 86% of their chronological lifespan remaining,
and I've written six to eight tebibytes of data on each, versus 300 TiB
manufacturer's specified endurance (same for 512GB as for 1TB).
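
As a rough check of the "six to eight tebibytes" figure, the raw Total_LBAs_Written values can be converted by hand. This is only a sketch: it assumes the counter is in 512-byte units, which the smartctl output itself does not state, and the function name is made up for illustration.

```python
# Assumption: Total_LBAs_Written counts 512-byte LBAs.
LBA_BYTES = 512
TIB = 2 ** 40  # bytes per tebibyte

def lbas_to_tib(lbas, lba_bytes=LBA_BYTES):
    """Convert a raw LBA count to tebibytes."""
    return lbas * lba_bytes / TIB

# Raw values of attribute 241 copied from the smartctl output in this message.
written = {"sda": 16409734671, "sdb": 13011224498}
ENDURANCE_TIB = 300  # endurance figure quoted in the message

for dev, lbas in written.items():
    tib = lbas_to_tib(lbas)
    share = 100 * tib / ENDURANCE_TIB
    print(f"{dev}: {tib:.2f} TiB written, {share:.1f}% of {ENDURANCE_TIB} TiB")
```

Under that assumption both drives come out between six and eight TiB, i.e. only a few percent of the quoted endurance.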
BTW, I've always booted with
libata.force=noncqtrim
since acquiring these drives (they're blacklisted for trim now, but
may not have been in old kernels). I've never tried to "trim" manually;
IIRC, they're 7% overprovisioned by the manufacturer, I reserved an
extra 10% when I partitioned them, and no partition is ever more than
about 50% full (except when I make a mistake with 'rinse').
# smartctl -a /dev/sda
Device Model: Samsung SSD 850 PRO 512GB
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
  9 Power_On_Hours          0x0032 086   086   000    Old_age  Always  -           68074
 12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           98
177 Wear_Leveling_Count     0x0013 099   099   000    Pre-fail Always  -           37
179 Used_Rsvd_Blk_Cnt_Tot   0x0013 100   100   010    Pre-fail Always  -           0
181 Program_Fail_Cnt_Total  0x0032 100   100   010    Old_age  Always  -           0
182 Erase_Fail_Count_Total  0x0032 100   100   010    Old_age  Always  -           0
183 Runtime_Bad_Block       0x0013 100   100   010    Pre-fail Always  -           0
187 Uncorrectable_Error_Cnt 0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 077   067   000    Old_age  Always  -           23
195 ECC_Error_Rate          0x001a 200   200   000    Old_age  Always  -           0
199 CRC_Error_Count         0x003e 099   099   000    Old_age  Always  -           3
235 POR_Recovery_Count      0x0012 099   099   000    Old_age  Always  -           61
241 Total_LBAs_Written      0x0032 099   099   000    Old_age  Always  -           16409734671
SMART Error Log Version: 1
No Errors Logged

# smartctl -a /dev/sdb
Device Model: Samsung SSD 850 PRO 1TB
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
  9 Power_On_Hours          0x0032 086   086   000    Old_age  Always  -           68981
 12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           69
177 Wear_Leveling_Count     0x0013 099   099   000    Pre-fail Always  -           13
179 Used_Rsvd_Blk_Cnt_Tot   0x0013 100   100   010    Pre-fail Always  -           0
181 Program_Fail_Cnt_Total  0x0032 100   100   010    Old_age  Always  -           0
182 Erase_Fail_Count_Total  0x0032 100   100   010    Old_age  Always  -           0
183 Runtime_Bad_Block       0x0013 100   100   010    Pre-fail Always  -           0
187 Uncorrectable_Error_Cnt 0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 076   063   000    Old_age  Always  -           24
195 ECC_Error_Rate          0x001a 200   200   000    Old_age  Always  -           0
199 CRC_Error_Count         0x003e 100   100   000    Old_age  Always  -           0
235 POR_Recovery_Count      0x0012 099   099   000    Old_age  Always  -           40
241 Total_LBAs_Written      0x0032 099   099   000    Old_age  Always  -           13011224498
SMART Error Log Version: 1
No Errors Logged
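
For anyone wanting to scan such output mechanically, here is a minimal sketch that picks out error-type attributes with a nonzero raw value. It assumes each attribute row sits on a single line in the usual ten-column smartctl layout (not wrapped as in some mail archives). The set of attribute names treated as "error counters" and the function name are choices made here for illustration, not anything defined by smartmontools.

```python
# Attribute names treated as error counters; this selection is an assumption
# made for illustration only.
ERROR_COUNTERS = {
    "Reallocated_Sector_Ct", "Used_Rsvd_Blk_Cnt_Tot", "Program_Fail_Cnt_Total",
    "Erase_Fail_Count_Total", "Runtime_Bad_Block", "Uncorrectable_Error_Cnt",
    "ECC_Error_Rate", "CRC_Error_Count",
}

def nonzero_error_counters(text):
    """Return {attribute_name: raw_value} for error counters whose raw value is not 0."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected row: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers, blank lines, and anything else
        name, raw = fields[1], fields[9]
        if name in ERROR_COUNTERS and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found

# Three rows taken from the /dev/sda output above, one attribute per line.
sample = """\
  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 0
199 CRC_Error_Count         0x003e 099 099 000 Old_age  Always - 3
241 Total_LBAs_Written      0x0032 099 099 000 Old_age  Always - 16409734671
"""
print(nonzero_error_counters(sample))
```

On the /dev/sda rows the only counter this flags is CRC_Error_Count with a raw value of 3. That attribute typically counts CRC errors on the SATA link rather than flash errors, so it may point at cabling more than at the drive itself.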