Re: [lmi] BERT: Error records from previous boot
From: Vadim Zeitlin
Subject: Re: [lmi] BERT: Error records from previous boot
Date: Tue, 12 May 2020 23:04:07 +0200
On Tue, 12 May 2020 20:35:57 +0000 Greg Chicares <address@hidden> wrote:
GC> On 2020-05-12 17:42, Vadim Zeitlin wrote:
GC> > On Tue, 12 May 2020 15:42:09 +0000 Greg Chicares <address@hidden> wrote:
GC> [...]
GC> > GC> [ 1.982866] BERT: Error records from previous boot:
GC> > GC> [ 1.985400] [Hardware Error]: event severity: fatal
GC> > GC> [ 1.985458] [Hardware Error]: Error 0, type: fatal
GC> > GC> [ 1.985515] [Hardware Error]: section_type: PCIe error
GC> > GC> [ 1.985572] [Hardware Error]: port_type: 4, root port
GC> [...]
GC> > I really have no idea, but, from (very) high level point of view, the
GC> > PCIe error must be due to either the host/controller itself or one of
GC> > the devices using it. If it's the host/controller, the only thing to do
GC> > is to replace it, i.e. the motherboard, and you would be probably
GC> > unwilling to do it until it just stops working in any case. If it's one
GC> > of the devices, you could perhaps run stress tests on it. I don't know
GC> > what kind of devices do you have on this bus,
GC>
GC> Two xeon CPUs, four memory sticks, two 850 pro SSDs, and a radeon 5450.
GC> Oh, and a DVD thing that's rarely used.
CPUs and RAM shouldn't be connected to the PCIe bus, however (unless I've
missed some recent advance in computer hardware), and the DVD thing should
be on the SATA bus rather than the PCIe one too, so there is indeed nothing
else there -- and in fact I think your SSDs are of the SATA variety too,
unless you really have 2 M.2/U.2 slots on your motherboard (which is rare,
AFAIK). In any case, SMART output looks fine. If you're really, really
paranoid you could use "smartctl -t short" or "smartctl -t long" to run a
self test on them, but this doesn't really have anything to do with the
original BERT problem.
IOW, while I still don't know what the original problem was, I don't see
anything you can do about it, other than the usual stuff (backups, log
monitoring, etc.). And maybe use dmesg after the next reboot to see if it
happens again.
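If you'd rather not eyeball the dmesg output by hand, the check could be
scripted roughly as below. This is only a sketch: the function names are
made up for illustration, the patterns are simply the two strings that
appear in the log excerpt quoted above, and reading the kernel log may
require root on some systems.

```python
import re
import subprocess

# Patterns taken from the log excerpt quoted in this thread; other
# kernels or firmware may word these messages differently (assumption).
BERT_PATTERN = re.compile(
    r"BERT: Error records from previous boot|\[Hardware Error\]"
)


def find_bert_lines(dmesg_text):
    """Return the lines of dmesg output that mention BERT or a hardware error."""
    return [line for line in dmesg_text.splitlines() if BERT_PATTERN.search(line)]


def read_dmesg():
    """Run dmesg and return its output as text (hypothetical helper)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Example use after a reboot:
#     for line in find_bert_lines(read_dmesg()):
#         print(line)
```

An empty result would only mean that no such lines are in the current
kernel log, not that the hardware is known to be fine.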
Sorry, I know it's not very useful, but this is really all I can say,
VZ