coreutils

Consistent kernel crashes with heavy md5sum usage... Thoughts?


From: Daniel Freedman
Subject: Consistent kernel crashes with heavy md5sum usage... Thoughts?
Date: Wed, 12 Feb 2014 12:49:51 -0500
User-agent: Mutt/1.5.21 (2010-09-15)

Hi,

I'm not sure if this is the best forum for my question, but I hope the
wider "coreutils" audience might have some suggestions, given the
connection to coreutils' md5sum utility.  In short, I've noticed
that I'm consistently getting kernel crashes on two of my (identical)
servers when performing heavy md5sum usage.

This is occurring while I'm md5sum'ing all files on a large partition
as part of archive verification --- say 1 million files corresponding
to 1 TByte of storage.  If I perform this repeatedly, the machines
seem to lock up about once a week.
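
A verification pass of this kind might look roughly like the sketch
below. It is an illustration of the workload, not necessarily the
commands actually used: the function names make_manifest and
verify_manifest are hypothetical, and it assumes GNU find, sort, xargs,
and md5sum.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the workload; the names are assumptions.

# Write an MD5 manifest for every regular file under directory $1 to file $2.
# Keep the manifest outside the tree so it is not checksummed itself.
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r md5sum ) > "$2"
}

# Re-read every file under directory $1 and compare it against manifest $2.
# The manifest is fed on stdin, so a relative manifest path still works
# after the cd. Exit status is non-zero if any file is missing or differs.
verify_manifest() {
    ( cd "$1" && md5sum --quiet -c ) < "$2"
}
```

Running make_manifest once and then verify_manifest in a loop over the
archive partition reproduces the sustained read-and-hash load described
above.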

Naturally, such md5sum usage puts a heavy load on the processor,
memory, and even the power supply, so my initial inclination is
that I must have some faulty components.  Even after the inconclusive
diagnostics described below, I remain highly skeptical that anything
here is inherent to the md5sum codebase.

To summarize where I am now: I've been very extensively testing all of
the likely culprits among components on both of my servers --- running
memtest86 upon boot for 3+ days, memtester in userspace for 24 hours,
repeated kernel compiles with various '-j' values, and the 'stress'
and 'stressapptest' load generators (see [1] for full details) --- and
I have never seen even a hiccup in server operation under such
"artificial" environments --- only with heavy md5sum operation.

At least in my past experience (with scientific HPC clusters), such
diagnostic results would normally rule out most problems with the
processor, memory, and mainboard subsystems.  The PSU is
often a little harder to rule out, but the 400W Seasonic PSUs are
rated at 2--3 times the wattage I should really need, even under peak
load (given each server's single-socket CPU is 65W at max TDP, there
are only a few HDs and one SSD, and no discrete graphics at all, of
course).
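
As a rough back-of-the-envelope check of that headroom estimate, the
arithmetic can be sketched as below. Only the 400W rating and the 65W
TDP come from the hardware listed in this post; the drive count and
every other wattage are assumed typical values, not measurements.

```shell
#!/bin/sh
# Rough power budget. Only psu_w and cpu_w come from the listed hardware;
# all other figures are assumed typical values, not measurements.
psu_w=400        # SeaSonic SS-400FL2 rating
cpu_w=65         # i7-4770S max TDP
hd_count=3       # "a few" HDs: assumed count
hd_w=10          # assumed peak draw per HD
ssd_w=5          # assumed peak draw for the SSD
board_ram_w=40   # assumed mainboard plus four DIMMs

load_w=$((cpu_w + hd_count * hd_w + ssd_w + board_ram_w))

# Headroom in integer tenths, since POSIX sh has no floating point.
ratio_x10=$((psu_w * 10 / load_w))

echo "estimated peak load: ${load_w}W"
echo "PSU headroom: $((ratio_x10 / 10)).$((ratio_x10 % 10))x"
```

With these assumed figures the estimated peak load is 140W, giving a
headroom of roughly 2.8x, which is consistent with the 2--3 times
quoted above.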

I'm further surprised to see the exact same kernel-crash behavior on
two separate, but identical, servers, which leads me to wonder if
there's possibly some regression between the hardware (given that it's
relatively new Haswell microcode / silicon) and the (kernel?)
software.

For those interested, here are the setups:

  Mainboard:  Supermicro X10SLQ
  Processor:  Intel Haswell i7-4770S (65W)
  Memory:     32GB Kingston DDR3 RAM (4x KVR16N11/8)
  PSU:        SeaSonic SS-400FL2 400W (Fanless) PSU
  O/S:        Debian GNU/Linux 7.4 Wheezy (amd64)
  Kernel:     Linux 3.11 ('3.11-0.bpo.2-amd64' via wheezy-backports)
  Filesystem: Ext4 (with default settings upon creation)

As you might observe, I'm running relatively high-end HW, and I'm also
obviously not doing any over-clocking, etc.  I'd love to also be able
to share dumps of the kernel crash, but that's non-trivial for me
given my knowledge base, some details of the technical setup, etc.

Any thoughts on what might be occurring here?  Or what I should focus on?
Thanks in advance.

Best wishes,
Daniel


[1] Here are the exact steps I took for various stress-testing (with
    root privileges when necessary, such as for memtester):

  aptitude install stress
  stress --cpu 8 --io 4 --vm 2 --timeout 10s --dry-run
  stress --cpu 8 --io 4 --vm 2 --hdd 3 --timeout 60s
  stress --cpu 8 --io 8 --vm 8 --hdd 4 --timeout 5m

  aptitude install stressapptest
  stressapptest -m 8 -i 4 -C 4 -W -s 30
  stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1gb -s 30
  stressapptest -m 8 -i 4 -C 4 -W -f /root/sat-file-test --filesize 1024 \
    --random-threads 4 -s 30
  stressapptest -m 8 -i 4 -C 4 -W --cc_test -s 30
  stressapptest -m 8 -i 4 -C 4 -W --local_numa -s 30
  stressapptest -m 8 -i 4 -C 4 -W -n 127.0.0.1 --listen -s 30
  stressapptest -m 12 -i 6 -C 8 -W -f /root/sat-file-test --filesize 1024 \
    --random-threads 4 -n 127.0.0.1 --listen -s 300

  aptitude install linux-source
  cp /usr/src/linux-source-3.2.tar.bz2 /root/
  tar xvfj linux-source-3.2.tar.bz2 
  cd linux-source-3.2/
  make defconfig
  time make 1>LOG 2>ERR
  make mrproper
  make defconfig
  time make -j16 1>LOG 2>ERR

  aptitude install memtester
  memtester 30G

  aptitude install memtest86+  # reboot and run for 3+ days


