Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04
From: Albert Chu
Subject: Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04
Date: Tue, 01 Feb 2011 17:48:03 -0800
Hey Robert,
The following beta release has a bmc-watchdog that has (hopefully) fixed
logging.
http://download.gluster.com/pub/freeipmi/qa-release/freeipmi-1.0.2.beta2.tar.gz
If you could check it out, that'd be great.
Al
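
For illustration, the logging failure described in the quoted thread below (an error written to stderr after the daemon has closed file descriptor 2) can be reproduced with a minimal C sketch. This is not FreeIPMI code: `try_write` and `write_to_closed_stderr` are hypothetical helper names, and the sketch only assumes that the daemon closes stderr when it detaches, which is what the strace output in the thread suggests.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write msg to fd; return 0 on success, or the errno value on failure. */
static int try_write(int fd, const char *msg)
{
    assert(msg != NULL);
    ssize_t n = write(fd, msg, strlen(msg));
    return (n < 0) ? errno : 0;
}

/* Hypothetical helper: close stderr the way a detaching daemon might,
 * attempt to write an error message to it, then restore stderr.
 * Returns the errno produced by the write (0 if the write succeeded). */
static int write_to_closed_stderr(const char *msg)
{
    int saved = dup(STDERR_FILENO);   /* keep a copy so stderr can be restored */
    if (saved < 0)
        return errno;                 /* stderr was already closed */

    close(STDERR_FILENO);             /* fd 2 is now invalid */
    int err = try_write(STDERR_FILENO, msg);

    dup2(saved, STDERR_FILENO);       /* put stderr back */
    close(saved);
    return err;
}
```

Calling `write_to_closed_stderr("watchdog timer must be stopped\n")` returns `EBADF`, matching the `write(2, ...) = -1 EBADF` line in the strace. The message is lost, which is why an error path in a daemon needs to write to the already-open log file instead.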
On Tue, 2011-02-01 at 17:20 -0800, Albert Chu wrote:
> Hi Robert,
>
> On Tue, 2011-02-01 at 11:40 -0800, Robert Hardy wrote:
> > It is possible that a BIOS option which starts the watchdog is
> > enabled.
> > Once I get a chance, I will dig around in the BIOS and see.
>
> I think a more likely scenario would be the IPMI kernel driver is
> starting up the watchdog and racing w/ the FreeIPMI one. Are you
> loading the IPMI kernel driver?
>
> > I would think it would be much better behaviour on startup to do the
> > equivalent of bmc-watchdog -y and then start the watchdog.
>
> I had to look this up (b/c I couldn't remember, but was fairly certain):
> the IPMI spec indicates that the watchdog timer is required to be turned
> off when a node is rebooted (section 27.1).
>
> > Failing to start simply because the BIOS started the countdown seems
> > very, very bad to me, especially without logging anything.
>
> The logging portion of this issue should be fixed w/ the next release.
>
> > You're left in
> > a state where the watchdog dies quietly and the server hard reboots
> > every couple of minutes.
>
> If the BIOS happens to be starting the countdown, that's *REALLY* bad on
> the part of the BIOS programmers. Whoever starts the countdown needs to
> manage it. It can't be left for some other random piece of software
> to handle.
>
> So just so I understand the situation correctly, when you disable the
> bmc-watchdog daemon, does the problem go away? The FreeIPMI
> bmc-watchdog does not start any timer until it determines the timer is
> stopped. Since the timer is already running, it never starts it.
>
> Al
>
>
> > I'm willing to test anything you send my way. The server isn't really in
> > production yet but will be soon.
> >
> > Ultimately I'm trying to package some better .debs for use on Ubuntu.
> > The current ones are badly packaged, to the point of being unusable.
> > I've re-written the init script for Ubuntu but I'd really like to see an
> > upstart based one....
> >
> > Rob
> >
> > On 2011-02-01 12:54 PM, Albert Chu wrote:
> > > Hey Robert,
> > >
> > > I think I see the problem(s). I call _err_exit(), which writes to
> > > stderr, instead of _daemon_error_exit() which writes to the log. That's
> > > the error logging issue, which is secondary to the real one.
> > >
> > > As for the real issue, I think this is being hit:
> > >
> > > > if (timer_state == IPMI_BMC_WATCHDOG_TIMER_TIMER_STATE_RUNNING)
> > > >   _err_exit ("watchdog timer must be stopped before running daemon");
> > >
> > > > For some reason, your BMC thinks the watchdog is running from the
> > > > start. You could verify w/ bmc-watchdog --get if you don't start the
> > > > timer. Perhaps it's a hardware bug?
> > >
> > > As an experiment, would you be willing to try a beta that removed this
> > > check? The issue is, I have no idea what the consequences of removing
> > > this check will be on your motherboard if there's a bug in it.
> > >
> > > Al
> > >
> > > On Mon, 2011-01-31 at 15:11 -0800, Robert Hardy wrote:
> > >> That would be /var/log/freeipmi/bmc-watchdog.log here and nothing is
> > >> logged at startup (or after the unexpected exit) during bootup.
> > >>
> > >> I've put all sorts of debugging lines in my init script for bmc-watchdog.
> > >>
> > > >> I finally ended up doing this as root:
> > >> mv /usr/sbin/bmc-watchdog /usr/sbin/bmc-watchdog.real
> > >>
> > >> and then putting this in /usr/sbin/bmc-watchdog:
> > >> #!/bin/bash
> > > >> strace -fFv -o /tmp/bmcstrace.log -- /usr/sbin/bmc-watchdog.real "$@"
> > >>
> > >> At bootup the bmc-watchdog initscript does launch a process with a new
> > >> PID but it does NOT log the regular "starting bmc-watchdog daemon". It
> > >> in fact logs nothing at all to /var/log/freeipmi/bmc-watchdog.log DURING
> > >> BOOT UP.
> > >>
> > >> The strace above captured bmc-watchdog running at bootup and the same
> > >> process exiting here at the last few lines:
> > >>
> > >> 1584 semop(229383, {{0, 1, SEM_UNDO}}, 1) = 0
> > >> 1584 nanosleep({0, 1000}, NULL) = 0
> > > >> 1584 write(2, "bmc-watchdog.real: watchdog time"..., 72) = -1 EBADF (Bad file descriptor)
> > >> 1584 exit_group(1) = ?
> > >>
> > >> I've posted the entire strace here:
> > >> http://webcon.ca/~rhardy/bmcdrop/
> > >>
> > >> Can you parse that and make any suggestions as to why it would exit
> > >> uncleanly and only on boot up?
> > >>
> > > >> I'm not quite sure what is going on, but it seems to be trying to write
> > > >> to a bad file descriptor, getting an error and then exiting.
> > > >> From the strace, file descriptor 2 is in fact closed, so that error
> > > >> makes sense to me. The real question is: why is it trying to write to FD 2?
> > >>
> > > >> When I restart bmc-watchdog and it gets to the same place, it properly
> > > >> writes the startup message to file descriptor 0, which is the log file
> > > >> that was opened earlier...
> > >>
> > >> 2466 write(0, "[Jan 31 18:03:23]: starting bmc-"..., 48) = 48
> > >>
> > >> I'm open to debugging suggestions too... Ideas?
> > >>
> > >> Thanks for your help,
> > >> Rob
> > >>
> > >> On 2011-01-28 5:37 PM, Albert Chu wrote:
> > >>> Hey Robert,
> > >>>
> > >>> That is indeed strange. Does the bmc-watchdog log say anything? (I
> > >>> can't remember the exact location, but I think it's /var/log/freeipmi/
> > >>> something).
> > >>>
> > >>> Al
> > >>>
> > >>> On Thu, 2011-01-27 at 13:14 -0800, Robert Hardy wrote:
> > > >>>> I'm running bmc-watchdog 0.7.15-2 under a current 64-bit Ubuntu 10.04
> > > >>>> on several fairly new, unloaded Supermicro servers.
> > >>>>
> > > >>>> On only one of the four servers (always the same one), the bmc-watchdog
> > > >>>> process quietly exits shortly after startup, leaving the system set up
> > > >>>> for a hard reset shortly after bootup.
> > >>>>
> > > >>>> The options and builds are identical on all of the servers. These are
> > > >>>> my options: OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"
> > >>>>
> > >>>> Through debugging I've confirmed on boot up:
> > >>>>
> > >>>> - The init script gets run
> > >>>>
> > > >>>> - It launches bmc-watchdog and saves a new PID correctly in
> > > >>>> /var/run/bmc-watchdog.pid.
> > >>>>
> > > >>>> - Checking for a bmc-watchdog process in rc.local shows it isn't
> > > >>>> running and the timer is counting down.
> > >>>>
> > >>>> - There is no shutdown message logged when the process disappears
> > >>>> during bootup.
> > >>>>
> > >>>> - There are no messages suggesting the process was killed
> > >>>>
> > >>>> On shutdown the init script gets as far as removing
> > >>>> /var/run/bmc-watchdog.pid and seems to work fine.
> > >>>>
> > > >>>> If I stuff this in rc.local, bmc-watchdog starts up properly and
> > > >>>> never seems to die again until the next reboot:
> > >>>> /usr/sbin/service bmc-watchdog stop
> > >>>> /usr/sbin/service bmc-watchdog start
> > >>>>
> > > >>>> All in all this is very weird behaviour. Is it possible a newer
> > > >>>> version of bmc-watchdog would address this? i.e. is this a known bug?
> > >>>>
> > >>>> Any other ideas why this is happening (or how I can debug further)?
> > >>>>
> > >>>> Regards,
> > >>>> Rob
> > >>>>
> > >>>> _______________________________________________
> > >>>> Freeipmi-users mailing list
> > >>>> address@hidden
> > >>>> http://lists.gnu.org/mailman/listinfo/freeipmi-users
> >
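
The two startup policies debated in this thread can be sketched as a small decision function in C. All names here (`timer_state`, `startup_action`, `decide_startup`, `stop_if_running`) are hypothetical and are not FreeIPMI identifiers. The sketch assumes only what the thread states: the daemon currently refuses to start when the timer is already running, while Robert proposes doing the equivalent of `bmc-watchdog -y` first, taken here to mean stopping the timer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not FreeIPMI identifiers. */
enum timer_state { TIMER_STOPPED, TIMER_RUNNING };

enum startup_action {
    ACTION_START_TIMER,      /* timer idle: daemon may start and manage it */
    ACTION_LOG_AND_EXIT,     /* timer already running: refuse, but log why */
    ACTION_STOP_THEN_START   /* timer already running: stop it, then start */
};

/* Decide what the daemon does at startup given the timer state the BMC
 * reports. stop_if_running selects the "stop first" policy proposed in
 * the thread; false selects the existing "refuse to start" behaviour. */
static enum startup_action decide_startup(enum timer_state state,
                                          bool stop_if_running)
{
    assert(state == TIMER_STOPPED || state == TIMER_RUNNING);

    if (state == TIMER_STOPPED)
        return ACTION_START_TIMER;

    return stop_if_running ? ACTION_STOP_THEN_START : ACTION_LOG_AND_EXIT;
}
```

With `stop_if_running` set to false, a timer that is already counting down yields `ACTION_LOG_AND_EXIT`. This is the case Robert's server hits on boot, and the one where the exit reason has to reach the log file for the failure to be diagnosable.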
--
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory