bug-gplusplus
Re: Floating point failure (NaN)


From: Greg Chicares
Subject: Re: Floating point failure (NaN)
Date: Tue, 05 Jun 2001 23:57:52 -0400

Ron Miller wrote:
> 
> I have spent a lot of the last week trying to track down a bug that appears
> to be OS/hardware related.
> 
> The problem that I am seeing is that NaN (not a number) is mysteriously
> appearing in some of my floating point variables even when it should not.
> 
> I posted a message last week, but since then I have simplified the code.
> 
> I have a code fragment that looks like the following:
> 
>     double da, db;
>     while ( 1 ) {
>         da = 1.0;
>         db = da;
>         if ( (da!=da) || (db!=db) ) {
>             printf( "Found NaN\n" );
>         }
>     }
> 
> This will run for hours or even days on a multi-processor machine, and then
> at the same time, several of the jobs will start printing out that they
> found NaN (maybe 3 or 4 messages) and then go on acting normally again..
> Some machines also seem more susceptible to the problem, although it seems
> to eventually fail on all machines that I have tried.
> 
> My operating system / environment is:
> 
>         Redhat 6.2
>         gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
>         (also tried gcc version 2.95.2 19991024 (release))
>         2 x 700 MHz Intel Pentium III
>         100 MHz Bus Clock Speed
>         2 GB RAM
>         4 GB Swap Space
> 
> Changing compilers or changing compiler flags doesn't seem to fix the
> problem.
> 
> It also fails with floats as well as doubles.
> 
> It seems like it might be a problem with the hardware (FPU overheating?),
> but I have been able to get it to fail on machines from multiple vendors.
> 
> Has anyone else seen anything like this before?

I've never seen anything like that, but I haven't watched a program run
as long as you have. Well, I did once manage to keep win95 running for
several days, but that wasn't repeatable.

It sure sounds like a hardware failure. Isolated bursts of errors
would seem to support that hypothesis.

The old 80287 chip always ran hot. I once saw an Intel guy say
you could test it by touching it with your finger; if you got
a burn that said '78208' then it passed the test. But now the
FPU is on the same die as the other stuff, isn't it?

I tried running your old program overnight, but couldn't get
it to fail.
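
For anyone who wants to repeat the overnight run, here is a minimal
sketch of a slightly more defensive version of the test loop. It is not
the original program: the names double_bits, is_nan and run_check are
made up for illustration, and it assumes a 64-bit IEEE 754 double. The
volatile qualifiers are there so that the compiler cannot fold the
comparison away, and the logged bit patterns might help tell a genuine
NaN from some other kind of corruption.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Raw bit pattern of a double; assumes double is 64 bits wide. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* NaN is the only value that compares unequal to itself. */
int is_nan(double x)
{
    volatile double v = x;  /* volatile: force a real load and compare */
    return v != v;
}

/* Hypothetical helper: repeat the assign-and-check loop `iterations`
   times, log every NaN with a timestamp and the bit patterns seen,
   and return the number of hits. */
unsigned long run_check(unsigned long iterations)
{
    volatile double da, db;
    unsigned long i, hits = 0;

    for (i = 0; i < iterations; ++i) {
        double a, b;
        da = 1.0;
        db = da;
        a = da;
        b = db;
        if (is_nan(a) || is_nan(b)) {
            ++hits;
            printf("t=%ld iter=%lu da=%016llx db=%016llx\n",
                   (long)time(NULL), i,
                   (unsigned long long)double_bits(a),
                   (unsigned long long)double_bits(b));
            fflush(stdout);
        }
    }
    return hits;
}
```

Calling run_check repeatedly from main reproduces the endless loop of
the original fragment; on a healthy machine it should print nothing.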

>         2 GB RAM

Is this parity-checked RAM? Using the statistics here
  http://www.sciam.com/1999/1099issue/1099cyber.html
I figure you can go about 43 days on average without a
cosmic-ray RAM incident. Oh, wait, now that you've
simplified the program, I can see that it's going to
execute in cache, at least if nothing else is running
at the same time. Well, anyway, have a look here:
  http://patrec.com/rico/ecc/

I just had to look up whether your company's offices
are in Colorado, since according to this URL
  http://www.research.ibm.com/journal/rd/421/ziegler.html
Denver gets pelted with four times as many hadrons as
NYC, and Leadville should have thirteen times the 'soft
fail' rate as NYC; but it looks like you've got lots of
locations. Are you up in the mountains or in a brick
building or something? Have you had the place checked
for radon?
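
To put rough numbers on the altitude effect, here is a small sketch
that scales a mean time between incidents by a relative rate. It
assumes, purely for illustration, that the 43-day estimate above is an
NYC-level baseline and that the interval is inversely proportional to
the soft-fail rate; neither assumption comes from the cited pages, and
the function names are made up.

```c
#include <stdio.h>

/* Mean days between incidents at a site whose soft-fail rate is
   `relative_rate` times the baseline rate (assumed inverse scaling). */
double scaled_interval_days(double baseline_days, double relative_rate)
{
    return baseline_days / relative_rate;
}

/* Print the illustrative figures for the three locations mentioned. */
void print_scaled_intervals(void)
{
    const double baseline = 43.0;  /* assumed NYC-level baseline, in days */

    printf("NYC:       %.1f days\n", scaled_interval_days(baseline, 1.0));
    printf("Denver:    %.1f days\n", scaled_interval_days(baseline, 4.0));
    printf("Leadville: %.1f days\n", scaled_interval_days(baseline, 13.0));
}
```

Under those assumptions the interval drops from 43 days to about 11
days for Denver and a little over 3 days for Leadville.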


