Re: [lmi] Treacherous gcc-10 defect
From: Vadim Zeitlin
Subject: Re: [lmi] Treacherous gcc-10 defect
Date: Sat, 19 Dec 2020 20:37:53 +0100
Sorry for the delay in responding: I had hoped to be able to shed some extra
light on this issue but, after trying and failing to find anything really
useful, I've decided to finally reply to this email, even though I still
don't have anything constructive to propose.
On Thu, 10 Dec 2020 23:14:10 +0000 Greg Chicares <gchicares@sbcglobal.net>
wrote:
GC> On 12/10/20 1:26 PM, Greg Chicares wrote:
GC> [...a certain '-fno-omit-frame-pointer' testcase...]
GC> > succeeds with gcc-10, but fails with gcc-8. Accordingly, I'll
GC> > restrict the '-fomit-frame-pointer' workaround to
GC> > x86_64-w64-mingw32-gcc-8.x only, so that it doesn't propagate
GC> > to gcc-10 builds when we upgrade the compiler (very soon).
GC>
GC> TL;DR: x86_64-w64-mingw32 gcc-10 seems to need '-fomit-frame-pointer'.
This looks extraordinarily bad. I couldn't find any existing bug for this
in the gcc bugzilla. Do you think it would be worth spending time on producing
a minimal reproducible test case and reporting it?
GC> Earlier today I ran this command in a chroot with MinGW-w64 gcc-10:
GC> make raze; ./nychthemeral_test.sh
GC> ['raze' is a brutally emphatic 'clean' target]
GC> and observed the following output (only with gcc-10; not with
GC> MinGW-w64 gcc-8).
GC>
GC> LMI_TRIPLET = "x86_64-w64-mingw32"
GC> Production system built--ready to start GUI test in another session.
GC> ???? test failed: '0.666666666666667' == '0'
GC> ???? test failed: '0.666666666666667' == '0'
GC> ???? test failed: '0.666666666666667' == '0'
GC> ???? test failed: '0.666666666666667' == '0'
GC> ???? test failed: '0.666666666666667' == '0'
GC> ???? test failed: '0.666666666666667' == '0'
GC> ???? 6 test errors detected; 472 tests succeeded
GC> ???? returning with error code 201
GC> ????
GC> ???? errors detected; see stdout for details
I can confirm that I can reproduce the problem here too, using the x86_64
cross-compiler. It doesn't happen when using the i686 MinGW cross-compiler or
the native x86_64 compiler.
GC> The failure arose here:
GC> test_interconvertibility(y, "0.666666666666667" , __FILE__, __LINE__);
GC> and, at a wild guess, it looks as though the compiler treated the
GC> expression "(2.0 / 3.0)" as integral.
Looking at the generated code, this is not the case. What actually happens
here is this:
1. The compiler doesn't compute 2.0/3.0 itself; it generates code to do
it at run time, using SSE2 xmm registers. I'm not sure what exactly
prevents it from doing constant folding, but something does.
2. When calling test_interconvertibility((2.0 / 3.0)) at line 386 of
numeric_io_test.cpp, it doesn't recompute this result but reuses the
same xmm6 register that held it for a previous call to the same
function at line 280, i.e. it simply does "$xmm0 = $xmm6" before
calling the function, which takes its first double parameter in xmm0
under the usual calling convention.
3. The calling code is exactly the same with and without
-fomit-frame-pointer. The code in test_interconvertibility() is quite
different (not only because it uses rbp as another general purpose
register, which it does only a couple of times anyhow, but because all
stack addresses are different, so it's a bit difficult to reconcile the
two versions), but it does save and restore xmm6 on entering and
exiting the function in both cases.
4. Yet something does change the value of xmm6 between the two calls in
the build using -fno-omit-frame-pointer, because it's clearly wrong
when the function is called again (it's not actually 0 but 4.94066e-324,
which is less random than it might appear because it is the IEEE-754
64-bit double whose bit pattern, read as an integer, is exactly 1). I
haven't yet hunted down where exactly it is being changed, but I think I
should be able to if I spend more time on this. The problem is that I'm
not sure it's going to be really useful: producing a minimal example
reproducing the problem would probably be more so, but that would still
take quite some time. Do you think it would be useful to spend it on
this?
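To make the remark about 4.94066e-324 in point 4 concrete, here is a small sketch showing that this value is simply the double whose 64 bits, viewed as an integer, equal 1 (the smallest positive subnormal). The helper names double_from_bits() and bits_from_double() are made up for this illustration and are not part of lmi; the sketch assumes that double is IEEE-754 binary64.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: reinterpret a 64-bit integer bit pattern as a double.
// Assumes 'double' is IEEE-754 binary64 (checked only by size here).
double double_from_bits(std::uint64_t bits)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t), "binary64 expected");
    double d;
    std::memcpy(&d, &bits, sizeof d); // well-defined way to type-pun
    return d;
}

// Hypothetical helper: the inverse, i.e. the raw bit pattern of a double.
std::uint64_t bits_from_double(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

With these, double_from_bits(1) compares equal to std::numeric_limits<double>::denorm_min(), which prints as 4.94066e-324 at the default stream precision. So a register that should hold 2.0/3.0 but holds this value most likely had the integer 1 written into its low 64 bits.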
GC> An alternative and wilder guess
GC> is that '-fno-omit-frame-pointer' causes the error, but generating
GC> incorrect code without any diagnostic seemed so implausible
Yet this is almost certainly what happens here.
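For what it's worth, a reduced test case would presumably have roughly the following shape, based on the observations above: a value computed once at run time, kept live across a call to a non-inlined function, and then used again. This is only a sketch and is not known to trigger the defect. busy_callee() and value_survives_call() are hypothetical names, with busy_callee() standing in for test_interconvertibility(), and whether the compiler really keeps the value in xmm6 depends on its register allocation.

```cpp
#include <cmath>

// Hypothetical stand-in for test_interconvertibility(): a non-inlined
// function doing some floating-point work, so that the compiler may
// choose to use (and hence save and restore) callee-saved xmm registers.
#if defined(__GNUC__)
__attribute__((noinline))
#endif
double busy_callee(double x)
{
    double acc = 0.0;
    for (int i = 1; i <= 8; ++i)
    {
        acc += std::sqrt(x * i) / (1.0 + i);
    }
    return acc;
}

// Returns true if a value computed once at run time is still intact
// after an intervening call, i.e. if whatever register or stack slot
// held it was preserved correctly.
bool value_survives_call()
{
    // volatile forces the operands to be read at run time, so the
    // division below cannot be folded at compile time.
    volatile double two = 2.0;
    volatile double three = 3.0;
    double const y = two / three;

    double const first = busy_callee(y);
    // The compiler may keep y in a register across the first call
    // instead of recomputing it; that reuse is what went wrong above.
    double const second = busy_callee(y);

    return first == second && 0.66 < y && y < 0.67;
}
```

On a correctly working compiler value_survives_call() returns true. To be useful as a bug report, a sketch like this would have to be built with the x86_64-w64-mingw32 gcc-10 cross-compiler and -fno-omit-frame-pointer, and then reduced or extended until it actually fails.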
Regards,
VZ