Re: [lmi] [PATCH] Fix build with g++5


From: Greg Chicares
Subject: Re: [lmi] [PATCH] Fix build with g++5
Date: Fri, 18 Sep 2015 21:34:34 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0

On 2015-09-18 02:16, Vadim Zeitlin wrote:
> On Thu, 17 Sep 2015 22:20:03 +0000 Greg Chicares <address@hidden> wrote:

[...'const' follows the type it modifies, so 'volatile' should too, globally...]

>  But I can live with local consistency as well. OTOH it would be trivial to
> just exchange the order of "volatile" and "T" here and in a few other
> places to achieve uniform consistency...

Sure, but let's wait until after the Halloween release.

> GC> (2) the union trick prevents such optimizations, whereas the compiler
> GC> can recognize memcpy() and replace it with an optimization-friendly
> GC> intrinsic.
> 
>  Just to confirm: gcc5 does inline memcpy() with "-O2". For example, in
> md5_finish_ctx() the first (existing) memcpy() function call remains in the
> optimized version but the 2 my patch adds appear as just a single "mov"
> instruction.

I was curious about that, but didn't have the time to look into it,
so thanks for sharing this analysis.

> GC> [...] this begins to make the code a little harder
> GC> to understand, and causes me to think we might be better off writing an
> GC> inline (for optimizability) memcpy_cast() function.
> 
>  I tried to keep things as simple as possible but yes, we surely could have
> such a function, and we could even give it 2 template parameters so that it
> can optionally return a larger type, e.g. to allow writing
> 
>       int const index_table_number = memcpy_cast<int32_t, int>(index_record);
> 
> Should I make a patch for this?

Yes, please. I'm in a quandary. The original reinterpret_cast lines seem
much clearer to me than the memcpy() replacements; but if their behavior
is undefined, then they have to change. I'm hoping that memcpy_cast<>()
might preserve code clarity while slaying the UB demon.

> GC> We may wish we had done that if g++-6 or -7 gets more aggressive about
> GC> these warnings and forces us to make lots more changes elsewhere.
> 
>  I don't think there are any other places where we would need this.

I didn't think there were formerly any aliasing problems. Generally, I
expect one surprise to foreshadow more surprises to come. But no matter
how different our personal outlooks, we're scientists, so maybe we can
agree if we can find some data.

To that end, I tried building with
  tutelary_flag='-Wstrict-aliasing=1'
(which is appended to any builtin default CFLAGS or CXXFLAGS), and found no
further issues. However, by happy accident, I had first tried
  tutelary_flag='-Wstrict-aliasing=n'
and that worked just as smoothly, so I'm guessing that gcc-3.4.5 doesn't
offer different levels for this warning--as the manual seems to confirm:
  https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Warning-Options.html#index-Wstrict_002daliasing-227

Perhaps you could try -Wstrict-aliasing=1 or -Wstrict-aliasing=2 and
let me know what you see. AIUI, LLVM doesn't make any attempt to diagnose
aliasing problems. In a quick search, I found no other tool for this.

> GC> 
> GC>      for(int j = 0; j < number_of_values; ++j)
> GC>          {
> GC>          is.read(z, sizeof(double));
> GC> -        data_[j] = *reinterpret_cast<double*>(z);
> GC> +        double d;
> GC> +        std::memcpy(&d, z, sizeof(double));
> GC> +        data_[j] = d;
> GC>          }
> GC> 
> GC> I'm sure you don't like that any more than I do. Merely reading the code
> GC> can't be enough to make me feel comfortable accepting it, because this is
> GC> a time-critical loop, and creating a temporary variable just looks like it
> GC> might slow the system down...so let's measure it:
> 
>  Sorry, I should have mentioned that I did compare the benchmark before and
> after my changes and didn't see any statistically significant differences.
> 
>  I also didn't realize this loop was time-critical. If it is, I think it
> would be much faster to just read all the doubles at once instead of doing
> it one by one. Would you be interested in making measurements to show
> whether it's worth doing this or not?

Okay--results below.

> GC> But still I balk at introducing that temporary. Would something like
> GC> this example (which I even tested, with g++-3.4.5 only):
> GC> 
> GC> -    char z[sizeof(double)];
> GC>      for(int j = 0; j < number_of_values; ++j)
> GC>          {
> GC> -        is.read(z, sizeof(double));
> GC> -        data_[j] = *reinterpret_cast<double*>(z);
> GC> +        is.read(reinterpret_cast<char*>(&data_[j]), sizeof(double));
> GC>          }
> GC> 
> GC> be acceptable to g++-5?
> 
>  Yes, it would. But if you're willing to write code like this, then IMHO we
> should just directly go to
> 
>     is.read(static_cast<char*>(static_cast<void*>(&data_[0])),
>             number_of_values * sizeof(double));
> 
> and remove the loop completely.

Using the same testing command as in my preceding email, here I'll report
only the timings for five trials:

1386664  1383332  1390109  1393764  1383697  median 1386664  HEAD
1389036  1500248  1403038  1406045  1408138  median 1406045  my long read() statement above
1369771  1322875  1298282  1311516  1292876  median 1311516  your loopless read()

As expected, the loopless read() wins. But was I speaking carelessly when
I said this loop is time critical? I'm sure, from past measurements, that
making this code much slower creates a bottleneck that end users can feel.
But let's test overall system speed in a meaningful way:
  for z in a b c d e; do make $coefficiency cli_tests 2>/dev/null | grep iterations; done
condensing the results as above:

24378031  24497276  24303548  24760823  24346506  median 24378031  my long read() statement above
24375800  24426109  25250134  26510577  24376345  median 24426109  your loopless read()

The approach that performed best in the unit test performed worse in the
overall test. But the differences are slight: we're just measuring noise.
Conclusion: yes, I did misspeak. Thanks for pointing out the loopless read()
technique--I wouldn't have thought of leveraging std::vector's contiguous-
storage guarantee that way--but I don't think the gain in performance is
worth the cost in lost clarity. Neither is my long read() statement worth
the obscurity it adds, if memcpy_cast<>() can rescue HEAD from UB without
harming clarity.



