[lmi] Performance of bourn_cast
From: Greg Chicares
Subject: [lmi] Performance of bourn_cast
Date: Sat, 18 Mar 2017 16:17:01 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0

Here are optimized and unoptimized speed measurements for the new cast
added in commit 5e36b0a6bc9fed0788b6f74a9896387c8a729cb4:
$make $coefficiency unit_tests unit_test_targets=bourn_cast_test.exe
direct: 3.159e-004 s = 315883 ns, mean of 100 iterations
S to U: 2.334e-002 s = 23343216 ns, mean of 43 iterations
U to S: 2.501e-002 s = 25008472 ns, mean of 40 iterations
$make $coefficiency build_type=unoptimized CXXFLAGS='-O0' unit_tests \
    unit_test_targets=bourn_cast_test.exe
direct: 2.282e-003 s = 2281958 ns, mean of 100 iterations
S to U: 2.772e-002 s = 27718286 ns, mean of 37 iterations
U to S: 2.683e-002 s = 26828771 ns, mean of 38 iterations
That may be good enough for production use. We're using boost-1.33.1,
which has an "optimized" numeric_cast in a header-only library that
isn't too enormous:
$i686-w64-mingw32-g++ -E \
    /opt/lmi/third_party/include/boost/numeric/conversion/cast.hpp \
    -I /opt/lmi/third_party/include -DBOOST_STRICT_CONFIG 2>&1 |wc
7280 19577 203919
but we want to replace it anyway because unknown issues may lurk in
those 203919 bytes. It's a good idea first to test its performance:
---------8<--------8<--------8<--------8<--------8<--------8<--------8<-------
diff --git a/bourn_cast_test.cpp b/bourn_cast_test.cpp
index 1732e98..7d774d2 100644
--- a/bourn_cast_test.cpp
+++ b/bourn_cast_test.cpp
@@ -21,7 +21,13 @@
#include "pchfile.hpp"
-#include "bourn_cast.hpp"
+//#include "bourn_cast.hpp"
+#include <boost/cast.hpp>
+template<typename To, typename From>
+To bourn_cast(From from)
+{
+ return boost::numeric_cast<To>(from);
+}
#include "miscellany.hpp" // stifle_warning_for_unused_variable()
#include "test_tools.hpp"
--------->8-------->8-------->8-------->8-------->8-------->8-------->8-------
[Ignore the test failures, all of which indicate only that the boost
implementation throws different exceptions than bourn_cast.]
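To make the mismatch concrete, here is a hypothetical illustration (not part of the
unit test): bourn_cast reports a range violation by throwing std::runtime_error,
whereas boost::numeric_cast throws its own exception types (negative_overflow in
this case, assumed here to derive ultimately from std::exception), so a test that
expects the former fails with the boost shim even though both reject the value.

// Hypothetical illustration only; the exception hierarchy of boost-1.33.1
// is an assumption here, not something verified against that release.
#include <boost/cast.hpp>

#include <cstdint>
#include <iostream>
#include <stdexcept>

int main()
{
    try
        {
        std::uint32_t u = boost::numeric_cast<std::uint32_t>(-1);
        std::cout << u << '\n';
        }
    catch(std::runtime_error const&)
        {
        std::cout << "what a bourn_cast-style test expects\n";
        }
    catch(std::exception const& e)
        {
        std::cout << "what boost throws instead: " << e.what() << '\n';
        }
}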
Without gcc optimization, it performs no better than bourn_cast:
'-O0':
direct: 2.276e-003 s = 2275670 ns, mean of 100 iterations
S to U: 2.352e-002 s = 23517581 ns, mean of 43 iterations
U to S: 2.823e-002 s = 28228958 ns, mean of 36 iterations
However, with gcc optimization, boost::numeric_cast is almost two
orders of magnitude faster, and almost as fast as the "direct" case,
which is a plain, unguarded static_cast:
'-O2':
direct: 4.187e-004 s = 418724 ns, mean of 100 iterations
S to U: 4.266e-004 s = 426604 ns, mean of 100 iterations
U to S: 4.957e-004 s = 495684 ns, mean of 100 iterations
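As an aside, for anyone who wants to reproduce figures of this kind: the three
cases could be timed with something as simple as the sketch below. The iteration
count, the chrono-based timer, and the exact operand types are assumptions here,
not what the actual unit test does; the boost shim from the patch above can be
substituted for the real header to compare the two implementations.

// Hypothetical timing sketch: time a million casts of each kind and
// print the elapsed time for each case.
#include "bourn_cast.hpp"

#include <chrono>
#include <cstdint>
#include <iostream>

template<typename F>
double seconds(F f)
{
    auto const t0 = std::chrono::steady_clock::now();
    f();
    auto const t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    int const n = 1000000;            // assumed iteration count
    volatile std::int32_t  s = 1;     // volatile defeats loop elision
    volatile std::uint32_t u = 1;

    // "direct": plain, unguarded static_cast
    double const direct = seconds([&]
        {for(int i = 0; i < n; ++i) u = static_cast<std::uint32_t>(s);});
    // "S to U": signed to unsigned through the checked cast
    double const s_to_u = seconds([&]
        {for(int i = 0; i < n; ++i) u = bourn_cast<std::uint32_t>(s);});
    // "U to S": unsigned to signed through the checked cast
    double const u_to_s = seconds([&]
        {for(int i = 0; i < n; ++i) s = bourn_cast<std::int32_t>(u);});

    std::cout
        << "direct: " << direct << " s\n"
        << "S to U: " << s_to_u << " s\n"
        << "U to S: " << u_to_s << " s\n"
        ;
}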
Reverting the patch above, let's try everything we can think of to
hand-optimize bourn_cast:
---------8<--------8<--------8<--------8<--------8<--------8<--------8<-------
diff --git a/bourn_cast.hpp b/bourn_cast.hpp
index 87b61bb..6d4dffe 100644
--- a/bourn_cast.hpp
+++ b/bourn_cast.hpp
@@ -80,12 +80,18 @@ To bourn_cast(From from)
     using from_traits = std::numeric_limits<From>;
     static_assert( to_traits::is_specialized, "");
     static_assert(from_traits::is_specialized, "");
+    static constexpr bool to_is_unsigned  = !to_traits::is_signed;
+    static constexpr bool from_is_signed  = from_traits::is_signed;
+    static constexpr To   lower_bound     = to_traits::lowest();
+    static constexpr To   upper_bound     = to_traits::max();
+    static constexpr bool must_test_lower = from_traits::lowest() < lower_bound;
+    static constexpr bool must_test_upper = upper_bound < from_traits::max();
-    if(! to_traits::is_signed && from < 0)
+    if(to_is_unsigned && from_is_signed && from < 0)
         throw std::runtime_error("Cast would convert negative to unsigned.");
-    if(from_traits::is_signed && from < to_traits::lowest())
+    if(from_is_signed && must_test_lower && from < lower_bound)
         throw std::runtime_error("Cast would transgress lower limit.");
-    if(to_traits::max() < from)
+    if(must_test_upper && upper_bound < from)
         throw std::runtime_error("Cast would transgress upper limit.");
     return static_cast<To>(from);
 # if defined __GNUC__
--------->8-------->8-------->8-------->8-------->8-------->8-------->8-------
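Read as plain code rather than as a patch, and with the headers it needs, the
hand-optimized function looks roughly like this (a sketch: the gcc-specific
section at the end of the hunk is elided).

#include <limits>
#include <stdexcept>

template<typename To, typename From>
To bourn_cast(From from)
{
    using   to_traits = std::numeric_limits<To  >;
    using from_traits = std::numeric_limits<From>;
    static_assert( to_traits::is_specialized, "");
    static_assert(from_traits::is_specialized, "");

    // All compile-time constants: any check guarded by a 'false' one
    // should be discarded by the compiler for that instantiation.
    static constexpr bool to_is_unsigned  = !to_traits::is_signed;
    static constexpr bool from_is_signed  = from_traits::is_signed;
    static constexpr To   lower_bound     = to_traits::lowest();
    static constexpr To   upper_bound     = to_traits::max();
    static constexpr bool must_test_lower = from_traits::lowest() < lower_bound;
    static constexpr bool must_test_upper = upper_bound < from_traits::max();

    if(to_is_unsigned && from_is_signed && from < 0)
        throw std::runtime_error("Cast would convert negative to unsigned.");
    if(from_is_signed && must_test_lower && from < lower_bound)
        throw std::runtime_error("Cast would transgress lower limit.");
    if(must_test_upper && upper_bound < from)
        throw std::runtime_error("Cast would transgress upper limit.");
    return static_cast<To>(from);
}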
Maybe that helped a little in the middle ("S to U") case, without
optimization:
'-O0':
direct: 2.260e-003 s = 2260220 ns, mean of 100 iterations
S to U: 2.392e-002 s = 23917486 ns, mean of 42 iterations
U to S: 2.733e-002 s = 27328600 ns, mean of 37 iterations
But it didn't help at all with compiler optimization:
'-O2':
direct: 3.183e-004 s = 318337 ns, mean of 100 iterations
S to U: 2.315e-002 s = 23146282 ns, mean of 44 iterations
U to S: 2.550e-002 s = 25502980 ns, mean of 40 iterations
How can Cacciola have made his reimplementation so much faster?
His proposal "N1879=05-0139"
  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1879.htm
claims to have:
| fixed all the range-checking problems of the initial version and
| even gained compile-time optimizations by totally avoiding the
| range-check when unnecessary.
where "the initial version" means Henney's original, from which
bourn_cast is derived. Yet the hand-optimized bourn_cast certainly
doesn't perform any unnecessary run-time checks.
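For example, in a purely widening conversion, every throwing branch in the
hand-optimized version is guarded by a compile-time 'false'. The types below are
only an illustration, assuming the patched header above:

#include "bourn_cast.hpp" // assuming the hand-optimized version above
#include <cstdint>

// For To = std::int64_t and From = std::int32_t:
//   to_is_unsigned  = false
//   must_test_lower = (INT32_MIN < INT64_MIN) = false
//   must_test_upper = (INT64_MAX < INT32_MAX) = false
// so none of the three checks can ever fire, and the compiler is free
// to reduce the whole call to the bare static_cast.
std::int64_t widen(std::int32_t i)
{
    return bourn_cast<std::int64_t>(i);
}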
After pondering this for a long time and rereading Cacciola's paper
again and again, I stumbled upon his secret. SPOILER ALERT--this
astounding secret is revealed below. Retesting bourn_cast with this
amazing improvement applied:
'-O0':
direct: 2.248e-003 s = 2247836 ns, mean of 100 iterations
S to U: 2.392e-002 s = 23920379 ns, mean of 42 iterations
U to S: 2.537e-002 s = 25372050 ns, mean of 40 iterations
'-O2':
direct: 4.188e-004 s = 418799 ns, mean of 100 iterations
S to U: 4.192e-004 s = 419169 ns, mean of 100 iterations
U to S: 4.851e-004 s = 485088 ns, mean of 100 iterations
so now it's as fast as the "optimized" boost version.
Are any of the hand optimizations above worth the complexity they add?
Here's a comparison where each line is the median of five runs, both
using the astonishing speedup technique:
'-O2', "hand optimized" code:
direct: 4.184e-004 s = 418430 ns, mean of 100 iterations
S to U: 4.214e-004 s = 421380 ns, mean of 100 iterations
U to S: 4.920e-004 s = 491967 ns, mean of 100 iterations
'-O2', original commit:
direct: 4.190e-004 s = 418974 ns, mean of 100 iterations
S to U: 4.208e-004 s = 420830 ns, mean of 100 iterations
U to S: 5.151e-004 s = 515096 ns, mean of 100 iterations
There does seem to be a slight improvement. I wouldn't suppose that
caching numeric_limits values in static constexpr variables has any
benefit. Perhaps 'must_test_lower' and 'must_test_upper' actually
do help, though the timing differences are so small that they don't
really prove anything; yet they might make a measurable difference in
some case other than those tested. Vadim, what do you think?
SPOILER: The amazing secret? Add 'inline'. I had supposed that these
days that keyword is mainly useful for avoiding ODR violations, and
that the compiler would perform inlining whenever it would help,
without any hint from me--so I was surprised.
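Concretely, the whole fix is one keyword on the declaration shown in the
patches above:

template<typename To, typename From>
inline To bourn_cast(From from)
{
    // ...body unchanged...
}

A plausible explanation is that gcc's inlining heuristics still treat the
keyword as a hint: functions explicitly declared inline are judged against a
larger size threshold (max-inline-insns-single) than other inlining candidates
(max-inline-insns-auto), so without the keyword the checked cast apparently
falls on the wrong side of that threshold at '-O2'.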