
[lmi] Ancient rate-table programs: GIGO


From: Greg Chicares
Subject: [lmi] Ancient rate-table programs: GIGO
Date: Sat, 10 Dec 2016 22:16:55 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.4.0

| I have been asked,--"Pray, Mr. Babbage, if you put into the machine
| wrong figures, will the right answers come out?" [...] I am not
| able rightly to apprehend the kind of confusion of ideas that could
| provoke such a question.
  --Charles Babbage

Kim has discovered a new class of errors in historical rate tables.
A text-format table that looked like this in relevant part [note the
inconsistency]:

Number of decimal places: 6
Table values:
...
 25  0.00002551

[spoiler alert--the inconsistency is identified below]

was merged into a binary {.dat .ndx} blob using some unknown version
of some unknown tool, which might have been any of
 (1) our old command-line program, or
 (2) an old SOA-provided GUI program, or
 (3) an old SOA-provided spreadsheet addin.
For (1), we have the full source, which includes some code that the
SOA published only for its original 1990s version. However, I suspect
that either (2) or (3) was used, for which source was never published
(the earliest version of (2) should probably have used the same
non-GUI code as (1), but we can't even be sure of that). Whatever
program was used, it wrote these incompatible values:
  6 number of decimals (six, not eight)
  0.00002551 value at age 25 (rounded to eight decimals, not six)
into the blob, and then the original input file was discarded. Today,
in production, the precise value 0.00002551 is used, but extracting
the table rounds that value to 0.000026.

Here is a patch that I will probably commit on the belief that it
does no harm yet adds some value:

--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------
diff --git a/rate_table_tool.cpp b/rate_table_tool.cpp
index c0cb113..0fe90b1 100644
--- a/rate_table_tool.cpp
+++ b/rate_table_tool.cpp
@@ -340,7 +340,17 @@ int verify(fs::path const& database_filename)
                     << LMI_FLUSH
                     ;
                 }
-
+            if(new_table != orig_table)
+                {
+                // This is not really fatal; fatal_error() is used here
+                // only as a convenient way to throw an exception.
+                fatal_error()
+                    << "After loading and saving the original table #"
+                    << num
+                    << ", binary contents differed.\n"
+                    << LMI_FLUSH
+                    ;
+                }
             }
         catch(std::exception const& e)
             {
@@ -395,6 +405,10 @@ int verify(fs::path const& database_filename)
             {
             table const& orig_table = orig_db.get_nth_table(i);
             table const& new_table = new_db.get_nth_table(i);
+            if(new_table.compute_hash_value() != orig_table.compute_hash_value())
+                {
+                std::cout << "CRC mismatch: " << orig_table.number() << std::endl;
+                }
             if(new_table != orig_table)
                 {
                 std::cout
-------->8-------->8-------->8-------->8-------->8-------->8-------->8--------

The second chunk might indeed do nothing. My guess is that a similar
or identical condition is tested upstream, throwing an exception like

  Verification failed for table #140: Error reading table from string:
  bad data for table 140: hash value 2893035046 doesn't match the
  computed hash value 1855339846.

on failure. However, the first chunk definitely finds problems that
aren't otherwise found. Without that patch chunk, verify() performs
this chain of conversions:

  bin0 --> txt1 --> bin2 --> txt3

and compares txt1 with txt3 (this change compares bin0 and bin2 as
well: an independent condition). Then it converts

  bin0 --> bin1

and compares bin0 with bin1.

Even with the patch above, the rebuilt proprietary blobs I shared
internally last week pass this validation, as does 'qx_cso'. The
files 'qx_ann' and 'qx_ins' downloaded from the SOA do not pass
validation, either with or without this patch, but those aren't
really critical--IIRC, we use the former only for 83GAM, and the
latter only for the 'sample' product.

With the patch above, I did this in a throwaway directory, using
the original blobs that are in production today:

  wine /opt/lmi/bin/rate_table_tool --accept --file=/opt/lmi/data/original --extract-all

and then compared each of the 594 resulting tables to its
corresponding file in (proprietary) git. Every table that
actually differed was flagged by this command:

  wine ../bin/rate_table_tool -a -f original -v 2>&1 |less -S

so I think verify() detects every problem we now know of.

Here is a separate patch that isn't suitable for production in its
present form, but I'm using it for experimental investigations.

--------8<--------8<--------8<--------8<--------8<--------8<--------8<--------
diff --git a/rate_table.cpp b/rate_table.cpp
index f2f7fb9..2dd3a8d 100644
--- a/rate_table.cpp
+++ b/rate_table.cpp
@@ -609,13 +609,50 @@ void writer::write_table_type(table_type tt)
 
 void writer::write_values
         (std::vector<double> const& values
-        ,boost::optional<uint16_t> const& num_decimals
+        ,boost::optional<uint16_t> const& putative_num_decimals
         ,boost::optional<uint16_t> const& min_age
         ,boost::optional<uint16_t> const& max_age
         ,boost::optional<uint16_t> const& select_period
         ,boost::optional<uint16_t> const& max_select_age
         )
 {
+    uint16_t required_decimals = 0;
+    for(auto v: values)
+        {
+        std::size_t d = 0;
+        std::string s = value_cast<std::string>(v);
+        std::string::size_type p = s.find('.');
+        if(std::string::npos != p)
+            {
+            s.erase(0, 1 + p);
+            d = s.size();
+            }
+        if(required_decimals < d)
+            {
+            required_decimals = d;
+            }
+        }
+
+    uint16_t specified_decimals =
+          putative_num_decimals
+        ? *putative_num_decimals
+        : throw std::logic_error("Number of decimals must be specified.")
+        ;
+    if(specified_decimals != required_decimals)
+        {
+        warning()
+            << "Table specifies " << specified_decimals << " decimals"
+            << " but " << required_decimals << " are required."
+            << LMI_FLUSH
+            ;
+// can't do this--it's const
+//        num_decimals = (uint16_t)required_decimals;
+        }
+// Useful later to trim trailing zeros too...
+//    boost::optional<uint16_t> const num_decimals(required_decimals);
+// ...but for now adjust only for lost precision.
+    boost::optional<uint16_t> const num_decimals(std::max(required_decimals, specified_decimals));
+
     write(e_field_min_age            , min_age             );
     write(e_field_max_age            , max_age             );
     write(e_field_select_period      , select_period       );
@@ -1299,7 +1336,9 @@ std::string* table_impl::parse_string
 {
     throw_if_duplicate_record(ostr.is_initialized(), field, line_num);
 
-    if(value.empty())
+// For the moment, permit empty comments so that ancient tables can be read.
+//    if(value.empty())
+    if(value.empty() && e_field_comments != field)
         {
         fatal_error()
             << "non-empty value must be specified for the field '"
@@ -1852,11 +1891,11 @@ void table_impl::validate()
                 break;
             }
 
-        // We have a reasonable default for this field, so don't complain if
-        // it's absent.
         if(!num_decimals_)
             {
-            num_decimals_ = 6;
+            // This should be unreachable.
+            fatal_error() << "Number of decimals not specified." << LMI_FLUSH;
+//            num_decimals_ = 12;
             }
 
         // If we don't have the hash, compute it ourselves. If we do, check

-------->8-------->8-------->8-------->8-------->8-------->8-------->8--------

It makes a coarse guess of the precision of the tabular rates, warns
if that doesn't match the precision specified in the binary blob,
and uses the greater precision. For the table with the putatively
six-digit value mentioned at the beginning of this message, it says:

  Table specifies 6 decimals but 8 are required.
  [file /opt/lmi/src/lmi/rate_table.cpp, line 646]

  Verification failed for table #140: Error reading table from string:
  bad data for table 140: hash value 2893035046 doesn't match the
  computed hash value 1855339846.
  [file /opt/lmi/src/lmi/rate_table.cpp, line 2325]

and then writes the table with the clearly intended eight decimals.
It issues many false-positive warnings such as:
  Table specifies 8 decimals but 19 are required.
where the last ten digits are '9999999997' or '0000000002'; clearly
it would be more useful if I refined that heuristic. Such warnings cannot be
dismissed out of hand: table #150 in the same proprietary database
has values with seventeen significant digits at ages 83-85, but at
other ages values actually were rounded to the specified number of
decimals, so this is probably a spreadsheet error. Other warnings
like
  Table specifies 8 decimals but 6 are required.
indicate tables that explicitly include insignificant trailing zeros
that we should remove.


