Re: [lmi] Ancient rate-table programs: GIGO


From: Greg Chicares
Subject: Re: [lmi] Ancient rate-table programs: GIGO
Date: Sun, 11 Dec 2016 18:38:45 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.4.0

On 2016-12-11 14:27, Vadim Zeitlin wrote:
> On Sat, 10 Dec 2016 22:16:55 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> Kim has discovered a new class of errors in historical rate tables.
> GC> A text-format table that looked like this in relevant part [note the
> GC> inconsistency]:
> GC> 
> GC> Number of decimal places: 6
> GC> Table values:
> GC> ...
> GC>  25  0.00002551
> GC> 
> GC> [spoiler alert--the inconsistency is identified below]
> GC> 
> GC> was merged into a binary {.dat .ndx} blob
> 
>  Sorry, I'm afraid I've lost the big picture here. Do I understand
> correctly that the goal is to accept such binary tables on input and
> produce output with the correct number of decimal places (8) for them?

The big picture is that we have tainted data that we must remediate.
The deeper we dig, the more problems we find, and we don't know whether
we've found them all yet. Until we do, we can't fully specify all the
procedures that may be necessary.

The immediate goal is to try what you've stated, in order to remove
one particular class of historical mistakes. That may be enough to
restore some tables to their intended state, so far as we can divine
the intention. It won't be enough for at least one table that looks
like this (simplified example):
   0 0.001
   1 0.002
   2 0.004
 ... [more ages with three-digit values]
  82 0.101
  83 0.123456789012345
In that case, it seems that the original author intended to round all
values to three decimals, but failed to do so toward the end of the
table; most likely, rounding the last value to 0.123 is the right
thing to do, but we'll need to delve into the historical archives to
be sure (perhaps, e.g., this table by its nature should be rounded up).
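
To make the repair concrete, here is a minimal sketch of the sort of
helper I have in mind (not lmi code; the name, signature, and the
choice of std::round are placeholders, and round-half-up versus
round-up must follow whatever the archives say the table requires):

  #include <cmath>
  #include <vector>

  // Hypothetical helper: force every tabular value to the number of
  // decimals the table declares, so that a stray 0.123456789012345 in
  // a three-decimal table becomes 0.123.
  std::vector<double> round_to_declared_decimals
      (std::vector<double> const& values, int decimals)
  {
      double const scale = std::pow(10.0, decimals);
      std::vector<double> rounded;
      rounded.reserve(values.size());
      for(double v : values)
      {
          rounded.push_back(std::round(v * scale) / scale);
      }
      return rounded;
  }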

Ultimately, for production we'll want to use only remediated data that
pass every reasonable test. For the process of remediation, we need to
load tainted tables and try to clean them up. Once remediation is
complete, we can reject tainted text input, and refuse to produce a
tainted binary (which should be impossible unless we discover a latent
logic error); thereafter, we won't ever work with a tainted binary, so
the question of whether we can load one will never arise.

The unreliable old process was to accept possibly-invalid text input,
merge it into a binary blob, and discard the input. The remediation
process attempts to recover (discarded) text input that would have
produced that blob, fix its errors, and re-merge it into a repaired
blob. The new maintenance process is to maintain all text input in
git, and regenerate the whole distributable blob whenever any text
input file changes.

> GC> Here is a patch that I will probably commit on the belief that it
> GC> does no harm yet adds some value:
[...]
> GC> The second chunk might indeed do nothing. My guess is that a similar
> GC> or identical condition is tested upstream, throwing an exception like
> GC> 
> GC>   Verification failed for table #140: Error reading table from string:
> GC>   bad data for table 140: hash value 2893035046 doesn't match the
> GC>   computed hash value 1855339846.
> 
>  Yes, this check is redundant.

Thanks for the confirmation. Removed from my working copy.

> GC> on failure. However, the first chunk definitely finds problems that
> GC> aren't otherwise found. Without that patch chunk, verify() performs
> GC> this chain of conversions:
> GC> 
> GC>   bin0 --> txt1 --> bin2 --> txt3
> GC> 
> GC> and compares txt1 and txt2
>                              ^
>  (I assume this was just a typo and "txt3" was meant)

Yes, you are right.

> GC> (this change compares bin0 and bin2 as well: an independent condition).
> 
>  I wouldn't say "independent", but it is indeed not identical to the
> current check, because I hadn't realized we could lose information
> during "bin -> txt" conversion if the binary table contained numbers
> with greater precision than its "number of decimals" field indicated.

None of us anticipated that.
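
To restate the comparison in code, here's a rough sketch; the
conversion helpers are stand-ins for what rate_table.cpp actually
does, declared only so the fragment is self-contained. The point is
that comparing the two text forms alone cannot catch precision
silently dropped in the first bin -> txt step, whereas comparing bin0
with bin2 can:

  #include <stdexcept>
  #include <string>

  // Stand-ins for the real conversions (sketch only).
  std::string binary_to_text(std::string const& bin);
  std::string text_to_binary(std::string const& txt);

  void verify_round_trip(std::string const& bin0)
  {
      std::string const txt1 = binary_to_text(bin0); // may silently lose precision
      std::string const bin2 = text_to_binary(txt1);
      std::string const txt3 = binary_to_text(bin2);
      if(txt1 != txt3)
          throw std::runtime_error("text round trip failed");
      if(bin0 != bin2) // the extra check added by the first patch chunk
          throw std::runtime_error("binary round trip failed");
  }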

>  The only way to ensure that this change is not needed would be to make
> certain that we never lose (or even change) any data when doing this
> conversion. However, that would mean refusing to load binary files with
> the wrong "number of decimals" specified in them; and if my guess at the
> beginning of this email is correct, that isn't the right thing to do, so
> we can't really achieve this.

Ideally, in production we'd use only validated data, so there'd be no
need to revalidate it on each use. Whether "slightly" erroneous data
would be accepted or rejected is a question that wouldn't arise: we'd
presumably not repeat validity checks, especially if they're expensive,
and even if we did, they wouldn't find any flaws, because flaws would
have been removed before release. Yet catastrophic flaws (e.g., missing
tables) would still
cause exceptions to be thrown.

Thus, there are some checks that need not be performed during use in
production, but must be performed during maintenance.
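
Schematically, and with every name invented for this sketch rather
than taken from lmi, the distinction might look like:

  enum class run_context {production, maintenance};

  struct table;                        // stand-in for a loaded rate table
  void check_minimal(table const&);    // cheap structural checks
  void check_exhaustive(table const&); // expensive consistency checks

  // Catastrophic flaws always throw; expensive consistency checks run
  // only while remediating or maintaining tables, never in production.
  void check_table(table const& t, run_context c)
  {
      check_minimal(t);
      if(run_context::maintenance == c)
      {
          check_exhaustive(t);
      }
  }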

> GC> Here is a separate patch that isn't suitable for production in its
> GC> present form, but I'm using it for experimental investigations.
> 
> ... [patch and its explanation snipped] ...
> 
>  I think it would make sense to integrate this patch into production,

Instead of "production", I should have said the git repository.
The source code affected is excluded from the production system we
distribute.

> possibly after improving it, as this would seem to be the right (or the
> "least wrong", if you prefer) thing to do in this case. Surely doing what
> we do now, i.e. silently losing the data if the number of decimals is
> wrong, is much worse than that.

Aye.

>  However, notice that if such a patch were integrated into the production
> code, then the extra check added by the first chunk of the first patch
> above risks being harmful, because it would still be triggered (due to
> the different values of the "number of decimals" field in "bin0" and
> "bin2") even though everything would work as well as it possibly could.

For now at least, 'rate_table.?pp' code is not used in production.
If it eventually replaces 'actuarial_table.?pp' in production, then
we'll need to reconsider this: perhaps we'd add conditions that
would suppress this code and also the code that throws if the CRC
is invalid. But we don't need to figure that out now.

> GC> It issues many false positive warnings such as:
> GC>   Table specifies 8 decimals but 19 are required.
> GC> where the last ten digits are '9999999997' or '0000000002'; clearly
> GC> it would be more useful if I refined that. Such warnings cannot be
> GC> dismissed out of hand: table #150 in the same proprietary database
> GC> has values with seventeen significant digits at ages 83-85, but at
> GC> other ages values actually were rounded to the specified number of
> GC> decimals, so this is probably a spreadsheet error.
> 
>  Should I do anything about this, e.g. work on refining this patch or help
> with it in any other way?

No, thanks, I think I know how to do this, though of course I welcome
your comments.
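
For the record, the refinement I have in mind is roughly this sketch
(the name, loop bound, and especially the tolerance are placeholders
to be tuned): treat a value as fitting N decimals if rounding it to N
decimals reproduces it to within a tiny relative tolerance, so that
representation noise like a trailing '9999999997' or '0000000002' no
longer inflates the count.

  #include <algorithm>
  #include <cmath>

  // Hypothetical sketch: smallest d such that rounding v to d decimals
  // reproduces v to within a small relative tolerance.
  int decimals_required(double v, int max_decimals = 20)
  {
      for(int d = 0; d <= max_decimals; ++d)
      {
          double const scale = std::pow(10.0, d);
          double const r = std::round(v * scale) / scale;
          if(std::fabs(r - v) <= 1.0e-12 * std::max(1.0, std::fabs(v)))
          {
              return d;
          }
      }
      return max_decimals;
  }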

> GC> Other warnings like
> GC>   Table specifies 8 decimals but 6 are required.
> GC> indicate tables that explicitly include insignificant trailing zeros
> GC> that we should remove.
> 
>  But we could also leave them unchanged; there is no real harm in this,
> is there?

We're changing plenty of stuff already. We may as well include some
changes that produce what we really want in its simplest form. Then we
can retain strict tests in the maintenance process, so that a table
with uniformly two-decimal data would be rejected if presented thus:
  0 0.0000000000000000
  1 0.12
  2 0.23000000000
  3 0.450
on aesthetic grounds alone. Even uniform padding is an impediment to
understanding: if a table begins like this:
  0 0.00000
  1 0.12000
  2 0.23000
  3 0.45000
I form an impression that it's two-decimal data padded on the right
with three superfluous zero columns, but then I have to scan the
whole table to learn whether to disregard the last three columns.
Worse, the superfluity invites error: it's all too easy to use a
different actual precision when modifying any row.
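
A strict maintenance-time test along those lines could simply compare
the declared precision with the precision the data actually need,
e.g. using the hypothetical decimals_required() sketched earlier
(names again invented; this is an illustration, not lmi code):

  #include <algorithm>
  #include <vector>

  int decimals_required(double v, int max_decimals = 20); // sketched above

  // Reject (or renormalize) a table whose declared "Number of decimal
  // places" exceeds what any of its values actually needs, so that
  // uniform zero-padding can't creep in.
  bool declared_decimals_are_minimal(std::vector<double> const& values, int declared)
  {
      int needed = 0;
      for(double v : values)
      {
          needed = std::max(needed, decimals_required(v));
      }
      return needed == declared;
  }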



