Re: [lmi] Product editor
From: Greg Chicares
Subject: Re: [lmi] Product editor
Date: Thu, 09 Feb 2006 04:17:06 +0000
User-agent: Mozilla Thunderbird 1.0.2 (Windows/20050317)
On 2006-2-8 17:52 UTC, Evgeniy Tarassov wrote:
>
> I wanted to ask a couple of questions about lmi product_editor before
> implementing it.
> Currently in lmi cvs there are two sets of files implementing product
> and entities -- the ones without any prefix (dbvalue.hpp, dbdict.cpp,
> etc.) and others with ihs_ prefix (ihs_dbvalue.cpp, ihs_dbvalue.hpp).
> I suppose that ihs_ is the right set of files (i've got impression
> that its the refactoring of the old ones). Is it correct?
Yes, exactly. Eight years ago, we forked the code for reasons that
are too painful to recall. Most files had an (old but simple)
version with no prefix, and an 'ihs_'-prefixed version with many
new features and defects. Over the years, I've merged most of them.
Of the thirty-seven "prefixed" files that remain, fifteen relate to
the product editor, directly or indirectly.
Just ignore 'dbvalue.hpp' and use 'ihs_dbvalue.hpp', for example.
>>Every entity in every '.db4' file today depends on every one of those seven
>>axes, and that's the complete set--there are no others, today.
>>...
>> {Gender, Class, Smoking, Issue Age, Underwriting, State} (not Duration)
>
> I've noticed that there was an effort to extend the TDBValue class to
> support more than those 7 axis (see extra_axes_names,
> extra_axes_values, extra_axes_types class members in ihs_dbvalue.hpp),
> but its not finished (see
> // std::vector<double>::const_iterator i_val =
> extra_axes_values.begin();
> ...
> // i_val++; // TODO ?? Need std::vector of vectors?
> in ihs_dbvalue.cpp file (void
> TDBValue::FixupIndex(std::vector<double>& idx) const method).
I had less self-discipline then. I should either have finished that
work or removed those vestiges.
> Vis-a-vis db4 files editing: should we implement the support for
> those additional axes, or should we suppose the set of axis to be
> really freezed to those 7 axis (Gender, Class, Smoking, Issue Age,
> Underwriting, State, Duration) and simply ignore the other part of
> DBValue interface (support for additional axis)?
For now, we should certainly support only those seven axes.
Let me explain this in more detail. We face three obstacles:
(1) a library dependency,
(2) an unsuitable physical file structure, and
(3) an unsuitable logical file structure.
(1) The 'legacy' product editor depends on a non-free GUI library.
Right now, you're writing a replacement that depends on wx, which
is free. Until we have the free wx version, we can't get rid of
the old library, whose vendor stopped supporting it years ago.
But we have neither the time nor the desire to maintain the legacy
application.
(2) The '.db4' files are binary. They use 'ihs_*pios.?pp', a free
but ancient facility that's roughly similar to boost::serialization.
This is not a format I'd choose today. Today, I'd use xml.
(3) There's no really satisfying reason to distinguish the
.db4 .rnd .fnd .pol .tir
files from each other. A single format that could accommodate
them all would be much better.
These problems are all interlinked. The worst dependency is (1).
Modifying and rebuilding the legacy editor is costly and prone
to error. But if we don't do that, then we can't even add new
entities to the product database, much less replace the obsolete
serialization library (2), or even consider (3). So we need your
work on replacing (1) before we can make any progress at all.
Once that's done, I'd like to replace the serialization (2) with
xml. Right now, the product files we use in production (you only
have 'sample.*', but we have dozens of other sets) are created
by a really nasty program, which is much like 'ihs_dbdict.cpp'
function DBDictionary::WriteSampleDBFile(), but several thousand
lines long. This change is quite orthogonal to (1), but depends
on (1), because making parallel changes in the legacy editor
would be costly work that we'd just throw away and therefore
mustn't even consider doing.
Then, with obstacles (1) and (2) removed, we will be able to
unify the
.db4 .rnd .fnd .pol .tir
files. The 'extra_axes_names' artifacts you see represent an old,
incomplete, abortive attempt to do that. This "grand unification"
would affect the product editor: right now, its legacy version is
really four different editors (one for each file type, but there's
no '.fnd' editor). The '.tir' files are much like the '.db4' files
with a couple extra axes (that's what 'extra_axes_names' was
intended to deal with). The '.rnd' files could be expressed in
'.db4' form today. The '.pol' and '.fnd' files are just lists of
name:value pairs of type string, and MDGrid has more than enough
power to handle them.
Because of these interdependencies, I think we need to remove
these obstacles in the order given. Each one we remove alleviates
a current problem. And trying to solve all the problems at once
would mean we'd make no progress until we finish them all. Taking
them in this order means that the new product editor initially has
to replace the legacy application, and then later has to be adapted
to a new paradigm, after the "grand unification". That's why I'm
explaining all of this--so that you can envision that future need
and anticipate it in your present work.
Here's another obstacle, which I guess we should designate (4).
You probably have 'sample.dat' and 'sample.ndx', which are yet
another sort of database in yet another binary format. It just
so happens that they fit the '.db4' paradigm nicely: they have
either one or two axes chosen from {Issue Age, Duration}. They
use a different format, standardized by the Society of Actuaries
(SoA), which provides programs to edit them. Those third-party
programs are extremely problematic. To subsume this data into
the grand unified format would save us much labor. And the SoA
has already made its standard tables available as xml, so
this fits neatly with our plan for (2).
> I've coded a small test program that uses mdgrid to edit TDBValue
> object and discovered that when the user enables that entity variation
> along at least 6 axis, the TDBValue object consumes at least 10Mb of
> memory (for its internal data storage) and it takes ages to convert
> the data when the user checks or uncheckes those variation checkboxes.
> A example: imagine that we want our entity from db4 file vary upon all
> dimensions its depends on:
> gender = 3
> class = 4
> smoking = 3
> issue_age = 100
> uw_basis = 5
> state = 53
> duration = 100
>
> 3 * 4 * 3 * 100 * 5 * 53 * 100 * sizeof(double) = 763.200.000 (almost 1Gb)
>
> I suppose that this example is not realistic, but it shows that the
> user with a couple of simple manipulations can make the program to
> crash (out of memory) or at least to hang for quite some time.
In practice, it could be reasonable to work with
3 * 4 * 3 * 100 * 5 * 100
doubles. Data tables in the SoA format are often something like
100 * 100, and our '.db4' files might have 3 X 4 X 3 X 5 arrays
of pointers to sets of such data tables.
The original version of the legacy editor just threw an exception
if the object size exceeded 64K, but that was in a 16-bit world.
That was unduly restrictive: it limited us to 8K elements of type
double.
It'd be good enough to pop up a warning if users try to create
anything larger than some limit like one or ten megabytes. If they
choose to ignore the warning, then we should let them. IIRC, the
standard in our office for the people who would be most likely to
use the product editor is four gigabytes of RAM. They have other
software that works with the sort of datasets described above, and
the vendor recommends using at least that much RAM. They seem to
double it every few years.
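The size check behind such a warning can be sketched as follows. This is only an illustration, not code from lmi: the names 'estimated_bytes', 'should_warn', and 'warn_threshold_bytes' are hypothetical, and the ten-megabyte threshold is just one of the limits suggested above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical threshold: ten megabytes. One megabyte would be
// equally defensible; the exact figure is an assumption.
constexpr std::uint64_t warn_threshold_bytes = 10ULL * 1024 * 1024;

// Bytes needed to hold one double for every combination of axis
// values, given the number of values along each enabled axis.
inline std::uint64_t estimated_bytes(std::vector<std::uint64_t> const& axis_lengths)
{
    std::uint64_t n = 1;
    for(std::uint64_t length : axis_lengths)
        {
        n *= length;
        }
    return n * sizeof(double);
}

// True if the user should see a warning before the object is resized.
// The user remains free to ignore the warning and proceed.
inline bool should_warn(std::vector<std::uint64_t> const& axis_lengths)
{
    return warn_threshold_bytes < estimated_bytes(axis_lengths);
}
```

With eight-byte doubles, the seven-axis example quoted above needs 763,200,000 bytes, and the same example without the State axis needs 14,400,000 bytes. Both would trigger the warning, while a single 100 * 100 table (80,000 bytes) would not.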
> Concerning those seven axis: i think i understand the meaning of first
> 6 of them -- gender, class, smoking, us_basis and state dimensions are
> described by the corresponding xenum<> specialisations from
> xenumtypes.hpp file containing the set of possible values and a set of
> value names.
Yes.
> Is it correct that the issue_age axis takes its values in the [1..100]
> range and it corresponds to the 'strike' axis we have in the
> test_mdrgid example (Strike option is taken as a set of values in
> [10..100] range with a step of 5)?
Yes. It differs from 'Strike', of course, in that
- the step is always one
- its lower limit is always fixed at 0
- only its upper limit is dynamic
- its upper limit has a maximum of "about" 100
In the past, 100 was good enough. Soon, we'll need to accommodate
regulatory changes that require something like 120. This upper
limit can never exceed the maximum number of years that a human
can live. It's good enough to use a constant defined in one place
with a hardcoded value of, say, 120 that we can easily change.
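As a minimal sketch of that idea: the names 'max_issue_age' and 'issue_age_axis' below are hypothetical, and treating the upper limit as exclusive (so that a limit of 100 yields ages 0 through 99) is an assumption made only for illustration.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical constant, defined in exactly one place so that a
// regulatory change means editing a single line.
constexpr int max_issue_age = 120;

// Values along the issue-age axis: step one, lower limit fixed at
// zero, and only the upper limit chosen by the caller.
// Assumption: 'upper_limit' is exclusive, so the result holds
// 0, 1, ..., upper_limit - 1.
inline std::vector<int> issue_age_axis(int upper_limit)
{
    if(upper_limit < 1 || max_issue_age < upper_limit)
        {
        throw std::out_of_range("Issue-age upper limit out of range.");
        }
    std::vector<int> ages;
    ages.reserve(static_cast<std::vector<int>::size_type>(upper_limit));
    for(int age = 0; age < upper_limit; ++age)
        {
        ages.push_back(age);
        }
    return ages;
}
```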
> The duration axis is the problematic one for me. I have not found no
> description of what it is in the code. Could you please briefly
> explain to me what it is and how should it be implemented or where
> should i learn it from (class or source file)? :)
Let me try to give a realistic example. We have an entity,
'CurrPolFee', which is the "policy fee" we currently charge
("currently" means we could charge more in the future, subject
to government approval). A "policy fee" just means a routine
flat charge: we might charge four dollars a month for every
insurance policy, regardless of the amount of insurance. It
covers costs that don't really vary by amount of insurance,
such as telephone customer support.
Now, we might find that customers use more telephone support
in the first five years after they buy a policy, so perhaps
we reduce the charge from four dollars to three dollars
after five years. The data would look like this:
{4, 4, 4, 4, 4, 3}
which is really a 1 X 1 X 1 X 1 X 1 X 1 X 6 array.
One could even imagine that males use more telephone support
than females: a 3 X 1 X 1 X 1 X 1 X 1 X 6 array with data like
4, 4, 4, 4, 4, 3
6, 6, 6, 6, 6, 5
5, 5, 5, 5, 5, 4
the third row being used in jurisdictions that do not permit
rates to vary by gender. In practice, discriminating by gender
in this way wouldn't be politically acceptable (though there's
a scientific basis for distinguishing mortality rates by gender),
so this is an artificial example, but it illustrates the point.
Of course, we might also grade charges linearly:
{4.0, 3.8, 3.6, 3.4, 3.2, 3.0}
In all these cases, the value given for the last duration is
replicated for all future years. That last duration has to be
under the user's control: next year, we might change this
charge to
{4.0, 3.5, 3.0}
It's natural to ask whether some sparse representation wouldn't
make sense: run-length encoding could compress our original
schedule of charges
{4, 4, 4, 4, 4, 3} --> {4 [5 years], 3 [thereafter]}
For now at least, I favor simplicity. As soon as we contemplate
five-year steps, someone will want three-year steps instead;
or five years of yearly changes followed by a series of ten-year
steps; or linear interpolation; and so on. The simple approach
we're using (one value per year, the last value being replicated
for all future years) has proved quite reasonable in practice.
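The lookup rule this implies is simple enough to sketch in a few lines. The function name 'value_at_duration' is hypothetical and this is not lmi's actual code; it only shows "one value per year, last value replicated".

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Return the value for a zero-based policy duration. Durations past
// the end of the stored vector get the last stored value.
inline double value_at_duration
    (std::vector<double> const& values
    ,std::size_t                duration
    )
{
    if(values.empty())
        {
        throw std::invalid_argument("At least one duration is required.");
        }
    return values[std::min(duration, values.size() - 1)];
}
```

For the schedule {4, 4, 4, 4, 4, 3}, durations 0 through 4 give 4, and duration 5 and every later duration give 3.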
There's very little documentation of 'Duration' in the code
because everything I said above is "obvious" to a user. This
comment
// All items have duration as their last axis. Duration comes last
// so that a pointer calculated from all preceding axes points to
// consecutive durational elements in contiguous storage.
is significant, though; perhaps that's "obvious" to you as a
programmer where it wouldn't be understood by a user. Often
I'm asked why I don't use some "real" database program: sql or
something. That would be overkill: these 'database' files really
ought to be about 25Kbytes; we have some that are ten times that
size, but that's as big as they've ever gotten. I certainly don't
want to go through some dbms api to retrieve values, because that
would slow things down, no matter how good the implementation.
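To illustrate the comment about contiguous storage, here is a sketch of the usual row-major offset calculation with Duration as the last axis. The names 'axis_array' and 'flat_offset' are hypothetical; this shows the general technique, not the actual TDBValue implementation.

```cpp
#include <array>
#include <cstddef>

// Seven axes in the order {Gender, Class, Smoking, Issue Age,
// Underwriting, State, Duration}; Duration is deliberately last.
constexpr std::size_t n_axes = 7;
using axis_array = std::array<std::size_t, n_axes>;

// Row-major offset of one element in a flat array of doubles.
// Because Duration is the last axis, it varies fastest: fixing the
// first six indexes and stepping Duration by one moves the offset
// by exactly one element.
inline std::size_t flat_offset
    (axis_array const& lengths
    ,axis_array const& index
    )
{
    std::size_t offset = 0;
    for(std::size_t i = 0; i < n_axes; ++i)
        {
        offset = offset * lengths[i] + index[i];
        }
    return offset;
}
```

For the 3 X 1 X 1 X 1 X 1 X 1 X 6 example above, the second gender's row begins at offset 6, and its six durational values occupy offsets 6 through 11. A pointer to offset 6 can therefore be handed to calculation code, which reads consecutive durations without any further index arithmetic.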