[lmi] change file formats to XML

From:	Vaclav Slavik
Subject:	[lmi] change file formats to XML
Date:	Fri, 26 Feb 2010 15:56:47 +0100

Hi,

I uploaded the XML file formats patch to Savannah:

   https://savannah.nongnu.org/patch/index.php?7101

Some comments about the patch:

(0) The API for XML format I/O is in xml_serialize.hpp. By default,
stream operator<< and operator>> are used for storing value types, but
this can be (and is) customized by specializing the
xml_serialize::type_io<> template.
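
To illustrate the customization point, here is a minimal sketch of the
pattern. It is not the code from the patch: the names sketch::type_io,
to_text(), from_text() and the rounding_style enum are assumptions made
for this example, and it converts to and from plain strings rather than
XML nodes.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

namespace sketch
{
// Hypothetical stand-in for xml_serialize::type_io<>: it converts to and
// from plain text rather than XML nodes.
template<typename T>
struct type_io
{
    // Default: write the value with operator<<.
    static std::string to_text(T const& value)
    {
        std::ostringstream oss;
        oss << value;
        return oss.str();
    }

    // Default: read the value back with operator>>.
    static T from_text(std::string const& text)
    {
        std::istringstream iss(text);
        T value;
        if(!(iss >> value))
            throw std::runtime_error("cannot parse: " + text);
        return value;
    }
};

// Illustrative enum with no stream operators of its own.
enum class rounding_style {upward, downward, nearest};

// Customization: a full specialization replaces the stream-based default.
template<>
struct type_io<rounding_style>
{
    static std::string to_text(rounding_style value)
    {
        switch(value)
        {
            case rounding_style::upward:   return "Upward";
            case rounding_style::downward: return "Downward";
            case rounding_style::nearest:  return "Nearest";
        }
        throw std::logic_error("invalid rounding_style");
    }

    static rounding_style from_text(std::string const& text)
    {
        if(text == "Upward")   return rounding_style::upward;
        if(text == "Downward") return rounding_style::downward;
        if(text == "Nearest")  return rounding_style::nearest;
        throw std::runtime_error("unknown rounding style: " + text);
    }
};

// Usage: both the default and the specialized path round-trip a value.
inline void check_round_trips()
{
    assert(type_io<int>::from_text(type_io<int>::to_text(8)) == 8);
    assert(type_io<rounding_style>::to_text(rounding_style::downward) == "Downward");
    assert(type_io<rounding_style>::from_text("Upward") == rounding_style::upward);
}
} // namespace sketch
```

Callers always go through type_io<T>, so a type without usable stream
operators only needs its own specialization.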


(1) "Legacy" formats support

After applying this patch, LMI can still read (but not write) the
previous (flat-file or binary) file formats, unless
LMI_NO_LEGACY_FORMAT is defined at compile time.


(2) New file formats

XML formats use extensions with "x" added, e.g. *.xfnd or *.xpol. This
keeps the old extensions for the legacy loaders, so older versions of
the format can still be read. Alternatively, we could either switch
formats in one step (if you can mass-convert all data files at once) or
use the same extension and detect which of the loaders should be used by
checking whether the file is XML or not.
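
For the last alternative, the check could be as simple as the sketch
below. The function name looks_like_xml() is hypothetical, and the
heuristic assumes that no legacy file begins with '<' after optional
whitespace, which I have not verified for the binary formats. It also
does not handle a byte-order mark.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>

// Hypothetical helper: returns true if the first non-whitespace character
// in the stream is '<', which is how a well-formed XML document starts.
inline bool looks_like_xml(std::istream& is)
{
    char c;
    while(is.get(c))
    {
        if(!std::isspace(static_cast<unsigned char>(c)))
        {
            return c == '<';
        }
    }
    // Empty or whitespace-only input is not XML.
    return false;
}

// Usage: pick the loader based on the detected format.
inline void check_looks_like_xml()
{
    std::istringstream xml("  <?xml version=\"1.0\"?><root/>");
    std::istringstream flat("8 Downward");
    std::istringstream empty("");
    assert(looks_like_xml(xml));
    assert(!looks_like_xml(flat));
    assert(!looks_like_xml(empty));
}
```

The stream is consumed up to the first significant character, so a real
loader would have to reopen or rewind the file before parsing it.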


(3) Performance

Not so great. I modified product_files to both read and write sample
files, 100 times in a row. Here are its timings on my system:

  legacy:
    100x write   = 1.80s
    100x read    = 0.70s

  xml:
    100x write   = 7.26s
    100x read    = 4.98s

That makes writes about 4 times slower and reads over 7 times slower. On
the other hand, that's under 100ms to do the slowest thing (writing)
with all product files once, so maybe it would be acceptable to check
this in at least temporarily?
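
For reference, a measurement of this kind and the arithmetic behind the
ratios can be sketched as follows. This is not the modified
product_files code; seconds_for_repeats() and check_quoted_ratios() are
names invented for the example, and only the four timings are taken from
the figures above.

```cpp
#include <cassert>
#include <chrono>

// Runs f() 'repeats' times and returns the elapsed wall-clock seconds.
template<typename F>
double seconds_for_repeats(int repeats, F f)
{
    auto const start = std::chrono::steady_clock::now();
    for(int i = 0; i < repeats; ++i)
    {
        f();
    }
    std::chrono::duration<double> const elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Timings quoted above, in seconds per 100 repetitions.
double const legacy_write = 1.80;
double const legacy_read  = 0.70;
double const xml_write    = 7.26;
double const xml_read     = 4.98;

// Checks the ratios and the per-pass cost derived from those timings.
inline void check_quoted_ratios()
{
    double const write_ratio = xml_write / legacy_write;        // about 4.03
    double const read_ratio  = xml_read / legacy_read;          // about 7.11
    double const ms_per_write_pass = 1000.0 * xml_write / 100;  // about 72.6
    assert(4.0 < write_ratio && write_ratio < 4.1);
    assert(7.1 < read_ratio && read_ratio < 7.2);
    assert(ms_per_write_pass < 100.0);
}
```

A call such as seconds_for_repeats(100, write_all_files), with
write_all_files standing for whatever callable does the work, would
give a number comparable to those above.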

As for read speed, get_property() is inefficiently implemented, but that
shouldn't show up unless there are a lot of properties on one object,
which there aren't. I suspect some copying inside xmlwrapp, but I
haven't investigated it yet.

I'll focus on the performance later if needed. For now, I'd like to know
what you think about this approach globally, and the serialization API
in particular.


(4) File sizes

Unsurprisingly, XML files are larger. There's some room for improvement,
e.g. using <i> instead of <item> for vector items, if size matters. Or
we can compress them with gzip:

-rw-r--r-- 1 vasek vasek  24259 2010-02-26 15:41 sample.db4
-rw-r--r-- 1 vasek vasek     34 2010-02-26 15:41 sample.fnd
-rw-r--r-- 1 vasek vasek    575 2010-02-26 15:41 sample.pol
-rw-r--r-- 1 vasek vasek     56 2010-02-26 15:41 sample.rnd
-rw-r--r-- 1 vasek vasek    266 2010-02-26 15:41 sample.tir
-rw-r--r-- 1 vasek vasek 142745 2010-02-26 15:35 sample.xdb4
-rw-r--r-- 1 vasek vasek   4389 2010-02-26 15:35 sample.xdb4.gz
-rw-r--r-- 1 vasek vasek    185 2010-02-26 15:35 sample.xfnd
-rw-r--r-- 1 vasek vasek    147 2010-02-26 15:35 sample.xfnd.gz
-rw-r--r-- 1 vasek vasek   4976 2010-02-26 15:35 sample.xpol
-rw-r--r-- 1 vasek vasek   1309 2010-02-26 15:35 sample.xpol.gz
-rw-r--r-- 1 vasek vasek   1273 2010-02-26 15:35 sample.xrnd
-rw-r--r-- 1 vasek vasek    296 2010-02-26 15:35 sample.xrnd.gz
-rw-r--r-- 1 vasek vasek   5594 2010-02-26 15:35 sample.xtir
-rw-r--r-- 1 vasek vasek    615 2010-02-26 15:35 sample.xtir.gz


(5) Versioning

I opted for implicit versioning. As long as there's just one version of
the format, no versioning code is needed. If the format changed, then
the respective type_io<T>::from_xml() would be updated to deal with it.
For example:

- When a new field is added and has a sensible default, load it with
   get_property(node, "foo", foo, default_foo_value). Nothing else is
   needed to read both versions. In the other direction, unrecognized
   properties are simply ignored.

- If a field's semantics change, rename it in the format. Then, read
   (and correctly interpret) the old field only if the new field isn't
   found.

- If needed, we could add explicit versions on serialized keys, e.g.

     <coi_rate version="2">
       <decimals>8</decimals>
       <style>Downward</style>
     </coi_rate>

   (We'd have to add code for reading the attribute later.)
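
The first two cases can be sketched as below. This is only an
illustration under stated assumptions: a std::map stands in for an XML
element, the get_property() overloads are simplified guesses rather than
the signatures in the patch, and the fields "decimals", "rate" and
"rate_percent" are made up for the example.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

namespace sketch
{
// Stand-in for an XML element: child name -> text content.
typedef std::map<std::string, std::string> node;

// Returns true and sets 'out' if 'name' is present and parses as T.
template<typename T>
bool get_property(node const& n, std::string const& name, T& out)
{
    node::const_iterator i = n.find(name);
    if(i == n.end())
        return false;
    std::istringstream iss(i->second);
    T value;
    if(!(iss >> value))
        return false;
    out = value;
    return true;
}

// Same, but falls back to 'default_value' when the property is missing.
template<typename T>
void get_property(node const& n, std::string const& name, T& out, T const& default_value)
{
    if(!get_property(n, name, out))
        out = default_value;
}

struct record
{
    int decimals;
    double rate;
};

record read_record(node const& n)
{
    record r;
    // Added field with a sensible default: older files simply lack it.
    get_property(n, "decimals", r.decimals, 8);
    // Renamed field: prefer the new name, else reinterpret the old one.
    if(!get_property(n, "rate", r.rate))
    {
        double percent = 0.0;
        get_property(n, "rate_percent", percent, 0.0);
        r.rate = percent / 100.0;
    }
    // Properties that are never looked up are ignored automatically.
    return r;
}

// Usage: an old-style and a new-style node both load correctly.
inline void check_versions()
{
    node old_file;
    old_file["rate_percent"] = "50";
    record const a = read_record(old_file);
    assert(a.decimals == 8 && a.rate == 0.5);

    node new_file;
    new_file["decimals"] = "6";
    new_file["rate"] = "0.25";
    new_file["unknown"] = "ignored";
    record const b = read_record(new_file);
    assert(b.decimals == 6 && b.rate == 0.25);
}
} // namespace sketch
```

No version number appears anywhere; the reader infers what it needs
from which properties are present.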


(6) The DatabaseNames enum is treated as a special case in the one place
where it's used (to serialize TDBValue::key). Strong-arming it into
mc_enum<> would IMHO not be worth the effort, and as the comment in the
relevant code explains, adding a type_io<DatabaseNames> specialization
would be a bit of a mess too.

Regards,
Vaclav
