Re: [lmi] Measuring MD5 overhead
From: Vadim Zeitlin
Subject: Re: [lmi] Measuring MD5 overhead
Date: Mon, 6 Apr 2020 20:19:06 +0200
On Mon, 6 Apr 2020 15:34:56 +0000 Greg Chicares <address@hidden> wrote:
GC> and I'd like to know how much relative performance would be impaired.
GC> Thus, 'git show 77626be8dc06':
GC>
GC> use first one of these commands, then the other
GC> wine ./lmi_wx_shared --data_path=/opt/lmi/data
GC> wine ./lmi_wx_shared --data_path=/opt/lmi/data --pyx=measure_md5
GC> to run some scenario like
GC> File | New | Illustration
GC> OK
GC> File | Print to PDF
GC> and compare the elapsed time shown on the statusbar, to see the cost
GC> of reauthenticating before generating each PDF.
GC>
GC> On my machine, running under 'wine', I see:
GC> 341 msec without '--pyx=measure_md5'
GC> 405 msec with '--pyx=measure_md5'
GC> and (405-341)/341 is about a twenty-percent penalty.
Here, using files that may differ slightly, but not materially, I get
(minimum of 10 runs) 77ms without measure_md5 and 150ms with it, i.e.
the time roughly doubles.
Just for illustration, I also ran the same test with the MSVS (release,
i.e. optimized) binary and the results were, somewhat surprisingly, 62ms
without measure_md5 and the same 150ms with it.
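To put the measurements quoted in this message on a common scale, here is a minimal sketch of the same arithmetic as the (405-341)/341 calculation above. The function name overhead_percent is purely illustrative and is not part of lmi.

```cpp
#include <cassert>

// Relative cost, in percent, of a run with revalidation compared with a
// run without it. Hypothetical helper, used only for this illustration.
constexpr double overhead_percent(double without_ms, double with_ms)
{
    return 100.0 * (with_ms - without_ms) / without_ms;
}

// Timings quoted in this message:
//   overhead_percent(341.0, 405.0)  // wine build:           about 18.8
//   overhead_percent( 77.0, 150.0)  // first figures above:  about 94.8
//   overhead_percent( 62.0, 150.0)  // MSVS release binary:  about 141.9
```

On these figures the absolute cost of revalidation is similar in both of my builds (73ms and 88ms), so the faster the baseline, the larger the relative penalty appears.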
GC> On the other hand, Kim sees no noticeable penalty, running lmi
GC> under native msw on a typical underpowered corporate laptop
GC> (with a typical dataset, which might be one-third the size of
GC> the '*.xyz' one used above):
GC>
GC> # first trial
GC> Without '--pyx=measure_md5', Output: 451 milliseconds
GC> With '--pyx=measure_md5', Output: 448 milliseconds
GC>
GC> # second trial
GC> Without '--pyx=measure_md5', Output: 452 milliseconds
GC> With '--pyx=measure_md5', Output: 438 milliseconds
The factor of 3 difference for the latter is not that surprising (my PC is
7.5 years old, but it was pretty high end back when I assembled it), but the
difference for the "normal" time is quite impressive... Worse, 60-70ms is
not really noticeable, but 450ms definitely is, so there is clearly
scope for some optimization here.
GC> Could I ask you to do the same (using native msw and the '*.xyz'
GC> dataset) and report your results here? If your results roughly
GC> agree with Kim's, then we should probably just resolve the
GC> "TODO ??" issue above by inhibiting the md5sum validation cache
GC> and revalidating before producing every report.
Unfortunately my results are quite different...
GC> > I thought we could decrease the time further by running several processes
GC> > in parallel, but this doesn't help -- it looks like the overhead of
GC> > launching a process is too high for such small tasks, even under Linux, and
GC> > even if I use just 8 (== number of cores) processes in total. I'd like to
GC> > explore using several threads for executing this in parallel inside a
GC> > single process, normally this should result in a noticeable gain and this
GC> > is supposed to be simple to do in modern C++, in theory (but things have an
GC> > annoying tendency to work somewhat differently in practice, so we'll see).
GC>
GC> It's worth looking for a way to make validation faster, because
GC> that might
GC> - reduce the revalidation penalty, making the decision above easier; and
GC> - perhaps even reduce lmi's startup time (regardless of how the decision
GC> above is made, always validating all data files at least once, at
GC> startup, has some nonzero cost).
GC>
GC> I would guess that threading won't help much, because the time it takes
GC> to (re)validate a file is probably dominated by file I/O.
We could read the files from multiple threads too, but I don't know if
it's really a good idea.
GC> But go ahead and try that if you like, because my guess may be wrong.
I'll try, although not immediately as I still have wxGrid changes to
finish... This shouldn't take long and even if we implement another
solution later, this could provide a nice improvement right now (I think
even office machines must have at least 2, or maybe even 4, cores).
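As a rough idea of what the multithreaded variant could look like, here is a minimal sketch using std::async. It is not lmi code: the names file_checksum, expected_sum and validate_in_parallel are hypothetical, and a 64-bit FNV-1a hash stands in for MD5 only to keep the example self-contained.

```cpp
#include <cstdint>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Stand-in for MD5 (assumption: real code would call lmi's MD5 routine).
// Computes 64-bit FNV-1a over the file's bytes; returns false if the
// file cannot be opened.
bool file_checksum(std::string const& path, std::uint64_t& sum)
{
    std::ifstream ifs(path, std::ios::binary);
    if(!ifs) return false;
    sum = 0xcbf29ce484222325ULL;           // FNV-1a offset basis
    char c;
    while(ifs.get(c))
    {
        sum ^= static_cast<unsigned char>(c);
        sum *= 0x100000001b3ULL;           // FNV-1a prime
    }
    return true;
}

// One manifest entry: a file and the checksum it is expected to have.
struct expected_sum
{
    std::string   path;
    std::uint64_t sum;
};

// Check each file in its own asynchronous task; succeed only if all match.
bool validate_in_parallel(std::vector<expected_sum> const& manifest)
{
    std::vector<std::future<bool>> results;
    for(auto const& e : manifest)
    {
        results.push_back(std::async(std::launch::async, [e]
            {
                std::uint64_t actual = 0;
                return file_checksum(e.path, actual) && actual == e.sum;
            }));
    }
    bool all_ok = true;
    for(auto& r : results)
    {
        all_ok = r.get() && all_ok;        // wait for every task
    }
    return all_ok;
}
```

Note that std::launch::async may start one thread per file, so a real implementation would probably bound the number of concurrent tasks to the number of cores. If the work turns out to be dominated by file I/O, as suggested above, the gain may be small.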
GC> Here's another idea: reduce the amount of file I/O by redesigning the
GC> revalidation code. Running an illustration requires accessing certain files:
GC>
GC> - 'tables.{dat,ndx}', which are in a binary format that end users cannot
GC> readily modify--so it seems adequate to validate those only at startup;
Do we need to do it in order to immediately detect any tampering or
corruption? Or was this just the simplest way to do it and we wouldn't lose
anything if we postpone validating them until their first use?
GC> - 'whatever_product.{database,funds,policy,rounding,strata}', which most
GC> end users can modify using the product editor;
Sorry, but it's my day of stupid questions today: if they can be modified,
how does it work with the existing validation scheme? AFAICS, wrap_fardel
target includes all these files in validated.md5, so changing them would
result in a validation failure during the next run. What am I missing?
GC> - '*.mst', which anyone can modify using a text editor...which is why we
GC> actually distribute "ROT256" versions named '*.xst' (it's inversion
GC> rather than rotation, but there's no standard short name for that,
GC> though we might say "bytewise 256s' complement"). That obfuscation is
GC> inconvenient for Kim and me;
[please count me in]
GC> and the original '*.mst' contents don't contain any trade secrets,
GC> so we don't care if anyone can read them--we just don't want anyone
GC> to modify them.
GC>
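For readers unfamiliar with the "ROT256" transformation mentioned above, here is a minimal sketch. It assumes that "bytewise 256s' complement" means mapping each byte b to (256 - b) mod 256; the actual lmi implementation may differ (for example, it could be a bitwise NOT), and the name invert_bytes is hypothetical.

```cpp
#include <string>

// Assumption: each byte b becomes (256 - b) mod 256. Applying the
// function twice restores the original, so the same routine both
// obfuscates and deobfuscates.
std::string invert_bytes(std::string s)
{
    for(char& c : s)
    {
        auto const b = static_cast<unsigned char>(c);
        // Unsigned arithmetic wraps modulo 2^N, so narrowing 0u - b
        // to unsigned char yields (256 - b) mod 256.
        c = static_cast<char>(static_cast<unsigned char>(0u - b));
    }
    return s;
}
```

Because the mapping is its own inverse, it provides no secrecy at all; it only makes the files inconvenient to edit in a text editor, which matches the stated goal.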
GC> For MD5 revalidation to provide security, we must perform it whenever data
GC> flows through some chokepoint, which can be any of these:
GC>
GC> (1) Each invocation of Authenticity::Assay() (if we inhibit its caching).
GC> This brute-force approach (revalidate every file that could possibly be
GC> used) was the only simple option when we were invoking an external md5sum
GC> program, but it's needlessly slow.
GC>
GC> (2)(a) Each XML file-read operation that goes through libxml2, if we use
GC> some compression algorithm (which provides sufficient opacity to inhibit
GC> casual users). But this works only for XML files, and we found practical
GC> difficulties with both libz and liblzma when we tried using them.
I don't even remember what these problems were, but I definitely agree
that keeping the files in their original text form would be preferable.
GC> (2)(b) Each MST file-read operation that goes through "ROT256". Thus,
GC> the union 2({a,b}) covers all the files we need to care about.
GC>
GC> (3) Use 'cache_file_reads.hpp' for all the data files listed above.
GC> We already use it for '*.database', for reasons of performance. Using
GC> it for all data files would presumably make lmi faster (a pure win,
GC> in and of itself), and it would also introduce a convenient chokepoint
GC> that we could use for (re)validation. Its documentation says:
GC> /// For each filename, the cache stores one instance, which is
GC> /// replaced by reloading the file if its write time has changed.
GC> It would be too harsh to prohibit all changes to all of these files
GC> (then the product editor wouldn't be usable), but we could figure out
GC> what to do in each case, e.g., prohibit using modified '*.mst' files
GC> without '--ash_nazg', or '*.{database,policy,...}' without '--mellon'.
GC>
GC> What do you think of (3)?
I didn't even know that the file_cache class existed until today, but it
does look like a neat solution to the problem. I'll have to think a bit
more about what would be the best way to integrate it with validation, but
it certainly should be doable.
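To make option (3) more concrete, here is a minimal sketch of a cache that reloads a file when its write time changes and validates it at that single chokepoint. It is not the interface of 'cache_file_reads.hpp': the class name validating_cache and its validator callback are assumptions made for illustration only.

```cpp
#include <chrono>
#include <filesystem>
#include <fstream>
#include <functional>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical cache: one stored instance per filename, replaced by
// reloading the file if its write time has changed. Every (re)load
// passes through the validator, which is the chokepoint.
class validating_cache
{
  public:
    using validator = std::function<bool(std::string const& contents)>;

    explicit validating_cache(validator v) : validate_(std::move(v)) {}

    std::string const& get(std::filesystem::path const& p)
    {
        auto const t = std::filesystem::last_write_time(p);
        auto it = cache_.find(p);
        if(it == cache_.end() || it->second.write_time != t)
        {
            std::ifstream ifs(p, std::ios::binary);
            std::ostringstream oss;
            oss << ifs.rdbuf();
            std::string contents = oss.str();
            // Validation happens only here, when data is actually read.
            if(!validate_(contents))
                throw std::runtime_error("validation failed: " + p.string());
            it = cache_.insert_or_assign(p, entry{t, std::move(contents)}).first;
        }
        return it->second.contents;
    }

  private:
    struct entry
    {
        std::filesystem::file_time_type write_time;
        std::string                     contents;
    };

    validator                               validate_;
    std::map<std::filesystem::path, entry>  cache_;
};
```

With this shape, an unmodified file is validated once and later requests cost only a write-time check, while a modified file is revalidated on its next use. The validator is where a per-file policy could be applied, such as rejecting modified '*.mst' files unless a particular command-line option was given.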
GC> Of course, we could combine (3) with (2)({a,b}): instead of using
GC> libxml2's attractive (but imperfectly integrated) decompression, we
GC> could perform decompression ourselves at the (3) chokepoint.
I'm also not really sure if this is worth it, but I'll think more about
this too -- right now I just wanted to reply quickly to let you know about
the results of my testing.
Regards,
VZ