From: Vadim Zeitlin
Subject: Re: [lmi] [lmi-commits] master 9c510ad 16/22: Measure elapsed time for MD5 data-file validation
Date: Sun, 29 Mar 2020 19:58:21 +0200

 Hello,

 First of all, thanks a lot for merging this pull request (and sorry for a
number of small whitespace/formatting problems that slipped through our
reviews)!

 Additionally, concerning this commit and the questions raised by it:

On Sat, 28 Mar 2020 18:23:38 -0400 (EDT) Greg Chicares <address@hidden> wrote:

GC> branch: master
GC> commit 9c510ad08bb0ec0ee3e30fe28ec3ff085f0b90e5
GC> Author: Gregory W. Chicares <address@hidden>
GC> Commit: Gregory W. Chicares <address@hidden>
GC> 
GC>     Measure elapsed time for MD5 data-file validation
GC>     
GC>     It is anticipated that this commit will soon be reverted. This
GC>     instrumentation could have been placed on a throwaway branch, but it's
GC>     more convenient to keep it on the main trunk.
GC>     
GC>     Measure how long it takes to validate MD5 files by two methods:
GC>      - an external md5sum program, as in the past; and
GC>      - internally, as now.

 I'd expect there to be a constant difference (in favour of the internal
calculation), as the two methods use more or less the same code, but in the
external case we also have to pay the penalty for shelling out to another
process. Of course, I'm fully ready for my expectations to be wrong...

GC>     /opt/lmi/bin[0]$wine ./lmi_wx_shared --mellon \
GC>       --data_path=/opt/lmi/data --pyx=measure_md5
GC>     Assay: production 1 milliseconds
GC>     Assay: external program 96 milliseconds

 This would tend to indicate that the penalty is of the order of 100ms.

GC>     ...and for a maximal 'validated.md5' prepared with the full list of
GC>     files in 'fardel_checksummed_files', using every known product, and
GC>     generating PDF illustrations several times (hence the repeated timings):
GC>     
GC>     /opt/lmi/bin[1]$wine ./lmi_wx_shared --mellon \
GC>       --data_path=/opt/lmi/data --pyx=measure_md5
GC>     Assay: production 87 milliseconds
GC>     Assay: external program 245 milliseconds
GC>     Assay: internal 97 milliseconds

 However, here it's more like 150ms.

GC>     Assay: production 114 milliseconds
GC>     Assay: external program 180 milliseconds
GC>     Assay: internal 115 milliseconds
GC>     Assay: production 114 milliseconds
GC>     Assay: external program 181 milliseconds
GC>     Assay: internal 115 milliseconds
GC>     Assay: production 115 milliseconds
GC>     Assay: external program 183 milliseconds
GC>     Assay: internal 119 milliseconds

 And now it's down to 65ms.
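
 For reference, the differences I'm quoting can be tabulated with a few
lines of Python (used here purely for illustration, it has nothing to do
with lmi's own code). The figures are copied from the commit message
above; the first run reported no "internal" timing, so I'm assuming its
"production" figure can stand in for it.

```python
# Timings in milliseconds, copied from the quoted commit message.
# Each pair is (external program, internal). The first run printed no
# "internal" line, so its "production" figure (1 ms) is used instead;
# that substitution is an assumption.
runs = [
    (96, 1),
    (245, 97),
    (180, 115),
    (181, 115),
    (183, 119),
]

def overheads(pairs):
    """Return external-minus-internal time for each run, in ms."""
    return [external - internal for external, internal in pairs]

diffs = overheads(runs)
print(diffs)                    # [95, 148, 65, 66, 64]
print(max(diffs) - min(diffs))  # 84
```

 So the apparent penalty varies by more than 80ms between runs, which is
of the same order as the penalty itself.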

 I don't know if the benchmarking data confirm or refute my hypothesis, to
be honest. Differences of a few dozen milliseconds could well be due to
external factors when working with files on a not completely idle
system. To really check whether the difference is constant, we'd need to
benchmark computing the hash of a much larger file.
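
 A rough sketch of such a benchmark, again in Python just for
illustration: it hashes temporary files of increasing size both
in-process and via a child process. Everything in it is an assumption
made for the sake of the example; in particular, the child process is
another Python interpreter standing in for an external md5sum program,
so its startup cost is not representative of the real one. Only the
shape of the result matters: if the hypothesis holds, the gap between
the two columns should stay roughly constant as the size grows.

```python
import hashlib
import os
import subprocess
import sys
import tempfile
import time

# One-liner run in the child process; a stand-in for an external md5sum.
CHILD = (
    "import hashlib, sys; "
    "print(hashlib.md5(open(sys.argv[1], 'rb').read()).hexdigest())"
)

def md5_in_process(path):
    """Hash the file in this process, reading it in 64 KiB chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def md5_external(path):
    """Hash the file by spawning a child process and reading its output."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def timed_ms(func, path):
    """Return (digest, elapsed milliseconds) for one call of func(path)."""
    start = time.perf_counter()
    digest = func(path)
    return digest, (time.perf_counter() - start) * 1000.0

def compare(sizes):
    """For each size, return (size, digests_match, internal_ms, external_ms)."""
    rows = []
    for size in sizes:
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(b"\0" * size)
        try:
            d_in, t_in = timed_ms(md5_in_process, tmp.name)
            d_ex, t_ex = timed_ms(md5_external, tmp.name)
        finally:
            os.unlink(tmp.name)
        rows.append((size, d_in == d_ex, t_in, t_ex))
    return rows

if __name__ == "__main__":
    for size, same, t_in, t_ex in compare([1 << 10, 1 << 20, 1 << 24]):
        print(f"{size:>9} bytes  match={same}  "
              f"internal={t_in:.1f}ms  external={t_ex:.1f}ms")
```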

GC>     Is the extra security worth the extra delay?

 I don't know about this either: ~100ms is already noticeable, and I think
it could easily be worse on slower machines and/or when using slower
storage.

 In fact, the main reason for writing this reply is that I'm surprised by
how long it takes to compute the hash. Using "openssl speed md5", I get
~300MB/s for 64-byte blocks, so ~100ms would correspond to about 30MB of
data. While I don't know exactly how big the files we're verifying are,
I'm pretty sure that they're nowhere close to 30MB in size. So I wonder
if it could be useful to profile the code doing this calculation to check
whether we're doing something stupid? There is not much scope for
improving the MD5 algorithm itself, of course, but we could have some
easy-to-fix problems in its implementation. And if we could reduce the
delay from a couple of hundred milliseconds to a couple of milliseconds,
then I think it would be simple to answer the question above positively.
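
 The back-of-the-envelope estimate above can be written out as follows.
This is only an illustrative Python sketch: the function names are made
up, and the throughput it measures is that of Python's hashlib on an
in-memory buffer with a 64 KiB block size (an arbitrary choice), not
that of lmi's MD5 code or of "openssl speed md5".

```python
import hashlib
import time

def implied_megabytes(throughput_mb_per_s, elapsed_ms):
    """Amount of data (in MB) hashed in elapsed_ms at the given rate."""
    return throughput_mb_per_s * elapsed_ms / 1000.0

def measure_md5_throughput(total_bytes=8 * 1024 * 1024,
                           block_size=64 * 1024):
    """Return MD5 throughput in MB/s (1 MB = 10**6 bytes), in memory."""
    block = b"\0" * block_size
    hasher = hashlib.md5()
    start = time.perf_counter()
    for _ in range(total_bytes // block_size):
        hasher.update(block)
    hasher.hexdigest()
    elapsed = time.perf_counter() - start
    return (total_bytes / 1e6) / elapsed

if __name__ == "__main__":
    # ~300MB/s sustained for ~100ms corresponds to about 30MB of input.
    print(implied_megabytes(300, 100))  # 30.0
    print(f"{measure_md5_throughput():.0f} MB/s on this machine")
```

 Dividing the actual total size of the validated files by a throughput
measured this way would give a lower bound to compare the observed
~100ms against.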

 Please let me know if you think it would be worth looking at this,
VZ
