
Re: feature request for coreutils: b2sum


From: Assaf Gordon
Subject: Re: feature request for coreutils: b2sum
Date: Mon, 31 Oct 2016 22:18:55 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0

Hello Pádraig and all,

I have some concerns about supporting different algorithms in b2sum.
[long email ahead...]

First, usability of multiple algorithms.
----
From a usability POV, it is rarely (never?) the case that making a program 
multithreaded or parallel, or optimizing it for a different CPU type, changes 
the output of the program in a user-noticeable way.

That is, "gzip" vs "pigz", "bzip2" vs "pbzip2", "make" vs "make -j", "sort" vs 
"sort --parallel" all produce the same output from the user's POV (I'll note 
that gzip and pigz do produce binary-different output, but the content from 
the user's POV is equivalent).
Similarly, optimizing for 32 bits vs 64 bits should never alter the result 
from a user's POV.

I'm not saying these optimizations are not worthwhile, but I do worry that in 
practice they constitute a completely different checksum.
Saying in the manual:
   "blake2bp - This is like ‘blake2b’ but takes advantage of multiple CPU 
cores."
gives the impression that it is equivalent, when it is not.
A warning in the preceding paragraph of the manual will not be noticed by most 
users, and it's not in "--help" or the man page (and even if it were, users 
would still ignore it).
It's one thing if there's a "gotcha" about program behavior that has existed 
for decades, preceding standardization, and we have to live with it for 
backward compatibility.
But this is a new feature, and we should avoid making it confusing.
In a previous message you wrote:

    I might also drop the blake2s and blake2sp implementations
    as 32 bit can get equivalent but somewhat slower functionality with
    `b2sum -l 256 -a blake2b`

But it is not equivalent - it's a different result:

    $ echo a | ./src/b2sum -l256 -ablake2b
    be29a54b934581ab434fde713c16db07c3e0124a371daca7c33588be7526630e  -
    $ echo a | ./src/b2sum -ablake2s
    1e61fdb001508ebd3c559545024e7a58a67aeb124308a24bbd13f7ed09d9f2c7  -

If by "equivalent" you mean "happens to be the same digest length but a 
different value",
then I fear many non-tech-savvy users will not be aware of this distinction.
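To make the distinction concrete, here is a small sketch using Python's 
standard hashlib module (whose blake2b/blake2s follow the reference 
implementation, so the values should correspond to the b2sum outputs above): 
a blake2b digest truncated to 256 bits and a blake2s digest have the same 
length but different values for the same input.

```python
import hashlib

# `echo a` emits "a" plus a trailing newline; that is what b2sum hashed above.
data = b"a\n"

# blake2b truncated to a 256-bit digest (the `-l256 -ablake2b` case)...
h_b = hashlib.blake2b(data, digest_size=32).hexdigest()
# ...versus blake2s, whose digest here is also 256 bits (the `-ablake2s` case).
h_s = hashlib.blake2s(data, digest_size=32).hexdigest()

# Same length, different algorithm, different value.
assert len(h_b) == len(h_s) == 64
assert h_b != h_s
```

So the two commands are interchangeable only in digest length, not in the 
digest itself.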

Coreutils has a long history of dealing with the full spectrum of users, from 
computer experts to complete novices.
I'm sure you have more experience than I do with users' confusion, but to me 
this sounds like inviting trouble.

This also means that if someone with a slow 32-bit computer generates blake2 
checksums using "-a blake2s" (the optimal option for them),
they force every future user of those checksums to use the single-core 32-bit 
algorithm "blake2s", regardless of any newer hardware they have.

Also,
up until now, the name of the utility (e.g. "sha1sum") was enough to identify 
the program.
So if someone named their checksum file "data.sha1sum", that was a good hint 
to use "sha1sum -c", and the results should be valid.
But calling the file "data.b2sum" is no longer sufficient - there are at 
least 4 variants of "b2sum".
And more variants if we consider "--length"...


Second, the "--length" parameter.
----
This is something that will be relevant for future SHA3 and perhaps others, so 
I think we might want to take a second look and think ahead.
Up until now,
with the existing utilities md5sum, sha1sum, sha224sum, sha384sum, sha512sum, 
the length of the digest in the checksum file was a usability hint about which 
algorithm it was. The length was fixed: a checksum file with 40 hex characters 
was sha1, 32 was md5, 128 was sha512.

Whether we like it or not, it was a good hint.
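That fixed-length hint can be written down directly. A small sketch (the 
table below is mine, listing the standard coreutils digest sizes in hex 
characters, i.e. 2 x digest bytes); note the mapping is one-to-one only 
while each length implies a single algorithm - a 64-character digest is 
exactly what a truncated blake2b now collides with:

```python
# Hex-digest length -> the algorithm it traditionally implied.
# (Hex length is 2 x digest size in bytes.)
DIGEST_HINTS = {
    32: "md5",      # 128-bit digest
    40: "sha1",     # 160-bit digest
    56: "sha224",   # 224-bit digest
    64: "sha256",   # 256-bit digest -- same length as `b2sum -l 256`!
    96: "sha384",   # 384-bit digest
    128: "sha512",  # 512-bit digest -- same length as default blake2b
}

assert DIGEST_HINTS[40] == "sha1"
```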

With sha3 and blake2, the digest also defaults to 512 bits, so "128 
characters means sha512" loses its value as a hint - but that's unavoidable.
The bigger problem is that with variable-length digests in the same utility,
it becomes much harder to know the correct parameters.
I think automatic length detection should be enabled when checking, even 
without "--tag".

Since I also believe that machines should work harder than people, it would 
be nice to have an "--autodetect" kind of parameter
that automatically tests multiple algorithms based on the given digest length 
- it costs more CPU time,
but can save users some annoyance.
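Such an "--autodetect" could look something like the following Python sketch 
(the function name and candidate list are hypothetical, just to illustrate 
the idea; hashlib stands in for the coreutils digest code):

```python
import hashlib

# Hypothetical autodetection: given a hex digest and the file contents,
# try every algorithm that could have produced a digest of that length.
CANDIDATES = ["md5", "sha1", "sha224", "sha256", "sha384", "sha512",
              "blake2b", "blake2s"]

def autodetect(hex_digest, data):
    """Return the names of all candidate algorithms matching the digest."""
    size = len(hex_digest) // 2  # digest size in bytes
    matches = []
    for name in CANDIDATES:
        try:
            if name == "blake2b":
                # blake2 digests are truncatable, so any size up to the
                # maximum could have produced this length.
                h = hashlib.blake2b(data, digest_size=size)
            elif name == "blake2s":
                h = hashlib.blake2s(data, digest_size=size)
            else:
                h = hashlib.new(name, data)
        except ValueError:
            continue  # this digest size is not valid for this algorithm
        if h.hexdigest() == hex_digest:
            matches.append(name)
    return matches
```

Note that a 256-bit digest would match both sha256 and a truncated blake2b 
if both happened to produce it, so the function returns a list rather than a 
single name - which is itself a demonstration of the ambiguity.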


Third, multi-threaded scalability.
---
The multi-threaded version of 'blake2bp' uses a fixed number of threads via 
OpenMP:

     #pragma omp parallel shared(S), num_threads(PARALLELISM_DEGREE)

With PARALLELISM_DEGREE being 4 for 'blake2bp' and 8 for 'blake2sp'.

But the implementation is not really scalable (in the sense that adding more 
threads would make things faster while producing an identical result).
On one hand, reducing PARALLELISM_DEGREE (e.g. to 2) produces a different 
digest,
and increasing it (e.g. to 8) results in a segfault (at least on my machine 
with 4 cores).

On the other hand, the hard-coded value of 4 means 'blake2bp' ALWAYS creates 
3 extra threads (calls "clone(2)" three times),
regardless of how many cores the machine actually has. This seems like bad 
form IMHO.

It also means, if I understand the code correctly, that we can never increase 
the number of threads - it is fixed at four forever,
otherwise it would give different checksums. So regardless of future hardware 
improvements, we're stuck at 4 threads because that was a reasonable value in 
2016.
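The deeper reason the digest changes with the thread count is that in 
BLAKE2's tree mode the fanout (parallelism degree) is part of the hashed 
parameter block, not just an execution detail. A simplified Python sketch 
using hashlib's tree parameters (this is NOT the real blake2bp data layout 
or parameter set - just my illustration that the lane count is baked into 
the digest):

```python
import hashlib

def tree_digest(data, lanes):
    """Toy BLAKE2b tree hash: the lane count is part of every parameter
    block, so changing it changes the result even for identical input."""
    # Stripe the input across `lanes` leaves (NOT the real blake2bp split).
    chunks = [data[i::lanes] for i in range(lanes)]
    # Hash each leaf with tree parameters identifying its position.
    leaves = [
        hashlib.blake2b(chunk, fanout=lanes, depth=2, inner_size=64,
                        node_offset=i, node_depth=0).digest()
        for i, chunk in enumerate(chunks)
    ]
    # The root node hashes the concatenated leaf digests.
    root = hashlib.blake2b(fanout=lanes, depth=2, inner_size=64,
                           node_depth=1, last_node=True)
    for digest in leaves:
        root.update(digest)
    return root.hexdigest()

data = b"x" * 1024
# Same input, different parallelism degree -> different digest.
assert tree_digest(data, 4) != tree_digest(data, 2)
```

So "more threads, same checksum" is impossible here by construction: the 
parallelism degree is an input to the hash, not an optimization knob.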



Last but not least, OMP_DYNAMIC
---

Using a fixed number in "#pragma omp parallel num_threads()" in OpenMP causes 
the code to ignore OMP_NUM_THREADS,
but it still honors OMP_DYNAMIC. If OMP_DYNAMIC is set to TRUE, the system is 
not obliged to provide 4 threads,
and the resulting digest is different:

On a machine with 2 cores:

     $ nproc
     2

     $ seq 10000 | OMP_DYNAMIC=FALSE ./src/b2sum -ablake2bp
     824b2157073309456d6c515b062b96da5098f37561a462024931eaea5af524142cb811eb22f7dc2ceeca36dfae108bb71d7aca0d2a3072e8b65f877fe9008a4e  -

     $ seq 10000 | OMP_DYNAMIC=TRUE ./src/b2sum -ablake2bp
     6f032b13b235b541c091b4afd9e01159eb5417fbc9194daf90edb21da9170f82eb8dd62a85a14d3884b5d630a318e63f547e9309582967a6c058b1339843d57b  -

Similar results happen on machines with 4 or more cores if the machine is 
overloaded.

I also suspect, though I'm not sure, that OpenMP allows the implementation to 
terminate the program
if the requested number of threads isn't available and OMP_DYNAMIC=FALSE.



To conclude,
I recommend implementing one, and only one, blake2 hash variant.
Choose the official reference one (if there is one),
and stick to it regardless of threads or bit size.
In the future, I'm sure there will be both hardware and software improvements 
that can speed up the implementation while keeping the exact same algorithm 
intact for backward compatibility.


Thanks for reading so far,

regards,
 - assaf
