Re: RFE: 'multisum' - read file once, compute multiple checksums


From: Pádraig Brady
Subject: Re: RFE: 'multisum' - read file once, compute multiple checksums
Date: Fri, 01 Nov 2013 15:13:57 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 11/01/2013 02:27 PM, Pádraig Brady wrote:
> From: https://bugzilla.redhat.com/1025675
> 

> File format
> -----------
> \?<checksum_type>:<checksum> [* ]<path>
> 
> records should be:
> * grouped by path
> * ordered by checksum type within one group

> You may also consider adding 'multisum' file format support to existing tools.
> For instance md5sum -c being able to verify all records starting with 'md5:'

So this part of the request is already handled in newer coreutils,
where the checksum utils support the --tag option to identify the
checksum used. We reused the BSD format here for extra compatibility.

$ tee >(md5sum --tag) >(sha1sum --tag) < /etc/passwd >/dev/null
SHA1 (-) = 30529b9c1622452b4488f229e7f8d36cc49579ba
MD5 (-) = 6d8d8033d929f93998c08a30c92a5b8d
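As a rough sketch of the verification side: tagged output from several
utils can go into one file, and each util's -c mode then checks the lines
tagged with its own algorithm. The temporary directory and the names
sample.txt and sums are made up for illustration. Lines for other
algorithms are reported as improperly formatted on stderr, which is
discarded here.

```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/sample.txt"

# Write both tagged checksums into a single file.
( cd "$dir" && md5sum --tag sample.txt > sums && sha1sum --tag sample.txt >> sums )

# Each util verifies only the lines carrying its own tag; the warning
# about the other algorithm's line goes to stderr and is dropped.
md5_result=$(cd "$dir" && md5sum -c sums 2>/dev/null)
sha1_result=$(cd "$dir" && sha1sum -c sums 2>/dev/null)

printf '%s\n%s\n' "$md5_result" "$sha1_result"
rm -rf "$dir"
```

Adding --strict would make the foreign lines count as errors, so it is
left out of this sketch.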

Using tee like above does have disadvantages.

1. Redundant write to /dev/null. (Note you can't really pipe to another
checksum util instead, as then the output from the previous utils would go
to that too, since the pipe is set up before the coprocesses.)

2. This doesn't support multiple files well, since the file name isn't output,
and it would also need one process per file for each checksum, which wastes CPU.

> Here's my proposal for 'multisum' behaviour:
>
> Usage [add]
>
>   -s, --checksum       checksum type
>   Note: Checksum type can be specified multiple times, but at least once.
>         If specified when verifying checksums, only checksums for given
>         checksum types will be verified.

Now if you look at this more generally, the use case is not specific to the
checksumming utils, and really comes down to processing files efficiently,
for which hopefully there are already the appropriate tools available.

If you did have a multisum util, then you would really want to be doing the
reading in one process and the checksumming in other processes/threads to take
advantage of multicore.

Now you could get much of that implicitly with separate checksum utilities
(processes) and file caching to avoid the multiple IO overhead.

But you would also have to be careful that you wouldn't have multiple processes
fighting over a disk head for example.

So these general system dependent properties would be best handled
with general utilities if possible.

To illustrate how you might split up file processing across CPUs
while taking advantage of cache, consider the following xargs command,
which is tuned for 2 CPUs (running md5sum & sha1sum). The runs are
batched in groups of 10 so that later files don't evict files that are
yet to be processed from cache.
If you have small files then you could increase the -n value.
If you have more CPUs then you could increase the -P value.
Note also you may want to change the '&' to a ';' in the command below
if you have a mechanical disk rather than an SSD, to avoid multiple
processes fighting over a disk head.

$ seq 20 | xargs -n10 -P1 sh -c 'echo md5sum --tag "$0" "$@" & echo sha1sum --tag "$0" "$@"'
sha1sum --tag 1 2 3 4 5 6 7 8 9 10
md5sum --tag 1 2 3 4 5 6 7 8 9 10
sha1sum --tag 11 12 13 14 15 16 17 18 19 20
md5sum --tag 11 12 13 14 15 16 17 18 19 20
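The echo commands above only print what would be run. A minimal sketch of
the same batching with the real utilities follows, assuming GNU find and
xargs. The scratch directory, the file names and the trailing "wait" are
my additions for illustration; "wait" keeps each sh alive until the
backgrounded md5sum has finished.

```shell
# Hypothetical scratch directory with 20 small files, for illustration only.
dir=$(mktemp -d)
for i in $(seq 20); do printf 'data %s\n' "$i" > "$dir/file$i"; done

# Batches of 10 files; within each batch md5sum and sha1sum run concurrently.
# NUL-separated names (-print0 / -0) keep unusual file names intact.
out=$(find "$dir" -type f -print0 |
  xargs -0 -n10 -P1 sh -c 'md5sum --tag "$0" "$@" & sha1sum --tag "$0" "$@"; wait')

# Sort so the MD5 and SHA1 lines are grouped, whichever util finished first.
printf '%s\n' "$out" | sort
rm -rf "$dir"
```

Since each output line carries both the algorithm tag and the file name,
the combined output can be sorted or split afterwards as needed.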

Note also the GNU parallel command for dividing up such workloads.

So currently I don't think a separate utility is required for this.

thanks,
Pádraig.


