From: Stefan Vargyas
Subject: Re: stat: added features: `--files0-from=FILE', `--digest-type=WORD' and `--quoting-style=WORD'
Date: Thu, 22 May 2014 07:05:17 -0700 (PDT)
> Date: Thu, 22 May 2014 12:28:22 +0100
> From: Pádraig_Brady <address@hidden>
> Subject: Re: stat: added features: `--files0-from=FILE', `--digest-type=WORD'
> join -j2 <(stat -c '%s %n' /bin/ls /bin/cp | sort) <(sha1sum /bin/cp
> /bin/ls | sort)
> tr '\n' '\1' |
> sort |
> uniq -u ...
Your remarks are correct iff stat and sha1sum *are* able to produce
consistently joinable outputs. However, when attempting to use such
patterns in *generally usable scripts*, one has to guard against the
inconsistencies (leading to bugs!) that occur when file names contain SPACE,
TAB, NL and other such chars.
A solution would be to impose TAB as the only field separator -- thus ensuring
that it cannot appear anywhere else. Then one might invoke join with
"-t $'\t'". Under this condition, it should be clearer why stat needs the
'--quoting-style=escape' and '--digest-type=sha1' options and the '%S' format
specifier.
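
The TAB-only convention can be sketched with the existing tools, as far as they
go. The file names and temporary paths below are made up for illustration, and
TAB is spelled via printf so that the snippet does not depend on bash's $'\t'.
It assumes GNU stat (for '--printf') and the usual "digest, two spaces, name"
output of sha1sum:

```shell
# Sketch only: file names and temp paths are invented for illustration.
TAB=$(printf '\t')               # portable spelling of bash's $'\t'
tmp=$(mktemp -d)
printf 'a'   > "$tmp/one file"   # this name contains a SPACE
printf 'bcd' > "$tmp/two"

joined=$(
    cd "$tmp" || exit 1
    # size<TAB>name, sorted on the name field
    stat --printf='%s\t%n\n' 'one file' two |
        LC_ALL=C sort -t "$TAB" -k2,2 > sizes.tsv
    # sha1sum prints "digest  name"; turn the first double space into a TAB
    sha1sum 'one file' two | sed "s/  /$TAB/" |
        LC_ALL=C sort -t "$TAB" -k2,2 > sums.tsv
    # join on field 2 (the name) of both files: name<TAB>size<TAB>digest
    LC_ALL=C join -t "$TAB" -j2 sizes.tsv sums.tsv
)
printf '%s\n' "$joined"
```

This copes with SPACE in names, but a name containing TAB or NL would still
break the record structure, which is the gap that escaping the name is meant
to close.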
> There is no advantage of supporting this option in stat
> as that is only useful when a command needs to process all
> file names in a _single invocation_, like when sorting or accumulating etc.
> For stat one can efficiently:
>
> find ... -print0 | xargs -r0 stat ...
>
> or
>
> find ... -exec stat {} +
One meaningful reason for a single invocation is efficiency. The input to stat
can be huge (and in the scenario I initially described it often is!) -- and
that potentially large amount of data propagates down through the multiple
pipelines and fifos of your scenario above.
> Note also that sort has the --zero-terminated option, as do newer versions of
> join and uniq.
The fanciful '-0|--null' option refers to both the input and the output of
sort. The existing '-z|--zero-terminated' -- only to sort's output.
> This could be useful, however there is already the %N option for quoted file
> name.
>
> $ stat -c %N /bin/ls
> ‘/bin/ls’
> $ LANG=C src/stat -c %N /bin/ls
> '/bin/ls'
Recall the claimed consistency from above. In case of symlinks, %N produces
output like the one below:
$ touch /tmp/foo
$ ln -sv /tmp/foo /tmp/bar
`/tmp/bar' -> `/tmp/foo'
$ stat -c %N /tmp/bar
`/tmp/bar' -> `/tmp/foo'
$
Also, in the case of symlinks, the digest-computing programs do follow the
links, i.e. they actually compute digests for the content of the file to which
the symlink points:
$ sha1sum /tmp/foo /tmp/bar
da39a3ee5e6b4b0d3255bfef95601890afd80709 /tmp/foo
da39a3ee5e6b4b0d3255bfef95601890afd80709 /tmp/bar
The semantics of %S in the proposed patches are different, however: the new
stat produces the digest of the *content* of the file itself. In the case of
symlinks, that content is the link text, obtained via 'areadlink_with_size':
$ stat2 -c '%S %n' /tmp/foo /tmp/bar
da39a3ee5e6b4b0d3255bfef95601890afd80709 /tmp/foo
469150566bd728fc90b4adf6495202fd70ec3537 /tmp/bar
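
The same distinction can be approximated with existing tools. In the sketch
below, 'content_sha1' is a hypothetical helper, not part of the patches, and it
assumes that the digest of a symlink is taken over the link text exactly as
readlink reports it, without a trailing newline. Whether that reproduces the
%S value byte for byte has not been checked here:

```shell
# Hypothetical helper: digest of a file's own content, not following symlinks.
content_sha1() {
    if [ -L "$1" ]; then
        # For a symlink, hash the link text itself (no trailing newline).
        # Caveat: $(...) would also strip newlines that end the link text.
        printf '%s' "$(readlink "$1")" | sha1sum | cut -d' ' -f1
    else
        sha1sum < "$1" | cut -d' ' -f1
    fi
}

tmp=$(mktemp -d)
: > "$tmp/foo"                   # empty regular file
ln -s "$tmp/foo" "$tmp/bar"      # symlink pointing at it

foo_sum=$(content_sha1 "$tmp/foo")
bar_sum=$(content_sha1 "$tmp/bar")
# What sha1sum itself does: follow the link and hash the target's content.
followed_sum=$(sha1sum < "$tmp/bar" | cut -d' ' -f1)

printf '%s  %s\n' "$foo_sum" foo "$bar_sum" bar "$followed_sum" 'bar (followed)'
```

Here the first and third digests coincide, since both hash the empty file,
while the second one differs because it covers the link text instead.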
Note that the STAT_* files of my initial usage scenario have an intrinsic
value of their own -- not only that of providing the means for verifying the
correctness of making ISO files or of burning DVDs. These files keep a quite
faithful record of the content of the file system itself.
With many thanks for your thorough response,
Stefan Vargyas.