Re: Please advise work around or bug fix


From: Bernhard Voelker
Subject: Re: Please advise work around or bug fix
Date: Wed, 24 Mar 2021 22:13:08 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

Hi Kam,

On 3/24/21 6:17 PM, Yuen, Kam-Kuen CIV USARMY DEVCOM SC (USA) via Bug reports 
for the GNU find utilities wrote:
> I am running the following command and the "ls" command gives error message 
> that the file cannot be found.  The problem is that the filename has spaces 
> as part of the filename.
> The purpose is to find all files that exceeding file size of 1k.  Filename 
> might include spaces, special character like '
> 
> find . -size +1k -print | xargs ls -sd

There is no bug in any of the tools involved in this command line, find(1), 
xargs(1) and ls(1).
It is merely a wrong assumption about how they work (together).

Assuming the above search matches these 2 files:

  $ touch 'This is a Test'
  $ touch  ' This is   another Test'

  $ ls -log
  total 0
  -rw-r--r-- 1 0 Mar 24 21:36 ' This is   another Test'
  -rw-r--r-- 1 0 Mar 24 21:35 'This is a Test'

find(1) will print the file names matching the criteria, each followed by a
newline character.
E.g.:
  This is a Test<newline>
   This is   another Test<newline>

Shown as hex output:

  $ find . -type f | od -tx1z
  0000000 2e 2f 20 54 68 69 73 20 69 73 20 20 20 61 6e 6f  >./ This is   ano<
  0000020 74 68 65 72 20 54 65 73 74 0a 2e 2f 54 68 69 73  >ther Test../This<
  0000040 20 69 73 20 61 20 54 65 73 74 0a                 > is a Test.<
  0000053

xargs(1) reads the entries from standard input, and by default assumes that
the entries are separated by a <blank> character or a <newline>.  See POSIX:

  [...] arguments in the standard input are separated by unquoted <blank>
  characters, unescaped <blank> characters, or <newline> characters.

'man xargs' also documents this near the top:

  [...] delimited by blanks [...] or newlines

With the above input from find(1), this means that xargs(1) recognizes the
following entries:
- 'This'
- 'is'
- 'a'
- 'Test'
- 'This'
- 'is'
- 'another'
- 'Test'

Note that blanks in the file names printed by find(1) lead to separate
entries, and that runs of extra blanks are ignored.

As all of the above entries easily fit into one invocation of the command to
run, ls(1), it is started with each of them as a separate argument.
strace(1) shows what will be executed:

  $ find . -type f | strace -ve execve xargs ls -logd
  [...]
  execve("/usr/bin/ls", ["ls", "-logd", "./", "This", "is", "another", "Test",
         "./This", "is", "a", "Test"], ...) = 0

Obviously, ls(1) will most likely fail to stat(2) any of these names
(or, in the worst case, it accidentally lists unrelated files which happen to
have one of the shorter names).
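
The splitting can be reproduced safely in a scratch directory. This is only a
sketch, assuming GNU find and xargs and a POSIX shell; the directory comes
from mktemp(1), and printf(1) stands in for ls(1) so that every argument the
command receives is printed in brackets on its own line.

```shell
# Work in a throwaway directory so that no real files are touched.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 'This is a Test' ' This is   another Test'

# printf repeats its format for every argument it receives,
# so each output line is one argument as split by xargs.
split_args=$(find . -type f | xargs printf '[%s]\n')
printf '%s\n' "$split_args"

# Count the arguments: far more than the two files that exist.
split_count=$(printf '%s\n' "$split_args" | wc -l)
echo "arguments seen: $split_count"

cd / && rm -rf "$demo"
```

None of the bracketed words is the name of an existing file, except "./",
which is the current directory.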

> 1)      The env is cygwin64 on Windows 10
> 
> 2)      Filename include space or special character
> 
> 3)      When running "ls" command directly on the folder, the screen show  " 
> ' " character surrounding the filename e.g. 'This is a Test Case With spaces 
> in Filename.pdf'

As the output is a terminal, ls(1) defaults to quoting each file name properly
so that it could be copy+pasted safely to another command. Although there are
discussions about this feature on the GNU coreutils mailing list, I personally
consider this a good thing.

> 4)      In the case the filename already has ' special character, the "ls" 
> command shows the filename with double " around the filename e.g. " This is a 
> Tester's File.pdf"

The same here: ls(1) quotes the file name so that it can be copy+pasted safely.
And note that this also includes the leading blank in the file name: " This ....".

> 5)      When saving simple "ls" output to a file, do not see the surrounding 
> character

Indeed, when printing to a file, ls(1) prints only the original characters of
the file names, without quoting.
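
Both behaviours can also be requested explicitly, independent of where the
output goes. A small sketch, assuming a GNU coreutils ls(1) recent enough to
know the 'shell-escape' quoting style; the file name is just an example:

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch "Tester's File.pdf"

# Raw characters of the name, as written when output goes to a file or pipe.
literal_out=$(ls --quoting-style=literal)

# Shell-quoted form, as shown on a terminal by recent GNU ls versions.
quoted_out=$(ls --quoting-style=shell-escape)

printf 'literal: %s\n' "$literal_out"
printf 'quoted:  %s\n' "$quoted_out"

cd / && rm -rf "$demo"
```

The quoting is purely a matter of presentation; the name stored in the
file system is the same in both cases.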

> 6)      Trying to use the -0 option with xargs but it complains the argument 
> line too long

When using 'xargs -0', the producer of the input also has to adhere to the
chosen convention and separate the entries by a NUL character instead of
newlines.
'man xargs' says:

  -0, --null
  ...
  The GNU find -print0 option produces input suitable for this mode.
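
To illustrate the mismatch, here is a sketch (assuming GNU find and xargs)
which counts the arguments that reach the command in both cases. The small
'sh -c' helper is only there for the demonstration. With newline-separated
input, 'xargs -0' never sees a NUL, so everything ends up in one single
argument; with a large find(1) output this would plausibly explain the
"argument line too long" complaint.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 'This is a Test' ' This is   another Test'

# Newline-separated producer, NUL-expecting consumer: no NUL is ever read,
# so both names (newlines included) arrive as one single argument.
mismatch_count=$(find . -type f -print | xargs -0 sh -c 'echo "$#"' sh)

# NUL-separated producer, NUL-expecting consumer: one argument per file.
match_count=$(find . -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "with -print:  $mismatch_count argument(s)"
echo "with -print0: $match_count argument(s)"

cd / && rm -rf "$demo"
```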

> Can you advise How to handle filename with hidden character like ' or space 
> or to report file size of current and subdirectories

There are several safer alternatives, all of them documented in the GNU
findutils manual.
  https://www.gnu.org/software/findutils/manual/find.html

E.g.

  # Tell find(1) to also use the NUL character as a separator: use -print0.
  # This is safe for really all possible file names, including those with
  # single or double quotes, tabs and blanks, and finally also newlines.
  # Yes, the only character which cannot occur is the NUL character.
  $ find . -size +1k -print0 | xargs -0 ls -sd

Note that xargs(1) will invoke ls(1) even if find(1) didn't match any file in
the above example.
It is better to use the -r, --no-run-if-empty option:

  $ find . -size +1k -print0 | xargs -r0 ls -sd

FWIW: One drawback is that there is a small race condition between the time
find(1) examines the file and the time ls(1) sees it: one has to be aware
that the file system is constantly changing.

Another alternative is to let find(1) directly print the file size and file
name.
This avoids the race condition.

  $ find . -size +1k -printf '%s %f\n'

Obviously, the output is not safe to process by another tool when a file name
contains a newline, but for human eyes it's probably good enough.
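
If the output has to be processed by another tool after all, one option is to
end each record with a NUL character instead of a newline. A sketch, assuming
GNU find (for -printf and its '\0' escape) and a tr(1) which understands
'\0'; the file names are made up:

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
head -c 2048 /dev/zero > 'This is a Test'
# A file name which contains a newline character.
head -c 2048 /dev/zero > 'two
lines'

# '\0' terminates each "size name" record with a NUL character.
# Count the NUL bytes to see how many records were written.
record_count=$(find . -type f -size +1k -printf '%s %f\0' | tr -cd '\0' | wc -c)
echo "records: $record_count"

# For display only: turn the NUL terminators back into newlines.
find . -type f -size +1k -printf '%s %f\0' | tr '\0' '\n'

cd / && rm -rf "$demo"
```

A consumer which splits the records on NUL sees exactly one record per file,
even for the name with the embedded newline.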

Furthermore, there are also alternatives with other tools, e.g. the du(1)
command from the GNU coreutils has a -t, --threshold option to filter files
by their sizes (but also outputs directories):

  $ du -at +1k

Hope this helps.

Have a nice day,
Berny


