RFC: Safely using xargs -P$NUM children's output? Need a new tool?


From: Denys Vlasenko
Subject: RFC: Safely using xargs -P$NUM children's output? Need a new tool?
Date: Thu, 2 May 2019 15:57:54 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0

I'm working on improving a script in rpmbuild:

# Strip static libraries.
for f in `find "$RPM_BUILD_ROOT" -type f -a -exec file {} \; | \
        grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug"  | \
        grep 'current ar archive' | \
        sed -n -e 's/^\(.*\):[  ]*current ar archive/\1/p'`; do
        $STRIP -g "$f"
done

for the case when that directory contains 100k files
(most of which are not ar archives).
See
https://bugzilla.redhat.com/show_bug.cgi?id=1691822

My first stab at it is:

# Sometimes this is run on trees with 100 thousand files. Be efficient:
# xargs -r -P$NPROC -n16 file | sed 's/:  */: /'
# is used to
# * xargs: avoid fork overhead per each file
# * -P$NPROC: parallelize
# * -n16: avoid running just a few "file LIST" cmds with huge LISTs
# * sed 's/:  */: /': defeat "file F1 FILE2" columnar formatting:
#   | F1:    type1 <--- extra spaces after colon
#   | FILE2: type2
NPROC=`nproc`
for f in `find "$RPM_BUILD_ROOT" -type f | \
        grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug" | \
        xargs -r -P$NPROC -n16 file | sed 's/:  */: /' | \
        grep 'current ar archive' | \
        sed -n -e 's/^\(.*\):[  ]*current ar archive/\1/p'`; do
        $STRIP -g "$f"
done

Stress-testing the
    xargs -r -P$NPROC -n16 file | sed 's/:  */: /'
construct, however, revealed that on sufficiently large machines the pipe
buffer fills up and the "file" processes experience partial writes,
garbling the output. Specifically, this:

find /usr -print0 | xargs -0r -P199 -n16 file | sed 's/:  */: /' | sort

does not produce the same output every time, and diffing the results clearly
shows partial writes creating overlapping output.
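
One way to make that comparison repeatable is sketched below. It is only an illustration: `stable_output` is a hypothetical helper name, not an existing tool, and it assumes `mktemp`, `sort`, `cmp` and `diff` are available. It runs a command twice, sorts both outputs, and reports whether they match.

```shell
# Run a command twice and check that the sorted outputs are identical.
# "stable_output" is a hypothetical helper name used only for this sketch.
stable_output() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    "$@" 2>/dev/null | sort >"$a"
    "$@" 2>/dev/null | sort >"$b"
    if cmp -s "$a" "$b"; then
        rc=0
    else
        rc=1
        # Show the first differing lines to make garbling visible.
        diff "$a" "$b" | head -n 20
    fi
    rm -f "$a" "$b"
    return "$rc"
}

# Assumed invocation, mirroring the pipeline above:
# stable_output sh -c 'find /usr -print0 | xargs -0r -P199 -n16 file | sed "s/:  */: /"'
```

A zero exit status only means the two runs agreed; it does not prove that the output is free of interleaving.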

Eventually I managed to achieve correct operation this way:

find /usr -print0 | xargs -0r -P199 -n16 sh -c 'file "$@" | dd bs=1M iflag=fullblock 2>/dev/null' ARG0 >$TEMPFILE

sh -c 'file "$@" | dd' ARG0
  - this construct runs a small shell script, supplying the filenames as
    $1, $2, $3...
    (ARG0 is necessary, otherwise the first filename would become $0, not $1)
  - within this small shell script, nothing is parallelized: one "file"
    writes to one "dd".
"dd bs=1M iflag=fullblock" is a rather crude method to ensure that all input
  is read, grouped into one block, and written in one write() call. Without
  this, "file" processes sometimes seem to do short writes. That is fine for
  serially running processes, but when 199 "file"s write to the same fd,
  short writes give the writes an opportunity to interleave.
  I tried using "sed" instead of dd, but apparently "sed" can also do short
  writes.
$TEMPFILE
  Unlike with a pipe, even very large writes (~1 MB) are never partial when
  writing to a file.
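
The role of the ARG0 placeholder can be seen in a small sketch. The file names a, b and c are made up for illustration, and `echo` stands in for the real `file | dd` script:

```shell
# With a placeholder, ARG0 becomes $0 and the names start at $1.
with_placeholder=$(printf '%s\n' a b c | xargs sh -c 'echo "$0 | $*"' ARG0)

# Without it, the first name is consumed as $0 and is missing from $*.
without_placeholder=$(printf '%s\n' a b c | xargs sh -c 'echo "$0 | $*"')

echo "$with_placeholder"
echo "$without_placeholder"
```

In the second form, a script that only looks at "$@" would silently skip one file per batch.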

"dd bs=1M iflag=fullblock 2>/dev/null" is ugly. There is no need for a
fixed-size buffer; it should grow as needed. A 1M buffer is definitely enough
for _this_ case, but in other cases the size may be more variable. Guessing it
for every use case is ugly.

Tried "tail -n999999" but it does not write output in one write().

I propose that we create a new tool, which, in Unix tradition, does just one
thing: collects input until EOF, then writes it out in one write().
/usr/bin/coalesce ?






