RFC: Safely using xargs -P$NUM children's output? Need a new tool?
From: Denys Vlasenko
Subject: RFC: Safely using xargs -P$NUM children's output? Need a new tool?
Date: Thu, 2 May 2019 15:57:54 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0
I'm working on improving a script in rpmbuild:
# Strip static libraries.
for f in `find "$RPM_BUILD_ROOT" -type f -a -exec file {} \; | \
grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug" | \
grep 'current ar archive' | \
sed -n -e 's/^\(.*\):[ ]*current ar archive/\1/p'`; do
$STRIP -g "$f"
done
for the case when that directory contains 100k files
(most of which are not ar archives).
See
https://bugzilla.redhat.com/show_bug.cgi?id=1691822
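The main cost in the original loop is that -exec file {} \; forks one process per file. A minimal sketch of the difference (using echo as a stand-in for file, in a throwaway directory; the filenames are made up for illustration) between per-file and batched invocation, which is what xargs buys us:

```shell
# Sketch: per-file vs batched invocation. `echo` stands in for `file`.
d=$(mktemp -d)
: > "$d/a"; : > "$d/b"; : > "$d/c"

# `-exec cmd {} \;` runs one process per file: three lines of output
find "$d" -type f -exec echo EACH {} \; | wc -l

# `-exec cmd {} +` batches the names into one invocation: one line
find "$d" -type f -exec echo BATCH {} + | wc -l

rm -rf "$d"
```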
My first stab at it is:
# Sometimes this is run on trees with 100 thousand files. Be efficient:
# xargs -r -P$NPROC -n16 file | sed 's/: */: /'
# is used to
# * xargs: avoid fork overhead per each file
# * -P$NPROC: parallelize
# * -n16: avoid running just a few "file LIST" cmds with huge LISTs
# * sed 's/: */: /': defeat "file F1 FILE2" columnar formatting:
# | F1: type1 <--- extra spaces after colon
# | FILE2: type2
NPROC=`nproc`
for f in `find "$RPM_BUILD_ROOT" -type f | \
grep -v "^${RPM_BUILD_ROOT}/\?usr/lib/debug" | \
xargs -r -P$NPROC -n16 file | sed 's/: */: /' | \
grep 'current ar archive' | \
sed -n -e 's/^\(.*\):[ ]*current ar archive/\1/p'`; do
$STRIP -g "$f"
done
Stress-testing of the
xargs -r -P$NPROC -n16 file | sed 's/: */: /'
construct, however, revealed that on sufficiently large machines the pipe
buffer gets filled and the "file" processes experience partial writes,
garbling the output. Specifically, this:
find /usr -print0 | xargs -0r -P199 -n16 file | sed 's/: */: /' | sort
does not produce the same output every time, and diff-ing it clearly shows
partial writes creating overlapping output.
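The interleaving is consistent with POSIX pipe semantics: only writes of at most PIPE_BUF bytes are guaranteed atomic on a pipe, and the output of "file" on 16 files can easily exceed that. A quick way to check the limit on a given system:

```shell
# POSIX guarantees PIPE_BUF >= 512 bytes; on Linux it is typically 4096.
# Writes larger than this may be split by the kernel, letting concurrent
# writers interleave mid-line.
getconf PIPE_BUF /
```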
Eventually I managed to achieve correct operation this way:
find /usr -print0 | xargs -0r -P199 -n16 sh -c 'file "$@" | dd bs=1M iflag=fullblock 2>/dev/null' ARG0 >$TEMPFILE
sh -c 'file "$@" | dd' ARG0
- this construct runs the small shell script, supplying filenames as $1, $2, $3...
  (ARG0 is necessary; otherwise the 1st filename would become $0, not $1)
- within this small shell script, nothing is parallelized: one "file" writes to one "dd".
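The role of the ARG0 placeholder can be seen directly (the argument strings here are arbitrary):

```shell
# sh -c assigns the first argument after the script to $0, and only the
# rest to $1, $2, ...
sh -c 'echo "0=$0 1=$1 2=$2"' ARG0 first second
# prints: 0=ARG0 1=first 2=second

# without the placeholder, the first "filename" would vanish into $0:
sh -c 'echo "0=$0 1=$1"' first second
# prints: 0=first 1=second
```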
"dd bs=1M iflag=fullblock" is a rather crude method to ensure all input is
read, grouped into one block, and written in one write() call. Without this,
"file" processes sometimes seem to do short writes. This works fine for
serially running processes, but when 199 "file"s write to the same fd, short
writes give writes the opportunity to interleave.
Tried using "sed" instead of dd, but apparently "sed" can also do short
writes.
>$TEMPFILE
- unlike a pipe, even very large writes (~1 megabyte) are never partial when
  writing to regular files.
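A crude check of this (only that the data survives intact, not that the kernel used a single write() internally) is to push one 1-megabyte block at a regular file and count what arrives:

```shell
# Sketch: a single 1 MiB dd write to a regular file arrives complete.
t=$(mktemp)
dd if=/dev/zero bs=1M count=1 of="$t" 2>/dev/null
wc -c < "$t"   # expect 1048576
rm -f "$t"
```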
"dd bs=1M iflag=fullblock 2>/dev/null" is ugly. There is no need for a
fixed-size buffer; it should grow as needed. A 1M buffer may well be enough
for _this_ case, but in other cases the size may be more variable, and
guessing it for every use case is ugly.
Tried "tail -n999999", but it does not write its output in one write().
I propose that we create a new tool which, in Unix tradition, does just one
thing: collects input until EOF, then writes it all out in one write().
/usr/bin/coalesce ?
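A hypothetical shell approximation of such a tool (the name "coalesce" and this implementation are only a sketch; a real tool would hold a growable in-memory buffer and issue a single write() at EOF, which a shell function cannot strictly guarantee):

```shell
# Hypothetical sketch of "coalesce": slurp all of stdin into a temp
# file, then emit it in one large dd pass. Only an approximation: the
# output write can still be split when stdout is a pipe.
coalesce() {
    t=$(mktemp) || return 1
    cat > "$t" &&
    dd if="$t" bs=16M 2>/dev/null   # assumption: 16M exceeds the input
    rm -f "$t"
}

printf 'hello\nworld\n' | coalesce
# prints the two input lines unchanged
```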