
Re: I/O spin loop


From: Anderson, Stuart B.
Subject: Re: I/O spin loop
Date: Thu, 10 Mar 2022 03:23:40 +0000

> On Mar 9, 2022, at 4:04 PM, Ole Tange <ole@tange.dk> wrote:
> 
> On Wed, Mar 9, 2022 at 7:21 PM Anderson, Stuart B. <sba@caltech.edu> wrote:
>>> On Mar 8, 2022, at 3:35 PM, Ole Tange <ole@tange.dk> wrote:
>>> On Tue, Mar 8, 2022 at 11:22 PM Anderson, Stuart B. <sba@caltech.edu> wrote:
>>>> parallel version 20190922 from EPEL 8 running on a Rocky Linux 8 system 
>>>> occasionally gets into an I/O spin loop writing 8193 bytes of "x" to a 
>>>> deleted TMPDIR file and then immediately truncating it, e.g.,
>>> This is by design:
>>> https://www.gnu.org/software/parallel/parallel_design.html#disk-full
>> 
>> Any reason not to call statvfs() to see if there is free disk space?
> 
> It is unclear to me if statvfs is supported on all supported
> platforms. You should feel free to examine this.

statvfs is available on FreeBSD, OpenBSD, and Linux. What other
platforms need to be checked?

> If it is not supported or if it requires installing more than the
> basic Perl package, you will need very convincing evidence.

I forgot parallel is a Perl app, but the statvfs() function from [g]libc,
which provides information about a mounted filesystem including its free
space, is available via https://metacpan.org/pod/Filesys::Statvfs

And if that is problematic for some reason, there must be a portable Perl
module that reports how much free space a filesystem has.
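
For what it's worth, a minimal sketch of such a check could look like the
following, assuming Filesys::Statvfs is installed (it is not in core Perl)
and using its documented list interface; tmpdir_free_bytes() is just an
illustrative name, not anything in parallel:

use Filesys::Statvfs;   # CPAN module, not in core Perl

# Hypothetical helper (not part of parallel): bytes available to an
# unprivileged user on the filesystem holding $dir, or undef on failure.
sub tmpdir_free_bytes {
    my ($dir) = @_;
    my ($bsize, $frsize, $blocks, $bfree, $bavail) = statvfs($dir)
        or return undef;
    return $frsize * $bavail;
}

my $free = tmpdir_free_bytes($ENV{TMPDIR} // "/tmp");
warn "TMPDIR is nearly full\n" if defined $free and $free < 8193;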

>>> GNU Parallel does an exponential back off:
>>> 
>>> https://www.gnu.org/software/parallel/parallel_design.html#exponentially-back-off
>> 
>> There seems to be a problem with this for some long running jobs,
>> perhaps limited to nested parallel (see below).
> 
> I have the feeling the problem is not related to the long running job,
> but that the nested parallels run short lived jobs.

I am not deliberately running any short lived jobs (see below).

> It will be helpful if you can follow
> https://www.gnu.org/software/parallel/man.html#reporting-bugs

Here is the version I am using, with more information below,
including the bash script that calls parallel.

[root@zfs1 ~]# parallel --version
GNU parallel 20190922

>>>> Any thoughts on how to avoid this seemingly useless spin loop with high 
>>>> CPU and I/O resource consumption?
>>> 
>>> Can you show that there really is an I/O consumption? Do the writes
>>> actually reach your disk?
>> 
>> I can confirm this is not throttled once per second
> 
> And no one suggested it was. It is using an exponential back off, so
> it will *at*least* run once per second, and *at*least* once per job.
> Immediately after starting a job it will run ~1 per ms, but it will
> exponentially back off, so if no new job is started after ~10 seconds,
> it will run around once per second.

OK, my observation is that for a long running job I am not seeing an
exponential back off.
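
For reference, the back-off you describe would look roughly like the
following; this is only my sketch from your description, not parallel's
actual code, and jobs_running(), housekeeping() and new_job_started() are
placeholder names:

my $delay = 0.001;                        # ~1 ms right after a job starts
while (jobs_running()) {
    housekeeping();                       # includes the disk-full probe
    select(undef, undef, undef, $delay);  # sub-second sleep
    $delay = 0.001 if new_job_started();  # reset whenever a job starts
    $delay *= 1.1;                        # back off exponentially...
    $delay = 1 if $delay > 1;             # ...capped at about once per second
}

With a growth factor around 1.1 the delay would take roughly 10 seconds to
climb from 1 ms to the 1 second cap, which matches the numbers you quote.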

>> and leads to maxing out the I/O on a fast NVMe device with very high CPU 
>> utilization.
> 
> The high CPU utilization can be explained by having short lived jobs.
> Expect 2-10 ms CPU time per job.

I am running long lived jobs (seconds to hours).

> Your documentation does not show if data is actually written to the
> NVMe. On my systems the IO stays in RAM: It never reaches the physical
> disk.

After watching iostat show my /tmp NVMe drive was busy due to I/O from
parallel itself, I switched to setting TMPDIR=/dev/shm to preserve the
NVMe IOPS for other applications.

> strace will not show this. But to document this, you can try this:
> 
>    seq 1000000 | parallel true &
>    # The utilization is expected to stay at 0%
>    # If iostat shows the disk of $TMPDIR going from 0% to 100%
> utilization, your point is proven
>    iostat -dkx 1
>    # If you do not have iostat, vmstat will give you some indication, too.
>    # The above should give no increase in 'bo'. If it increases
> significantly, your point is proven.
>    vmstat 1

That example shows a write() call every few ms,

18:36:20.584733 write(14, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
18:36:20.589859 read(23, "", 8192)      = 0
18:36:20.590513 write(14, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
18:36:20.595206 read(23, "", 8192)      = 0
18:36:20.595921 write(14, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
18:36:20.600849 read(23, "", 8192)      = 0
18:36:20.601578 write(14, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
18:36:20.606652 read(23, "", 8192)      = 0

iostat jumps from 1 w/s to ~400 once parallel starts. Note, for this
example /tmp is a Linux LVM mirror of nvme4n1 and nvme5n1:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s  %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme4n1          0.00    1.60      0.00      0.01     0.00     0.20   0.00 11.11    0.00    0.38   0.00     0.00     7.00   0.50   0.08
nvme5n1          0.00    1.60      0.00      0.01     0.00     0.20   0.00 11.11    0.00    0.00   0.00     0.00     7.00   0.37   0.06
nvme4n1          0.00    1.60      0.00      0.01     0.00     0.20   0.00 11.11    0.00    0.38   0.00     0.00     7.00   0.50   0.08
nvme5n1          0.00    1.60      0.00      0.01     0.00     0.20   0.00 11.11    0.00    0.12   0.00     0.00     7.00   0.37   0.06
nvme4n1          0.00    1.60      0.00      0.01     0.00     0.20   0.00 11.11    0.00    0.38   0.00     0.00     7.00   0.50   0.08
# Start parallel
nvme5n1          0.00  368.20      0.00      1.47     0.00     1.20   0.00  0.32    0.00    0.02   0.01     0.00     4.08   0.99  36.32
nvme4n1          0.00  368.20      0.00      1.47     0.00     1.20   0.00  0.32    0.00    0.02   0.01     0.00     4.08   0.99  36.30
nvme5n1          0.00  431.80      0.00      1.74     0.00     0.20   0.00  0.05    0.00    0.01   0.01     0.00     4.12   1.01  43.42
nvme4n1          0.00  431.80      0.00      1.74     0.00     0.20   0.00  0.05    0.00    0.01   0.01     0.00     4.12   1.00  43.30
nvme5n1          0.00  464.60      0.00      1.89     0.00     0.20   0.00  0.04    0.00    0.01   0.01     0.00     4.15   1.00  46.46

> It is expected that parallel in the above example will use 100% of a
> single CPU thread.
> 
>> The following example shows it happening twice per millisecond.
> 
> How on earth can 2*8KB*1000 (= 2000 iops and 16 MB/s) saturate your NVMe?
> 
> I still think you are only seeing CPU usage (which is expected due to
> short lived jobs), and you are not even seeing the 16 MB/s I/O on your
> NVMe.

Not sure, but I would prefer that checking for free disk space not generate
a continuous load of O(100-1000) write IOPS for hours on end.

>> [root@zfs1 ~]# ps auxww | grep 836133
>> root      836133 98.2  0.0  55488 17672 pts/6    R+   Mar08 1277:14 
>> /usr/bin/perl /usr/bin/parallel --delay 2 -P /tmp/parallel.zfs.backup3 
>> parallel -I// syncoid {} // ::: $LIST
> 
> How long does a single syncoid take?

10s of seconds to hours.

> How many are run in parallel?

20 initially, until parallel gets to the tail of the runtime distribution
and is managing just a few long-lived jobs.

> Twice per millisecond is expected immediately after a job is started,
> so if the syncoids are short lived, you should see this all the time. If,
> however, a syncoid is only started every 10 seconds, then you should
> see that behavior whenever a new job starts.

OK, something is going sideways with these nested parallel jobs, as I
am seeing continuous rapid calls to write() from parallel without any
new tasks being spawned:

[root@zfs1 ~]# timeout 10 strace -tt -p 836133 |& egrep -e 'fork|clone'
Terminated

Note, this is the same process I reported strace output from yesterday;
it is still running a few long-lived syncoid jobs but has not started any
new ones since then. The only system calls reported by strace are the
following:

18:50:22.462416 wait4(-1, 0x7ffc539e8a14, WNOHANG, NULL) = 0
18:50:22.462443 write(10, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
18:50:22.462474 write(10, "x", 1)       = 1
18:50:22.462502 ftruncate(10, 0)        = 0
18:50:22.462531 lseek(10, 0, SEEK_SET)  = 0
18:50:22.462556 lseek(10, 0, SEEK_CUR)  = 0
18:50:22.462582 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
18:50:22.462612 rt_sigaction(SIGCHLD, {sa_handler=0x7fe81f000dc0, sa_mask=[], 
sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, {sa_handler=SIG_DFL, 
sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, 8) = 0
18:50:22.462643 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
18:50:22.462672 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
18:50:22.462700 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
18:50:22.462729 rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], 
sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, {sa_handler=0x7fe81f000dc0, 
sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, 8) = 0
18:50:22.462759 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
18:50:22.462792 wait4(-1, 0x7ffc539e8a14, WNOHANG, NULL) = 0
18:50:22.462820 write(10, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
18:50:22.462851 write(10, "x", 1)       = 1
18:50:22.462880 ftruncate(10, 0)        = 0
18:50:22.462909 lseek(10, 0, SEEK_SET)  = 0
18:50:22.462935 lseek(10, 0, SEEK_CUR)  = 0
18:50:22.462961 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
18:50:22.462990 rt_sigaction(SIGCHLD, {sa_handler=0x7fe81f000dc0, sa_mask=[], 
sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, {sa_handler=SIG_DFL, 
sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, 8) = 0
18:50:22.463020 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
18:50:22.463050 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
18:50:22.463079 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
18:50:22.463108 rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], 
sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, {sa_handler=0x7fe81f000dc0, 
sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe81ed29c20}, 8) = 0
18:50:22.463138 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
18:50:22.463171 wait4(-1, 0x7ffc539e8a14, WNOHANG, NULL) = 0

Note, I suspect it is the calls to ftruncate() that are triggering a
metadata flush to the underlying filesystem that shows up under iostat.
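
As I read the strace, the pattern above (8192 bytes plus 1 byte of "x",
then ftruncate and lseek) is equivalent to a free-space probe along these
lines; this is only my sketch of the observed syscall pattern, not
parallel's actual code, and disk_is_full() is a made-up name:

use Fcntl qw(SEEK_SET);

sub disk_is_full {
    my ($fh) = @_;                            # the deleted TMPDIR temp file
    my $n  = syswrite($fh, "x" x 8192) // 0;  # write(fd, "xxx...", 8192)
    $n    += syswrite($fh, "x") // 0;         # write(fd, "x", 1)
    truncate($fh, 0);                         # ftruncate(fd, 0)
    sysseek($fh, 0, SEEK_SET);                # lseek(fd, 0, SEEK_SET)
    return $n < 8193;                         # short write => out of space
}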

Here is the full bash script I am using to launch parallel,

#!/bin/bash

PARALLEL=20                               # How many users to backup in parallel
# File used to adjust the value of PARALLEL after each user completes
PARALLELFILE=/tmp/parallel.$(basename $0)
export SYNCOID=/usr/sbin/syncoid          # Location of syncoid executable

#
# Run syncoid
#
# $1 user
# $2 remote backup system
#
syncoid () {
  declare -Ag DEST OPT
  DEST[LHO]="zfs.backup@some.remote.host:some/path"
  OPT[LHO]="--sendoptions=ce --compress=none"
  [[ $# -eq 0 ]] && return

  $SYNCOID --no-sync-snap --recursive --sshcipher=aes128-ctr ${OPT[$2]} \
    home1/$1 ${DEST[$2]}/$1 |& ts >> ~/syncoid/$2/$1
  retval=${PIPESTATUS[0]}
  echo -n $2:$1
  if [[ $retval -eq 0 ]]; then
    echo ""
  else
    echo " FAILED"
  fi
  return $retval
}
export -f syncoid

#
# Call the function once with no arguments to initialize the global associative
# arrays. Note, it is not possible to pass an associative array to parallel
# with --env, so convert the list of keys to a string.
#
syncoid
export LIST="${!DEST[@]}"

cd /home1
echo $PARALLEL > $PARALLELFILE
(echo pe.o3; ls -d * | grep -v '^[j-z]') | env TMPDIR=/dev/shm parallel --delay 2 \
  -P $PARALLELFILE 'parallel -I// syncoid {} // ::: $LIST' |& ts
exit ${PIPESTATUS[-2]}  # exit with the parallel exit value, not that of ts


Thanks for your help in tracking this down.


--
Stuart Anderson
sba@caltech.edu





