Re: Splitting input into jobs


From: Ole Tange
Subject: Re: Splitting input into jobs
Date: Sun, 25 Apr 2021 14:50:56 +0200

On Sat, Apr 3, 2021 at 5:10 AM Fredrick Brennan <copypaste@kittens.ph> wrote:

> You must drop --bar to use -m. Then the equal distribution of the arguments 
> will happen with e.g.:
>
> ⇒ find build/SVG_layers/ -iname '*.svg' | parallel -m 'inkscape 
> --batch-process --actions 
> "select-all:all;verb:StrokeToPath;select-all;verb:SelectionUnion;export-plain-svg;"
>  --export-overwrite'

This is not what is happening. It is easier to see what is happening here:

$ count() { echo $#; }
$ export -f count
$ seq 50000 | parallel -j4 -m count
23693
21842
1117
1117
1117
1114

$ seq 50000 | parallel -j4 -m --bar count
12772
0% 1:49999=0s 12773 12774 12775 12776 12777 12778 12779 12780 12781
12782 12783 12784 1278
10921
0% 2:49998=0s 12773 12774 12775 12776 12777 12778 12779 12780 12781
12782 12783 12784 1278
10921
0% 3:49997=0s 34615 34616 34617 34618 34619 34620 34621 34622 34623
34624 34625 34626 3462
10921
0% 4:49996=3h28m19s 34615 34616 34617 34618 34619 34620 34621 34622
34623 34624 34625 3462
1117
0% 5:3348=13m56s 45536 45537 45538 45539 45540 45541 45542 45543 45544
45545 45546 45547 4
1117
0% 6:3347=13m56s 46653 46654 46655 46656 46657 46658 46659 46660 46661
46662 46663 46664 4
1117
0% 7:3346=13m55s 47770 47771 47772 47773 47774 47775 47776 47777 47778
47779 47780 47781 4
1114
0% 8:3345=13m52s 48887 48888 48889 48890 48891 48892 48893 48894 48895
48896 48897 48898 4

So the distribution of arguments happens whether you use --bar or not.

Compare that to -n 10000:

$ seq 50000 | parallel -j4 --bar -n 10000 count
10000
20% 1:4=0s 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
10011 10012 10013 1
10000
40% 2:3=0s 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010
10011 10012 10013 1
10000
60% 3:2=0s 30001 30002 30003 30004 30005 30006 30007 30008 30009 30010
30011 30012 30013 3
10000
80% 4:1=0s 40001 40002 40003 40004 40005 40006 40007 40008 40009 40010
40011 40012 40013 4
10000
100% 5:0=0s 40001 40002 40003 40004 40005 40006 40007 40008 40009
40010 40011 40012 40013

Here --bar gets the percentage correct.

So why does --bar act so weird when used with -m?

--bar uses total_jobs(), which computes the total number of jobs. This
is easily computed correctly if you know that for every 10000
arguments there is a job.
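
As a small illustration of why the -n case is easy: with a fixed number of arguments per job, the job count is just a rounded-up division. The function name below is made up for this sketch and is not part of GNU parallel.

```shell
# Hypothetical helper (not a GNU parallel function):
# number of jobs when every job takes a fixed number of arguments, as with -n.
# usage: jobs_for_fixed_n TOTAL_ARGS ARGS_PER_JOB
jobs_for_fixed_n() {
  # Integer ceiling of TOTAL_ARGS / ARGS_PER_JOB
  echo $(( ($1 + $2 - 1) / $2 ))
}

jobs_for_fixed_n 50000 10000   # prints 5, matching the 5 jobs in the -n run above
```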

But -m tries to fit as many arguments as it can on a single command
line (until it reaches EOF, where it splits the last arguments evenly
across all jobslots). When estimating the number of jobs it is
pessimistic: it starts by assuming that only 1 argument fits on each
command line, so total_jobs() returns one job for each arg. This
explains the -m --bar example above, where it pessimistically thinks
there will be 50000 jobs:

0% 1:49999 ...

total_jobs() is quite expensive to run, so the result is cached until
it is declared invalid. This explains these lines from above:

0% 2:49998 ...
0% 3:49997 ...
0% 4:49996 ...

The result is declared invalid when EOF is hit. Here the value is recomputed:

0% 5:3348 ...

After this the new value is used:

0% 6:3347 ...
0% 7:3346 ...
0% 8:3345 ...
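
The sequence above can be reproduced with a toy model. This is only a sketch and not GNU parallel's actual code. It assumes that the cached total starts at one job per argument, and that the recomputation at job 5 again counts every still-unread argument as one job. That second assumption is a guess that happens to fit the numbers. The per-job argument counts are copied from the -m --bar run above.

```shell
# Toy model of a cached, pessimistic job estimate (not GNU parallel's code).
# Assumption: every unread argument is counted as one future job.
simulate_bar() {
  unread=50000      # arguments not yet put on a command line
  started=0         # jobs started so far
  cached=50000      # cached total: pessimistically one job per argument
  # Arguments per job, copied from the -m --bar run above
  for size in 12772 10921 10921 10921 1117 1117 1117 1114; do
    started=$((started + 1))
    unread=$((unread - size))
    if [ "$started" -eq 5 ]; then
      # Assumption: EOF invalidates the cache while job 5 is being built
      cached=$((started + unread))
    fi
    echo "$started:$((cached - started))"   # job number : jobs believed left
  done
}

simulate_bar
```

Under these assumptions it prints 1:49999 through 4:49996, then 5:3348 through 8:3345, the same pairs as in the --bar output.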

This is clearly not optimal. The root of the problem is that you
cannot generate all commands in advance.

One obvious reason for this is:

  parallel -m echo '{= if(total_jobs() == 10) { $_="x"x10000 } =}'

You cannot generate that command before you know total_jobs(), so we
need a way to compute total_jobs() without generating that command.

A way forward might be to adjust total_jobs() regularly when using -m,
and to recompute the value based on the average number of arguments
read per command so far. Possibly whenever --bar requests
total_jobs().
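
A rough sketch of what such an estimate could look like, written as a standalone shell function rather than as a change to GNU parallel. The function name and its three parameters are hypothetical, and it assumes the number of remaining arguments is known.

```shell
# Hypothetical helper (not a GNU parallel function): estimate the total
# number of jobs from the average number of arguments per job so far.
# usage: estimate_total_jobs JOBS_STARTED ARGS_USED ARGS_LEFT
estimate_total_jobs() {
  est_started=$1 est_used=$2 est_left=$3
  if [ "$est_started" -eq 0 ] || [ "$est_used" -eq 0 ]; then
    # Nothing observed yet: fall back to the pessimistic one job per argument
    echo "$est_left"
    return
  fi
  # ceil(left / (used / started)), computed as ceil(left * started / used)
  est_more=$(( (est_left * est_started + est_used - 1) / est_used ))
  echo $(( est_started + est_more ))
}

estimate_total_jobs 1 12772 37228   # after the first job of the -m --bar run
estimate_total_jobs 4 45535 4465    # after the fourth job
```

With the numbers from the -m --bar run this prints 4 and then 5. The run actually ended with 8 jobs, because the last arguments are split evenly at EOF, so an estimate like this would still need a correction at EOF.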

But it seems like a lot of work for a situation that works correctly
if you use -n 10000. You just have to guess a reasonable value for -n.

But maybe --bar should give a warning when used with -m/-X saying it
will be misleading and you can use -n to avoid this?


/Ole


