From: William Bader
Subject: Re: best options to try to avoid bogging down user's systems with parallel builds?
Date: Tue, 12 Dec 2023 01:57:22 +0000

Some Linux systems have ionice for I/O. At least for a reasonable number of processes, nice and ionice let jobs use the full resources of the server when it is otherwise idle, without hurting interactive response for users. If the builds use make -j or ninja or something similar that parallelizes the steps of the build, is there much advantage to running several builds at once? It would need multiple copies of the git repositories, and include files and object files might get pushed out of disk cache or ccache by the competing builds.
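For example, something like this keeps a single build mostly out of the way (ionice's idle class, -c 3, only takes effect on schedulers that support it, such as CFQ/BFQ):

    # Lowest CPU priority plus idle I/O class for the whole build.
    nice -n 19 ionice -c 3 make -j"$(nproc)"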
Regards, William


From: parallel-bounces+williambader=hotmail.com@gnu.org <parallel-bounces+williambader=hotmail.com@gnu.org> on behalf of Ole Tange <ole@tange.dk>
Sent: Monday, December 11, 2023 4:27 PM
To: Britton Kerin <britton.kerin@gmail.com>
Cc: parallel@gnu.org <parallel@gnu.org>
Subject: Re: best options to try to avoid bogging down user's systems with parallel builds?
 
On Sat, Dec 2, 2023 at 9:53 PM Britton Kerin <britton.kerin@gmail.com> wrote:
>
> I just made a script to pre-build all git revisions in a git-bisect,
> so the actual bisection testing that requires human attention can be
> done without being interrupted by a build each time.

Yeah, CPU-time is way cheaper than human time.

> I'd like to avoid making user system go unresponsive.

That is a good idea: Use unused capacity.

> Builds require
> really different amounts of memory etc., so it's hard to guess what's
> reasonable here.  I'm considering this:
>
> parallel --nice=17 --load=80%
>
> I'd also like to address memory use, but I'm not sure how best to do it.
>
> Does --memsuspend accept a percent or not?

It does not.

> It seems like something
> like "don't start any more jobs and suspend some at 80% memory used"
> might be a reasonable guess but I don't know how to specify it.  I'm
> reluctant to use fixed quantities of memory.

Currently that is your only option.

> I'd like to avoid --memfree because killing jobs might produce
> confusion about build results, and restarting them automatically might
> not work.

Yeah, --memsuspend sounds like the right thing.

> Any advice appreciated.

Can you check out one revision, run '\time -v make', find "Maximum
resident set size (kbytes):" and use that for --memsuspend?
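Something like this (the revision is a placeholder; build-one-revision.sh
stands for whatever builds a single revision):

    # Build one revision and record the peak memory of the build.
    git checkout some-revision
    \time -v make 2>&1 | grep 'Maximum resident set size'
    # If that reports, say, 2000000 kbytes, then:
    #     parallel --memsuspend 2G ./build-one-revision.sh {} ::: rev1 rev2 rev3
    # starts suspending jobs when free RAM drops below about twice that size.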

Your request has made me think, because it is a good, relevant and
practical question.

What are good values for running jobs in parallel in the background
while keeping the system responsive, if all you know is that a single
job will complete without bogging down the current system?

And I must admit: I honestly do not know.

There are at least 3 limits to take into consideration:

* RAM
* Disk I/O
* CPU use

It is my experience that --nice 19 deals OK with CPU. Only if the jobs
each spawn a ridiculous number of processes is this not enough. And
then --load 100% might help; at least to avoid starting new jobs. I
think it ought to be easy to make a test case that confirms this.
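For this use case that would look something like (placeholders as above):

    # Lowest CPU priority; hold off starting new jobs while the
    # load average is at or above the number of CPU cores.
    parallel --nice 19 --load 100% ./build-one-revision.sh {} ::: rev1 rev2 rev3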

--limit "io 50" will not spawn new jobs if the disk I/O that
particular second is > 50%. But if jobs are I/O intensive at times and
I/O idle at other times, then this might not be good enough. --limit
should probably have an option to suspend instead of kill, like
--memsuspend. Suspending the youngest jobs while the oldest are
doing heavy I/O seems like a reasonable thing to do. We could probably
just split --limit into: --killlimit (which is the current behaviour)
and --suspendlimit (which would suspend instead of kill).
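With the current behaviour that would be:

    # Do not start new jobs while disk I/O in the last second is above 50%.
    parallel --nice 19 --limit "io 50" ./build-one-revision.sh {} ::: rev1 rev2 rev3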

For RAM, keeping 20% of the total (physical) RAM free is probably
enough. Currently that cannot be computed automatically, but that
seems to be a relatively small change. And I would imagine
"--memsuspend 10%" would be reasonable for many jobs (--memsuspend
will then start suspending jobs when reaching 20% free RAM, and
suspend all but one at 10% free RAM). Maybe it would make sense to do
"--memsuspend 10% --memfree 5%": if free memory goes below 5%, kill
off the youngest jobs and retry them later.
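Until a percentage is supported, something like this might work
(assuming --memsuspend takes the same size postfixes as --memfree):

    # 10% of physical RAM in kbytes, read from /proc/meminfo.
    ram10="$(awk '/^MemTotal:/ {printf "%d", $2 / 10}' /proc/meminfo)"
    parallel --memsuspend "${ram10}k" ./build-one-revision.sh {} ::: rev1 rev2 rev3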

If you have good test cases to run on, please provide them.

But it might be time to give the limit infrastructure an overhaul.

Here is what I have been thinking (see also
https://savannah.gnu.org/bugs/?64992):

The idea here is to have 4 limits: CPU(load), mem, IO, and net.

--limit-cpu
--limit-io
--limit-mem
--limit-net

They will take 5 values: a:b:c:d:e

Normally the values will be percentages, such as: --limit-mem
60%:70%:75%:80%:95%

When the measurement reaches a certain percentage, it triggers an action:

When a is reached: stop spawning new jobs
When b is reached: start suspending jobs
When c is reached: all but one job suspended
When d is reached: start killing jobs
When e is reached: all but one job killed

And in reverse: If the measurement is lower than a limit, jobs will be
restarted/resumed/spawned.

I believe these "escalation steps" will make sense for many cases. For
some cases (such as IO) killing might not be needed, and then you can
simply set d:e to 101%:101% and thus never trigger the killing.
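For example, to get the escalation for I/O without ever killing
(hypothetical syntax from the proposal above, not implemented yet):

    # Stop spawning at 50%, start suspending at 60%, suspend all but
    # one at 70%, and never kill (d and e are set above 100%).
    parallel --limit-io 50%:60%:70%:101%:101% ./build-one-revision.sh {} ::: rev1 rev2 rev3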

I welcome your thoughts on this.


/Ole

