From: Andrea Corallo
Subject: Re: New "make benchmark" target
Date: Mon, 30 Dec 2024 13:26:28 -0500
User-agent: Gnus/5.13 (Gnus v5.13)

Pip Cet <pipcet@protonmail.com> writes:

> "Eli Zaretskii" <eliz@gnu.org> writes:
>
> Top-posted TL;DR: let's call Andrea's code "make elisp-benchmarks" and
> include it now?  That would preserve the Git history and importantly (to
> me) reserve the name for now.
>
>>> Date: Mon, 30 Dec 2024 15:49:30 +0000
>>> From: Pip Cet <pipcet@protonmail.com>
>>> Cc: acorallo@gnu.org, stefankangas@gmail.com, mattiase@acm.org, 
>>> eggert@cs.ucla.edu, emacs-devel@gnu.org, joaotavora@gmail.com
>>>
>>> >> https://lists.gnu.org/archive/html/emacs-devel/2024-12/msg00595.html
>>> >
>>> > Thanks, but AFAICT this just says that you intended to use/extend ERT
>>> > to run this benchmark suite, but doesn't explain why you think using
>>> > ERT would be an advantage worthy of keeping.
>>>
>>> I think some advantages are stated in that email: the ERT tagging
>>> mechanism is more general, works, and can be extended (I describe one
>>> such extension).  All that isn't currently true for elisp-benchmarks.
>>
>> Unlike the rest of the test suite, where we need a way to be able to
>> run individual tests, a benchmark suite is much more likely to be run
>> as a whole, because benchmarking a single kind of job in Emacs is much
>> less useful than producing a benchmark of a representative sample of
>> jobs.  So I'm not sure this particular aspect is such a serious
>
> Not my experience.  Running the entire suite is much more likely not to
> produce usable data, due to issues such as CPU thermal management.  For
> example: the first few tests run at full clock speed and heat up the
> system so much that thermal throttling kicks in; the next few tests
> run at a reduced rate while the fan is running; eventually we exceed
> the current we are allowed to draw from the battery and the clock speed
> drops even further; this lowers the temperature, so the fan slows down,
> which means the CPU will eventually try a higher clock speed again,
> which works for a while before the cycle repeats.  The whole thing will
> appear regular enough that we won't notice the data is bad, but it will
> be, until we rerun the test on the same system in a different room and
> get wildly different results.  A single-second test run in a loop
> produces the occasional mid-stream result which is actually useful (and
> is promptly lost to the averaging mechanism of elisp-benchmarks).

Yes, elisp-benchmarks runs all the selected benchmarks at each
iteration, so that no single one can take advantage of the initial
cool CPU state.  If unstable throttling on a specific system is a
problem, it will show up in the computed error for that test.  If a
system is throttling, the right (and only) thing to do is to measure
it; in my experience that is what benchmarks do.
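
Roughly, the interleaving works like this (a minimal sketch, not the
actual elisp-benchmarks code; `my/run-interleaved' and the use of
`benchmark-run' are illustrative assumptions):

  ;; Run each benchmark once per outer iteration, so thermal drift is
  ;; spread across all of them instead of hitting whichever test
  ;; happens to run first.  BENCHMARKS is a list of function symbols.
  (defun my/run-interleaved (benchmarks iterations)
    "Call each of BENCHMARKS once per iteration; return a times alist."
    (let ((times (mapcar #'list benchmarks)))
      (dotimes (_ iterations)
        (dolist (b benchmarks)
          ;; benchmark-run returns (ELAPSED GC-COUNT GC-ELAPSED);
          ;; keep only the wall-clock time.
          (push (car (benchmark-run 1 (funcall b)))
                (cdr (assq b times)))))
      times))

The dispersion of each benchmark's per-iteration samples is then what
the reported error can be computed from.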

That said, Eli is right: the typical use of a benchmark suite is to
run it as a whole and look at the total results, which indeed averages
throttling out as well.

> Benchmarking is hard, and I wouldn't have provided this very verbose
> example if I hadn't seen "paradoxical" results that can only be
> explained by such mechanisms.  We need to move away from average run
> times either way, and that requires code changes.

I'm not sure I understand what you mean.  If we prefer something like
the geometric mean in elisp-benchmarks, we can change to that; it
should be easy.
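
For instance, a geometric mean over the per-iteration run times could
look like this (a hypothetical helper, not current elisp-benchmarks
code):

  ;; Geometric mean of a list of positive run times: exp of the
  ;; arithmetic mean of the logs.
  (defun my/geometric-mean (times)
    "Return the geometric mean of TIMES, a list of positive numbers."
    (exp (/ (apply #'+ (mapcar #'log times))
            (float (length times)))))

  ;; (my/geometric-mean '(1.0 2.0 4.0)) => 2.0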
I'm open to patches to elisp-benchmarks (and to its hypothetical copy
in emacs-core).  My opinion is that something in it can potentially be
improved (why not?), but at the moment I personally don't understand
the need for ERT.
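
For reference, the ERT tagging Pip refers to would look roughly like
this (a sketch; the :benchmark tag and the toy workload are my
assumptions, not an existing convention):

  (require 'ert)

  (defun my/fib (n)
    "Naive Fibonacci, used here only as a toy workload."
    (if (< n 2) n (+ (my/fib (- n 1)) (my/fib (- n 2)))))

  (ert-deftest my/bench-fib ()
    :tags '(:benchmark)               ; hypothetical tag
    (should (= (my/fib 10) 55)))

  ;; Run only the tagged tests:
  ;; (ert-run-tests-batch '(tag :benchmark))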


