Re: New "make benchmark" target
From: Andrea Corallo
Subject: Re: New "make benchmark" target
Date: Mon, 06 Jan 2025 13:41:55 -0500
User-agent: Gnus/5.13 (Gnus v5.13)
Eli Zaretskii <eliz@gnu.org> writes:
>> From: Andrea Corallo <acorallo@gnu.org>
>> Cc: Eli Zaretskii <eliz@gnu.org>, stefankangas@gmail.com,
>> mattiase@acm.org, eggert@cs.ucla.edu, emacs-devel@gnu.org
>> Date: Mon, 06 Jan 2025 06:23:22 -0500
>>
>> Pip Cet <pipcet@protonmail.com> writes:
>>
>> > In particular, as you (Andrea) correctly pointed out, it is sometimes
>> > appropriate to use an average run time (or, non-equivalently, an average
>> > speed) for reporting test results; the assumptions needed for this are
>> > very significant and need to be spelled out explicitly. The vast
>> > majority of "make benchmark" uses which I think should happen cannot
>> > meet these stringent requirements.
>> >
>> > To put things simply, it is better to discard outliers (test runs which
>> > take significantly longer than the rest). Averaging doesn't do that: it
>> > simply ruins your entire test run if there is a significant outlier.
>> > IOW, running the benchmarks with a large repetition count is very likely
>> > to result in useful data being discarded, and a useless result.
>>
>> As mentioned, I disagree with putting logic in place to arbitrarily
>> decide which values are worth considering and which should be
>> discarded.  If a system produces noisy measurements, that has to be
>> reported as the error of the measurement.  Those numbers are there
>> for some real reason and have to be accounted for.
>
> Without too deep understanding of the underlying issue: IME, if some
> sample can include outliers, it is always better to use robust
> estimators, rather than attempt to detect and discard outliers.
> That's because detection of outliers can decide that a valid
> measurement is an outlier, and then the estimation becomes biased.
100% agreed
> In practical terms, for estimating the mean, I can suggest to use the
> sample median instead of the sample average. The median is very
> robust to outliers, and only slightly less efficient (i.e., converges
> a bit slower) than the sample average.
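A minimal sketch of that point in Emacs Lisp (the helper names are
hypothetical, not part of elisp-benchmarks): a single run on a busy
machine drags the sample average far off, while the median barely
moves.

  (require 'cl-lib)

  (defun my-sample-mean (times)
    "Return the arithmetic mean of TIMES, a list of numbers."
    (/ (apply #'+ times) (float (length times))))

  (defun my-sample-median (times)
    "Return the median of TIMES, a list of numbers."
    (let* ((sorted (sort (copy-sequence times) #'<))
           (n (length sorted)))
      (if (cl-oddp n)
          (nth (/ n 2) sorted)
        (/ (+ (nth (1- (/ n 2)) sorted) (nth (/ n 2) sorted)) 2.0))))

  ;; Five runs of ~1s plus one outlier where the machine was busy:
  (my-sample-mean   '(1.01 0.99 1.02 0.98 1.00 9.50)) ; => ~2.42
  (my-sample-median '(1.01 0.99 1.02 0.98 1.00 9.50)) ; => ~1.005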
From my experience, benchmarks typically use the geometric mean;
there's quite some information around on why that is, e.g. [1].  The
use of the arithmetic mean in elisp-benchmarks is an error of youth
(which I'm responsible for), and I think it should be fixed.
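For illustration, a sketch of the geometric mean over per-benchmark
run-time ratios (again a hypothetical helper, not existing
elisp-benchmarks code): unlike the arithmetic mean, swapping the
baseline build exactly inverts the summary, which is the property [1]
argues for.

  (defun my-geo-mean (ratios)
    "Return the geometric mean of RATIOS, a list of positive numbers."
    (expt (apply #'* ratios) (/ 1.0 (length ratios))))

  ;; Run-time ratios of build B relative to build A on two benchmarks:
  (my-geo-mean '(0.5 4.0))  ; => ~1.414, B is ~1.41x slower overall
  ;; With build B as the baseline the ratios invert, and so does the
  ;; summary (1/1.414 ~= 0.707).  The arithmetic mean would report
  ;; 2.25 one way and 1.125 the other, two inconsistent answers.
  (my-geo-mean '(2.0 0.25)) ; => ~0.707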
Andrea
[1] <https://dl.acm.org/doi/pdf/10.1145/5666.5673>
- Re: New "make benchmark" target, (continued)
- Re: New "make benchmark" target, Andrea Corallo, 2025/01/17
- Re: New "make benchmark" target, Pip Cet, 2025/01/17
- Re: New "make benchmark" target, Andrea Corallo, 2025/01/17
- Re: New "make benchmark" target, Pip Cet, 2025/01/17
- Re: New "make benchmark" target, Andrea Corallo, 2025/01/18
- Re: New "make benchmark" target, Pip Cet, 2025/01/18
- Re: New "make benchmark" target, Andrea Corallo, 2025/01/18
- Re: New "make benchmark" target, Pip Cet, 2025/01/19
Re: New "make benchmark" target, Andrea Corallo, 2025/01/06