Re: found *yet* another performance issue on Gawk -M - comma formatting
From: arnold
Subject: Re: found *yet* another performance issue on Gawk -M - comma formatting
Date: Tue, 01 Feb 2022 02:20:18 -0700
User-agent: Heirloom mailx 12.5 7/5/10
Hello.
Thank you for the report. Please use the bug-gawk@gnu.org address for
any future reports.
And yes, this report was considerably clearer than your previous one.
I created a file with a number consisting of over 215 million random
digits. I ran this program:
	$ cat x.awk
	{
		printf("%'f\n", $1)
	}
on the file, after compiling gawk for profiling. Here are the
interesting results:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 37.50      0.72     0.72        2   360.01   360.01  mpg_strtoui
 22.40      1.15     0.43        1   430.01   430.01  def_parse_field
 20.83      1.55     0.40       31    12.90    12.90  rs1scan
 19.27      1.92     0.37        1   370.01   370.01  mpg_maybe_float
 ....
Almost all the time is spent in simply scanning the input for
various purposes before converting it to a GMP value. Actually
formatting the value doesn't take much time.
It is thus not surprising that treating the input as a string
and doing gsub on it is faster.
If you are doing serious work with numbers with millions of digits,
gawk is not the right tool to be using. You would be better off with
Python or R or some other tool that is specialized for that kind
of work.
Arnold
"Jason C. Kwan" <jasonckwan@yahoo.com> wrote:
> Not sure if those gnu folks care what I report anymore, but I just
> found out earlier that just using the
>
> %'.f , or %\047.f
>
> formatting string in printf( ), with gawk -Mbe , and a 275-million digit
> input, took 2 minutes 8.40 seconds
>
> The same input, using just a basic gsub( )-based approach in gawk -b,
> yielded the same correct answer in just 27.24 seconds
>
> if you’re interested in investigating, the 4 lines needed to replicate
> the comma formatting in standard gawk that I came up with are as follows :
>
>
> sub(/([0-9][0-9][0-9])+([.]|$)/, ",&")   # allocate initial mark, multiple of 3
> gsub(/[^,.][^,.][^,.]/, "&,")            # batch-process all 3-digit combos
> sub(/[,]+[.]+/, ".")                     # cleaning-up comma+period instance
> gsub(/^[^0-9]+|[^0-9]+$/, "")            # fail-safe cleanup at head and tail
> is this readable enough ?
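
For anyone who wants to try the four quoted statements, here is a minimal
sketch that wraps them in a complete program. The shell function name
group_digits and the awk function name commas are hypothetical, and the
digits-only guard is an added assumption, not part of the original report:
without it, the statements would also insert commas after a decimal point
and the final cleanup would strip a leading minus sign.

```shell
# Hypothetical wrapper (not part of gawk): prints its argument with
# thousands separators, using only string operations.
group_digits() {
    printf '%s\n' "$1" | awk '
    function commas(s) {
        # Assumption: only non-negative integer digit strings are grouped;
        # anything else is returned unchanged.
        if (s !~ /^[0-9]+$/)
            return s
        sub(/([0-9][0-9][0-9])+([.]|$)/, ",&", s)   # mark the start of the 3-digit groups
        gsub(/[^,.][^,.][^,.]/, "&,", s)            # add a comma after each group
        sub(/[,]+[.]+/, ".", s)                     # no-op for integers, kept from the original
        gsub(/^[^0-9]+|[^0-9]+$/, "", s)            # drop stray commas at head and tail
        return s
    }
    { print commas($0) }'
}
```

With this in place, `group_digits 1234567` prints 1,234,567. The program
never converts the input to a number, so it should behave the same with or
without -M.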