bug-gawk


From: Ed Morton
Subject: Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
Date: Thu, 29 Feb 2024 08:44:01 -0600
User-agent: Mozilla Thunderbird

Yes, I tried the same with `sed` and there was no performance difference between:

No redirection:

   $ time { sed -i -n 'p' file; }

   real    0m0.027s
   user    0m0.000s
   sys     0m0.000s

Redirection:

   $ time { sed -i -n 'p' file >/dev/null; }

   real    0m0.023s
   user    0m0.000s
   sys     0m0.000s

The SE answer I linked, https://unix.stackexchange.com/a/771263/133219, shows strace being used on gawk with a 10-line input file. There are 10 writes (the same as the number of input lines) when used without redirection (look at the "calls" column below):

   $ strace -e trace=write -c gawk -i inplace 1 somefile
   % time     seconds  usecs/call     calls    errors syscall
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000098           9        10           write
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000098           9        10           total

vs 1 write when used with redirection:

   $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
   % time     seconds  usecs/call     calls    errors syscall
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000020          20         1           write
   ------ ----------- ----------- --------- --------- ----------------
   100.00    0.000020          20         1           total

so buffering does seem likely to be the source of the time difference.

Regards,

    Ed.

On 2/29/2024 8:32 AM, david kerns wrote:
glad you checked that...
have you tried other commands? ... perhaps the closing of stdout by the
shell before the fork/exec is causing it.

On Thu, Feb 29, 2024 at 6:57 AM Ed Morton <mortoneccc@comcast.net> wrote:

David - that was 3rd-run timing to ensure caching wasn't the issue.

     Ed.

On 2/29/2024 7:35 AM, david kerns wrote:

Swap the order (do the redirected one first). I suspect the input file was
still cached for the 2nd run.


On Thu, Feb 29, 2024 at 5:52 AM Ed Morton <mortoneccc@comcast.net> wrote:


Someone on StackExchange was asking about their gawk script being slow,
and someone else (https://unix.stackexchange.com/a/771263/133219)
pointed out that using `-i inplace` is an order of magnitude slower if
you don't also redirect stdout, which seems unintuitive at best.

For example given a 1 million line input file created by:

     $ seq 1000000 > file1m

and using:

     $ awk --version
     GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)

If we just reproduce it as-is using `-i inplace` the timing is:

     $ time { awk -i inplace '1' file1m; }

     real    0m2.544s
     user    0m0.265s
     sys     0m1.843s

whereas if we redirect stdout even though there is no stdout produced:

     $ time { awk -i inplace '1' file1m >/dev/null; }

     real    0m0.236s
     user    0m0.187s
     sys     0m0.000s

As you can see, the second execution, with stdout redirected, ran an order
of magnitude faster. The person who investigated thinks this is because the
first execution is considered "interactive" (stdout isn't technically being
redirected) and so uses line buffering, whereas the second execution is
considered "non-interactive" (stdout is redirected) and so uses a larger
buffer.
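
To make the "interactive" vs "non-interactive" distinction concrete, here is a minimal sketch of the kind of check a program can make on its stdout. The function name `stdout_mode` is made up for illustration, and whether gawk decides its buffering this way is only the hypothesis discussed above, not something confirmed here:

```shell
#!/bin/sh
# Hypothetical helper (not part of gawk): report how a typical
# stdio-based program would likely classify its standard output.
stdout_mode() {
    if [ -t 1 ]; then
        # fd 1 is a terminal, so output is usually flushed per line
        echo "interactive (line-buffered)"
    else
        # fd 1 is a file, pipe, or /dev/null, so a larger buffer is usual
        echo "non-interactive (fully buffered)"
    fi
}

stdout_mode
```

Run directly in a terminal, this should report the interactive case; run as `stdout_mode | cat`, it reports the non-interactive case, because fd 1 is then a pipe rather than a terminal.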

If that is the case, could gawk be updated to consider "inplace" editing
as non-interactive? If not, I think it'd be worth a statement in the
manual about this difference in performance between the two.

      Ed.







