From: Ed Morton
Subject: Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
Date: Thu, 29 Feb 2024 08:44:01 -0600
User-agent: Mozilla Thunderbird
No redirection:

    $ time { sed -i -n 'p' file; }
    real    0m0.027s
    user    0m0.000s
    sys     0m0.000s

Redirection:

    $ time { sed -i -n 'p' file >/dev/null; }
    real    0m0.023s
    user    0m0.000s
    sys     0m0.000s

The SE answer I linked, https://unix.stackexchange.com/a/771263/133219, shows strace being used on gawk with a 10-line input file and there being 10 writes (the same as the number of input lines) when used without redirection (look at the "calls" column below):
    $ strace -e trace=write -c gawk -i inplace 1 somefile
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000098           9        10           write
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000098           9        10           total
vs 1 write when used with redirection:
    $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000020          20         1           write
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000020          20         1           total
so buffering does seem likely to be the source of the time difference.

Regards,

    Ed.

On 2/29/2024 8:32 AM, david kerns wrote:
> glad you checked that... have you tried other commands? ... perhaps the
> closing of stdout by the shell before the fork/exec is causing it.
>
> On Thu, Feb 29, 2024 at 6:57 AM Ed Morton <mortoneccc@comcast.net> wrote:
>
>> David - that was 3rd-run timing to ensure caching wasn't the issue.
>>
>>     Ed.
>>
>> On 2/29/2024 7:35 AM, david kerns wrote:
>>
>>> swap the order (do the redirect one first) I suspect the input file
>>> was still cached for the 2nd run
>>>
>>> On Thu, Feb 29, 2024 at 5:52 AM Ed Morton <mortoneccc@comcast.net> wrote:
>>>
>>>> Someone on StackExchange was asking about their gawk script being
>>>> slow and someone else (https://unix.stackexchange.com/a/771263/133219)
>>>> pointed out that using `-i inplace` is an order of magnitude slower
>>>> if you don't also redirect stdout which seems unintuitive at best.
>>>>
>>>> For example given a 1 million line input file created by:
>>>>
>>>>     $ seq 1000000 > file1m
>>>>
>>>> and using:
>>>>
>>>>     $ awk --version
>>>>     GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)
>>>>
>>>> If we just reproduce it as-is using `-i inplace` the timing is:
>>>>
>>>>     $ time { awk -i inplace '1' file1m; }
>>>>     real    0m2.544s
>>>>     user    0m0.265s
>>>>     sys     0m1.843s
>>>>
>>>> whereas if we redirect stdout even though there is no stdout produced:
>>>>
>>>>     $ time { awk -i inplace '1' file1m >/dev/null; }
>>>>     real    0m0.236s
>>>>     user    0m0.187s
>>>>     sys     0m0.000s
>>>>
>>>> As you can see that second execution with stdout redirected ran an
>>>> order of magnitude faster.
>>>>
>>>> The person who investigated thinks it's due to the first execution
>>>> being considered "interactive" since stdout isn't technically being
>>>> redirected and so doing line buffering vs the second execution being
>>>> "non-interactive" due to stdout being redirected and so using a
>>>> larger buffer.
>>>>
>>>> If that is the case, could gawk be updated to consider "inplace"
>>>> editing as non-interactive? If not, I think it'd be worth a statement
>>>> in the manual about this difference in performance between the 2.
>>>>
>>>>     Ed.
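
For anyone who wants to see the write-count pattern from the strace output without gawk or strace at hand, here is a minimal Python sketch of the line-buffered vs block-buffered hypothesis discussed in this thread. It is only a simulation built on Python's own `io` layers: the names `CountingRaw` and `count_raw_writes` are made up for illustration, and nothing here reflects how gawk actually implements its output buffering.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical stand-in for a file descriptor: counts write() calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, data):
        # Each call here plays the role of one write(2) syscall.
        self.calls += 1
        return len(data)


def count_raw_writes(lines, line_buffering):
    """Write all lines through a buffered text stream and return the
    number of low-level writes that reached the raw layer."""
    raw = CountingRaw()
    buffered = io.BufferedWriter(raw, buffer_size=8192)
    text = io.TextIOWrapper(
        buffered,
        encoding="utf-8",
        newline="\n",
        line_buffering=line_buffering,
    )
    for line in lines:
        text.write(line)
    text.flush()
    return raw.calls


if __name__ == "__main__":
    # Same shape as the 10-line input file used in the strace runs.
    lines = [f"{i}\n" for i in range(1, 11)]
    print("line-buffered :", count_raw_writes(lines, line_buffering=True))
    print("block-buffered:", count_raw_writes(lines, line_buffering=False))
```

With line buffering every newline forces a flush, so the number of low-level writes tracks the number of lines; with block buffering the small output is held until the final flush. That is the same 10-vs-1 shape as the "calls" column above, which is why the per-line write cost would add up on a 1 million line file.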