bug-gawk

From: arnold
Subject: Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
Date: Thu, 29 Feb 2024 10:01:30 -0700
User-agent: Heirloom mailx 12.5 7/5/10

Adding a setvbuf() call didn't work.

Using freopen() instead of playing games with the file descriptor
might work, but it's also a hassle.

Andy - can you look at this? Try this in inplace_begin:
- dup stdout to a new fd and save it for now
- do freopen on stdout to the file; this should set block buffering

In inplace_end, do freopen of /dev/fd/NNN where NNN is the duped
fd from the original stdout.
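
The steps above can be sketched in C roughly as follows. This is only an
illustration of the dup/freopen idea, not the extension's actual code: the
names redirect_stdout_begin, redirect_stdout_end and saved_stdout_fd are
hypothetical stand-ins for whatever inplace_begin and inplace_end would use.
It also assumes that /dev/fd is available, and it reopens the saved
descriptor in append mode on the assumption that the original stdout should
not be truncated if it happens to be a regular file.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical: holds the duplicate of the original stdout descriptor. */
static int saved_stdout_fd = -1;

/* Sketch of the inplace_begin side: point stdout at `path`. */
static int redirect_stdout_begin(const char *path)
{
    fflush(stdout);
    saved_stdout_fd = dup(STDOUT_FILENO);   /* keep the original stdout alive */
    if (saved_stdout_fd < 0)
        return -1;
    /* freopen() closes the old stream and opens a fresh one on the file,
     * so stdio picks its buffering mode again for the new target. */
    if (freopen(path, "w", stdout) == NULL) {
        close(saved_stdout_fd);
        saved_stdout_fd = -1;
        return -1;
    }
    return 0;
}

/* Sketch of the inplace_end side: reattach stdout to the saved descriptor. */
static int redirect_stdout_end(void)
{
    char name[32];

    assert(saved_stdout_fd >= 0);
    snprintf(name, sizeof name, "/dev/fd/%d", saved_stdout_fd);
    /* Assumption: "a" rather than "w", to avoid truncating the original. */
    if (freopen(name, "a", stdout) == NULL)
        return -1;
    close(saved_stdout_fd);                 /* stdout now has its own descriptor */
    saved_stdout_fd = -1;
    return 0;
}
```

A caller would invoke redirect_stdout_begin() with the temporary file's name
before processing an input file and redirect_stdout_end() afterwards. Note
that if either freopen() call fails, stdout is left closed, so real code
would need a recovery path that this sketch omits.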

Thanks,

Arnold

arnold@skeeve.com wrote:

> I looked at this briefly. It makes sense as to why it's working
> the way it currently does.  The extension temporarily replaces standard
> output's file descriptor with one on a new temporary file.
> But the rest of the stdio.h mechanics are left alone. So if stdout
> was initially a tty, it remains line buffered; if it was a file,
> it's block buffered.
>
> According to the setvbuf(3) man page, setvbuf() can only be called
> before any I/O operations are done on a FILE *.  I'm not sure if
> it's safe to do so in the extension, but maybe it is.
>
> I will poke at it a little bit; it's not a cut-and-dried easy fix.
>
> Arnold
>
> Ed Morton <mortoneccc@comcast.net> wrote:
>
> > No problem. Trying again to post the strace output as it got mangled by 
> > something in transit last time:
> >
> > The SE answer I linked, https://unix.stackexchange.com/a/771263/133219, 
> > shows strace being used on gawk with a 10-line input file and there 
> > being 10 writes (same as number of input lines) when used without 
> > redirection (look at the "calls" column below):
> >
> >     $ strace -e trace=write -c gawk -i inplace 1 somefile
> >     % time     seconds  usecs/call     calls    errors syscall
> >     ------ ----------- ----------- --------- --------- ----------------
> >     100.00    0.000098           9        10           write
> >     ------ ----------- ----------- --------- --------- ----------------
> >     100.00    0.000098           9        10           total
> >
> > vs 1 write when used with redirection:
> >
> >     $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
> >     % time     seconds  usecs/call     calls    errors syscall
> >     ------ ----------- ----------- --------- --------- ----------------
> >     100.00    0.000020          20         1           write
> >     ------ ----------- ----------- --------- --------- ----------------
> >     100.00    0.000020          20         1           total
> >
> >
> >
> > On 2/29/2024 8:47 AM, david kerns wrote:
> > > sorry for doubting your due diligence
> > >
> > > On Thu, Feb 29, 2024 at 7:44 AM Ed Morton <mortoneccc@comcast.net> wrote:
> > >
> > >     Yes, I tried the same with `sed` and there was no performance
> > >     difference between:
> > >
> > >     No redirection:
> > >
> > >         $ time { sed -i -n 'p' file; }
> > >
> > >         real    0m0.027s
> > >         user    0m0.000s
> > >         sys     0m0.000s
> > >
> > >     Redirection:
> > >
> > >         $ time { sed -i -n 'p' file >/dev/null; }
> > >
> > >         real    0m0.023s
> > >         user    0m0.000s
> > >         sys     0m0.000s
> > >
> > >     The SE answer I linked,
> > >     https://unix.stackexchange.com/a/771263/133219, shows strace being
> > >     used on gawk with a 10-line input file and there being 10 writes
> > >     (same as number of input lines) when used without redirection
> > >     (look at the "calls" column below):
> > >
> > >         $ strace -e trace=write -c gawk -i inplace 1 somefile
> > >         % time     seconds  usecs/call     calls    errors syscall
> > >         ------ ----------- ----------- --------- --------- ----------------
> > >         100.00    0.000098           9        10           write
> > >         ------ ----------- ----------- --------- --------- ----------------
> > >         100.00    0.000098           9        10           total
> > >
> > >     vs 1 write when used with redirection:
> > >
> > >         $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
> > >         % time     seconds  usecs/call     calls    errors syscall
> > >         ------ ----------- ----------- --------- --------- ----------------
> > >         100.00    0.000020          20         1           write
> > >         ------ ----------- ----------- --------- --------- ----------------
> > >         100.00    0.000020          20         1           total
> > >
> > >     so buffering does seem likely to be the source of the time difference.
> > >
> > >     Regards,
> > >
> > >         Ed.
> > >
> > >     On 2/29/2024 8:32 AM, david kerns wrote:
> > >>     glad you checked that...
> > >>     have you tried other commands? ... perhaps the closing of stdout by the
> > >>     shell before the fork/exec is causing it.
> > >>
> > >>     On Thu, Feb 29, 2024 at 6:57 AM Ed Morton <mortoneccc@comcast.net> wrote:
> > >>
> > >>>     David - that was 3rd-run timing to ensure caching wasn't the issue.
> > >>>
> > >>>          Ed.
> > >>>
> > >>>     On 2/29/2024 7:35 AM, david kerns wrote:
> > >>>
> > >>>     swap the order (do the redirect one first) I suspect the input file was
> > >>>     still cached for the 2nd run
> > >>>
> > >>>
> > >>>     On Thu, Feb 29, 2024 at 5:52 AM Ed Morton <mortoneccc@comcast.net> wrote:
> > >>>
> > >>>
> > >>>     Someone on StackExchange was asking about their gawk script being slow
> > >>>     and someone else (https://unix.stackexchange.com/a/771263/133219)
> > >>>     pointed out that using `-i inplace` is an order of magnitude slower if
> > >>>     you don't also redirect stdout, which seems unintuitive at best.
> > >>>
> > >>>     For example given a 1 million line input file created by:
> > >>>
> > >>>          $ seq 1000000 > file1m
> > >>>
> > >>>     and using:
> > >>>
> > >>>          $ awk --version
> > >>>          GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)
> > >>>
> > >>>     If we just reproduce it as-is using `-i inplace` the timing is:
> > >>>
> > >>>          $ time { awk -i inplace '1' file1m; }
> > >>>
> > >>>          real    0m2.544s
> > >>>          user    0m0.265s
> > >>>          sys     0m1.843s
> > >>>
> > >>>     whereas if we redirect stdout even though there is no stdout produced:
> > >>>
> > >>>          $ time { awk -i inplace '1' file1m >/dev/null; }
> > >>>
> > >>>          real    0m0.236s
> > >>>          user    0m0.187s
> > >>>          sys     0m0.000s
> > >>>
> > >>>     As you can see, that second execution with stdout redirected ran an order
> > >>>     of magnitude faster. The person who investigated thinks it's due to the
> > >>>     first execution being considered "interactive", since stdout isn't
> > >>>     technically being redirected, and so doing line buffering, vs the second
> > >>>     execution being "non-interactive" due to stdout being redirected and so
> > >>>     using a larger buffer.
> > >>>
> > >>>     If that is the case, could gawk be updated to consider "inplace" editing
> > >>>     as non-interactive? If not, I think it'd be worth a statement in the
> > >>>     manual about this difference in performance between the two.
> > >>>
> > >>>           Ed.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >


