From: arnold
Subject: Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
Date: Thu, 29 Feb 2024 10:01:30 -0700
User-agent: Heirloom mailx 12.5 7/5/10

Adding a setvbuf() call didn't work.
Using freopen() instead of playing games with the file descriptor
might work, but it's also a hassle.
Andy - can you look at this? Try this: in inplace_begin
- dup stdout to a new fd and save it, as is done now
- call freopen() on stdout to open the file; this should set block buffering
In inplace_end, call freopen() on /dev/fd/NNN, where NNN is the duped
fd from the original stdout.
Thanks,
Arnold
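
The dup/freopen sequence proposed above could look roughly like the following. This is only a sketch, not code from gawk's inplace extension: the helper names redirect_stdout_to() and restore_stdout_from() are hypothetical, it assumes the platform provides /dev/fd, and it reopens the saved descriptor in append mode (a choice made here, not part of the proposal) so that an original stdout that is a regular file is not truncated.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helpers; the names are not taken from gawk. */

/* Point stdout at `path`. Returns a dup of the original stdout
   descriptor for later restoration, or -1 on failure. */
static int redirect_stdout_to(const char *path)
{
    int saved_fd;

    fflush(stdout);                 /* drain output meant for the old target */
    saved_fd = dup(STDOUT_FILENO);  /* keep the original stdout reachable */
    if (saved_fd < 0)
        return -1;

    /* freopen() closes and reopens the stream, so the library gets a
       chance to choose a buffering mode for the new target. Note that
       stdout is closed if this call fails. */
    if (freopen(path, "w", stdout) == NULL) {
        close(saved_fd);
        return -1;
    }
    return saved_fd;
}

/* Reattach stdout to the saved descriptor through /dev/fd/NNN.
   Returns 0 on success, -1 on failure. */
static int restore_stdout_from(int saved_fd)
{
    char devfd[32];

    snprintf(devfd, sizeof devfd, "/dev/fd/%d", saved_fd);

    /* freopen() flushes pending output to the edited file first.
       "a" is an assumption of this sketch: "w" would truncate a
       regular file on systems where /dev/fd/NNN reopens the file. */
    if (freopen(devfd, "a", stdout) == NULL)
        return -1;

    close(saved_fd);                /* the duplicate is no longer needed */
    return 0;
}
```

A caller would invoke redirect_stdout_to() where inplace_begin runs and restore_stdout_from() where inplace_end runs. Whether the reopened stream really ends up block buffered depends on the C library, so that part would still need to be confirmed with strace.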
arnold@skeeve.com wrote:
> I looked at this briefly. It makes sense as to why it's working
> the way it currently does. The extension temporarily replaces standard
> output's file descriptor with one on a new temporary file.
> But the rest of the stdio.h mechanics are left alone. So if stdout
> was initially a tty, it remains line buffered; if it was a file,
> it's block buffered.
>
> According to the setvbuf(3) man page, setvbuf() can only be called
> before any I/O operations are done on a FILE *. I'm not sure if
> it's safe to do so in the extension, but maybe it is.
>
> I will poke at it a little bit; it's not a cut-and-dried easy fix.
>
> Arnold
>
> Ed Morton <mortoneccc@comcast.net> wrote:
>
> > No problem. Trying again to post the strace output as it got mangled by
> > something in transit last time:
> >
> > The SE answer I linked, https://unix.stackexchange.com/a/771263/133219,
> > shows strace being used on gawk with a 10-line input file and there
> > being 10 writes (same as number of input lines) when used without
> > redirection (look at the "calls" column below):
> >
> > $ strace -e trace=write -c gawk -i inplace 1 somefile
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.000098           9        10           write
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.000098           9        10           total
> >
> > vs 1 write when used with redirection:
> >
> > $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.000020          20         1           write
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    0.000020          20         1           total
> >
> >
> >
> > On 2/29/2024 8:47 AM, david kerns wrote:
> > > sorry for doubting your due diligence
> > >
> > > On Thu, Feb 29, 2024 at 7:44 AM Ed Morton <mortoneccc@comcast.net> wrote:
> > >
> > > Yes, I tried the same with `sed` and there was no performance
> > > difference between:
> > >
> > > No redirection:
> > >
> > > $ time { sed -i -n 'p' file; }
> > >
> > > real 0m0.027s
> > > user 0m0.000s
> > > sys 0m0.000s
> > >
> > > Redirection:
> > >
> > > $ time { sed -i -n 'p' file >/dev/null; }
> > >
> > > real 0m0.023s
> > > user 0m0.000s
> > > sys 0m0.000s
> > >
> > > The SE answer I linked,
> > > https://unix.stackexchange.com/a/771263/133219, shows strace being
> > > used on gawk with a 10-line input file and there being 10 writes
> > > (same as number of input lines) when used without redirection
> > > > (look at the "calls" column below):
> > > >> $ strace -e trace=write -c gawk -i inplace 1 somefile
> > > >> % time     seconds  usecs/call     calls    errors syscall
> > > >> ------ ----------- ----------- --------- --------- ----------------
> > > >> 100.00    0.000098           9        10           write
> > > >> ------ ----------- ----------- --------- --------- ----------------
> > > >> 100.00    0.000098           9        10           total
> > >
> > > > vs 1 write when used with redirection:
> > >
> > > >> $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
> > > >> % time     seconds  usecs/call     calls    errors syscall
> > > >> ------ ----------- ----------- --------- --------- ----------------
> > > >> 100.00    0.000020          20         1           write
> > > >> ------ ----------- ----------- --------- --------- ----------------
> > > >> 100.00    0.000020          20         1           total
> > >
> > > so buffering does seem likely to be the source of the time difference.
> > >
> > > Regards,
> > >
> > > Ed.
> > >
> > > On 2/29/2024 8:32 AM, david kerns wrote:
> > >> glad you checked that...
> > > >> have you tried other commands? ... perhaps the closing of stdout by the
> > > >> shell before the fork/exec is causing it.
> > >>
> > > >> On Thu, Feb 29, 2024 at 6:57 AM Ed Morton <mortoneccc@comcast.net> wrote:
> > >>
> > >>> David - that was 3rd-run timing to ensure caching wasn't the issue.
> > >>>
> > >>> Ed.
> > >>>
> > >>> On 2/29/2024 7:35 AM, david kerns wrote:
> > >>>
> > > >>> swap the order (do the redirect one first) I suspect the input file was
> > > >>> still cached for the 2nd run
> > >>>
> > >>>
> > > >>> On Thu, Feb 29, 2024 at 5:52 AM Ed Morton <mortoneccc@comcast.net> wrote:
> > >>>
> > >>>
> > > >>> Someone on StackExchange was asking about their gawk script being slow
> > > >>> and someone else (https://unix.stackexchange.com/a/771263/133219)
> > > >>> pointed out that using `-i inplace` is an order of magnitude slower if
> > > >>> you don't also redirect stdout which seems unintuitive at best.
> > >>>
> > >>> For example given a 1 million line input file created by:
> > >>>
> > >>> $ seq 1000000 > file1m
> > >>>
> > >>> and using:
> > >>>
> > >>> $ awk --version
> > > >>> GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)
> > >>>
> > >>> If we just reproduce it as-is using `-i inplace` the timing is:
> > >>>
> > >>> $ time { awk -i inplace '1' file1m; }
> > >>>
> > >>> real 0m2.544s
> > >>> user 0m0.265s
> > >>> sys 0m1.843s
> > >>>
> > > >>> whereas if we redirect stdout even though there is no stdout produced:
> > >>>
> > >>> $ time { awk -i inplace '1' file1m >/dev/null; }
> > >>>
> > >>> real 0m0.236s
> > >>> user 0m0.187s
> > >>> sys 0m0.000s
> > >>>
> > > >>> As you can see that second execution with stdout redirected ran an order
> > > >>> of magnitude faster. The person who investigated thinks it's due to the
> > > >>> first execution being considered "interactive" since stdout isn't
> > > >>> technically being redirected and so doing line buffering vs the second
> > > >>> execution being "non-interactive" due to stdout being redirected and so
> > > >>> using a larger buffer.
> > >>>
> > > >>> If that is the case, could gawk be updated to consider "inplace" editing
> > > >>> as non-interactive? If not, I think it'd be worth a statement in the
> > > >>> manual about this difference in performance between the 2.
> > >>>
> > >>> Ed.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >
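
The 10-writes-versus-1 pattern in the strace output quoted above can be reproduced outside gawk with plain stdio. The sketch below is illustrative only: the function name lines_flushed_early() is hypothetical, and it picks the buffering mode explicitly with setvbuf() rather than relying on the library's tty detection. It counts how many short lines reach the file before the stream is closed, using the file size as a stand-in for observing write() calls.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: write `nlines` short lines to `path` with the
   given setvbuf() mode (_IOLBF or _IOFBF) and return how many of them
   reached the file before the stream was closed, or -1 on error. */
static int lines_flushed_early(const char *path, int mode, int nlines)
{
    FILE *fp = fopen(path, "w");
    struct stat st;
    off_t last_size = 0;
    int early = 0;

    if (fp == NULL)
        return -1;

    /* setvbuf() must be called before any other I/O on the stream. */
    if (setvbuf(fp, NULL, mode, BUFSIZ) != 0) {
        fclose(fp);
        return -1;
    }

    for (int i = 0; i < nlines; i++) {
        fprintf(fp, "%d\n", i);

        /* If the file grew, this line was written out right away. */
        if (fstat(fileno(fp), &st) == 0 && st.st_size > last_size) {
            early++;
            last_size = st.st_size;
        }
    }

    fclose(fp);   /* whatever is still buffered is written here */
    return early;
}
```

With _IOLBF every line should reach the file as soon as its newline is written, while with _IOFBF a handful of short lines stays in the buffer until fclose(), which matches the per-line writes seen when stdout started out as a tty.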