[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely la
From: |
arnold |
Subject: |
Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs |
Date: |
Thu, 27 Jan 2022 06:22:30 -0700 |
User-agent: |
Heirloom mailx 12.5 7/5/10 |
If I can't understand the awk code I can't begin to figure out if
the problem is in gawk or elsewhere.
There is no "I/O" between gawk and GMP, it's all binary data structures
passed around in memory.
"Jason C. Kwan" <jasonckwan@yahoo.com> wrote:
> Have you tried even just straight up pasting it and run it as is ? when
> interpreted scripting for the same platform is generating some 65% reduction
> in execution time versus the built in compiled binary from C code source,
> don’t you think it should at least warrant a quick look to see if there are
> potential I/O bottlenecks between gawk and the GMP add-on module it uses ?
>
> i already used LC_ALL=C locale, per the bug reporting page, as you could see
> below. I even let gawk -M go second so if there were any gains from caching,
> it would be gawk -M that benefited from it. Maybe it’s an ARM thing, being
> that I’m on m1max instead of x64
>
> I only sent here first because I’m somewhat certain the GMP side would
> instantly send me back here, seeing that I couldn’t isolate out whether the
> potential bottlenecks exists on gawk side or gmp side, and I haven’t written
> in C since 2002. I even tried reading the gawk 5.1.1 source code to see if I
> could identify anything that could be help, but haven’t been lucky.
>
> Do u want me to read through the gmp source first ?
>
> Regards,
> Jason K
>
> > arnold@skeeve.com於2022年1月27日 01:42寫道:
> >
> > Hello.
> >
> > Thank you for taking the time to report an issue.
> >
> > Please read the instructions for bug reporting at
> > https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> > It was updated recently, please reread it if you haven't looked at
> > it in a long time.
> >
> > I'm afraid that your "proof of concept" code is so unreadable that
> > I won't even try to determine what it's doing.
> >
> > With respect to GMP performance, I suggest that you report an
> > issue directly to the GMP developers. The gawk code that uses
> > GMP isn't going to change. For this reason, I don't think you
> > need to bother redoing your example code, unless you want to send
> > it to the GMP developers.
> >
> > Thanks,
> >
> > Arnold
> >
> > "Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:
> >
> >> Hi GAWK team,
> >> Not sure if I should be reporting this to the gawk team or the GnuMP team
> >> - it's neither a bug or nor new feature per se, but merely a performance
> >> one - I've noticed on that for extremely large inputs, say, integers with
> >> more than 5 million digits, the bignum gawk -M mode could be somewhat
> >> slow, with possible room for speed improvement (proof-of-concept attached
> >> below). my gawk version information :
> >> GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
> >> Darwin MacBook-Pro.local 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28
> >> 20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000 arm64
> >> On my test-case of 71.5 million digits, it's possible to to save 63.7% of
> >> time , even in regular gawk, when compared to gawk -M :
> >> ==================
> >> gawk -be '{ print length($0) }' test.txt71591842
> >> test command ::
> >> echo; time ( pv -q < test.txt | LC_ALL=C gawkmx -b -e '{ print _2x_($0) }'
> >> ) | xxh128sum ; echo; time (pv -q < test.txt | LC_ALL=C gawk -M -b -e '{
> >> print $0+$0 }' ) | xxh128sum ; echo
> >> a56c2d2302d9ea8751f810b848e6f354 stdin( pv -q < test.txt | LC_ALL=C
> >> LC_ALL=C gawk -e "${mfx}" -b) 8.16s user 0.60s system 99% cpu 8.789
> >> totalxxh128sum 0.01s user 0.00s system 0% cpu 8.789 total
> >> a56c2d2302d9ea8751f810b848e6f354 stdin( pv -q < test.txt | LC_ALL=C gawk
> >> -M -b -e ; ) 23.26s user 0.96s system 99% cpu 24.240 totalxxh128sum
> >> 0.01s user 0.00s system 0% cpu 24.239 total
> >> =====================
> >> In another test case of slightly over 275 milion digits, the time savings
> >> are 71.7% :
> >> fc2231bdff375b7870586d8dffc0841c stdin( pv -q <
> >> jwengowengonoewgnwoegn.txt | LC_ALL=C gawk -b -e "${mfx}" -e ; ) 27.25s
> >> user 6.48s system 98% cpu 34.350 totalxxh128sum 0.04s user 0.02s system
> >> 0% cpu 34.349 total
> >> fc2231bdff375b7870586d8dffc0841c stdin( pv -q <
> >> jwengowengonoewgnwoegn.txt | LC_ALL=C gawk -M -b -e ; ) 116.58s user
> >> 4.78s system 99% cpu 2:01.42 totalxxh128sum 0.04s user 0.02s system 0%
> >> cpu 2:01.42 total
> >> ====================
> >> Attached below is the full proof-of-concept code for function _2x_( ) to
> >> demostrate that the time savings are very much concrete and possible, not
> >> simply theoretical. The test file, being 70MB+, is a big large for email,
> >> but basically any file using ASCII digits 0-9 to represent any integer
> >> over 5 million digits will do. The speed difference isn't noticeable for
> >> smaller inputs, and for inputs fewer than 7 digits, most likely it would
> >> be slower than gawk -M.
> >> I tried maximizing portability of the proof-of-concept function by
> >> refraining from any gawk-specific extensions of awk - this same code has
> >> also been tested in mawk 1.3.4, mawk 1.9.9.6, and macos 12.1 awk/nawk.
> >> It's entirely self-contained, performs no recursion, has no external
> >> dependencies, no bit-wise ops, and doesn't include any advanced/fancy math
> >> - just straight up grade-school long-form addition, doubling them
> >> 15-digits per chunk, and 2 chunks per while loop. The carrying part is
> >> performed by gsub( ) prior to the while( ) loop, and thus, eliminating the
> >> need to track them along the way. Performance scales linearly at the
> >> log-base-10 level.
> >> ( I didn't include any copyright notice or credits to a priori since I
> >> don't think grade school addition is something copyrightable)
> >> Obviously this is simply awk scripting code and can't be directly
> >> incorporated into the C code base - I'm merely raising awareness of the
> >> issue.
> >> Thanks for your time.
> >> Jason
> >> ===================================================================================
> >> ( I can reformat the code for readability if you prefer ) :
> >> function _2x_(__,_,_______,____,_________,
> >> ___________,__________,____________,
> >> ________,_____,______,___) { if(__!~/[1-9]/) { return +_ }
> >> ___=(__~/^[-]/) sub("^[+-]?["(-"")"]*","",__) if
> >> (length(__)<(_____=((_+=++_)^_^_-!!_))+_) { if
> >> (_________^((_____+(_~____))^(_____-_)<+__)) { return
> >> (-!-"")^___*(+__+__) } } ___=substr("-",_~"",___) if
> >> (__!~"[5-9]") { gsub(/4/,"8",__)-gsub(/3/,"6",__)
> >> gsub(/2/,"4",__)-gsub(/1/,"2",__) return (___)__ };
> >> _______=____=________="";
> >> __=(_____=substr(________=(_++^_--+_)^_____,_))(__) gsub(/./,".",_____)
> >> sub("("(_____)")+$","_&",__) __=substr(__,index(__,"_")+!+"")
> >> ________+=+________;_=""; if
> >> ((gsub(_____,"&_",__)+gsub(/[_][5-9]/,".5&",__)*(+""))%2) {
> >> _=(":")(________+__+__) __=substr(__,index(__,"_")+!+"") }
> >> for(_____ in ______) { } if (______[_____=split(__,______,/[_]/)]=="")
> >> { delete ______[_____--] };__=____~____;
> >> __________=____________*=\ ____________=___________=_________=32;
> >> while(__<_____) { if(!--__________){
> >> __________=____________;_______=(_______)_;_="";
> >> if(!--___________){
> >> ___________=_________;____=(____)_______;_______=""; } }
> >> _=(_)(":")(+______[__++]*2+________)\
> >> (":")(+______[__++]*2+________) } ____=(____)(_______)(_)\
> >> (__==_____?(":")(________+2*______[_____]):"") gsub(/[:]+[^:]/,"",____)
> >> sub(/^0*/,"",____); return (___)____ }
>
- Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/26
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, arnold, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs,
arnold <=
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Andrew J. Schorr, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Andrew J. Schorr, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Wolfgang Laun, 2022/01/27
- RE: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, pjfarley3, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, arnold, 2022/01/28
- RE: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, pjfarley3, 2022/01/28
- Use of !!, arnold, 2022/01/29