[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely la
From: |
arnold |
Subject: |
Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs |
Date: |
Wed, 26 Jan 2022 23:42:07 -0700 |
User-agent: |
Heirloom mailx 12.5 7/5/10 |
Hello.
Thank you for taking the time to report an issue.
Please read the instructions for bug reporting at
https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
It was updated recently, please reread it if you haven't looked at
it in a long time.
I'm afraid that your "proof of concept" code is so unreadable that
I won't even try to determine what it's doing.
With respect to GMP performance, I suggest that you report an
issue directly to the GMP developers. The gawk code that uses
GMP isn't going to change. For this reason, I don't think you
need to bother redoing your example code, unless you want to send
it to the GMP developers.
Thanks,
Arnold
"Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:
> Hi GAWK team,
> Not sure if I should be reporting this to the gawk team or the GnuMP team -
> it's neither a bug or nor new feature per se, but merely a performance one -
> I've noticed on that for extremely large inputs, say, integers with more than
> 5 million digits, the bignum gawk -M mode could be somewhat slow, with
> possible room for speed improvement (proof-of-concept attached below). my
> gawk version information :
> GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
> Darwin MacBook-Pro.local 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28
> 20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000 arm64
> On my test-case of 71.5 million digits, it's possible to to save 63.7% of
> time , even in regular gawk, when compared to gawk -M :
> ==================
> gawk -be '{ print length($0) }' test.txt71591842
> test command ::
> echo; time ( pv -q < test.txt | LC_ALL=C gawkmx -b -e '{ print _2x_($0) }' )
> | xxh128sum ; echo; time (pv -q < test.txt | LC_ALL=C gawk -M -b -e '{ print
> $0+$0 }' ) | xxh128sum ; echo
> a56c2d2302d9ea8751f810b848e6f354 stdin( pv -q < test.txt | LC_ALL=C LC_ALL=C
> gawk -e "${mfx}" -b) 8.16s user 0.60s system 99% cpu 8.789 totalxxh128sum
> 0.01s user 0.00s system 0% cpu 8.789 total
> a56c2d2302d9ea8751f810b848e6f354 stdin( pv -q < test.txt | LC_ALL=C gawk -M
> -b -e ; ) 23.26s user 0.96s system 99% cpu 24.240 totalxxh128sum 0.01s user
> 0.00s system 0% cpu 24.239 total
> =====================
> In another test case of slightly over 275 milion digits, the time savings are
> 71.7% :
> fc2231bdff375b7870586d8dffc0841c stdin( pv -q < jwengowengonoewgnwoegn.txt |
> LC_ALL=C gawk -b -e "${mfx}" -e ; ) 27.25s user 6.48s system 98% cpu 34.350
> totalxxh128sum 0.04s user 0.02s system 0% cpu 34.349 total
> fc2231bdff375b7870586d8dffc0841c stdin( pv -q < jwengowengonoewgnwoegn.txt |
> LC_ALL=C gawk -M -b -e ; ) 116.58s user 4.78s system 99% cpu 2:01.42
> totalxxh128sum 0.04s user 0.02s system 0% cpu 2:01.42 total
> ====================
> Attached below is the full proof-of-concept code for function _2x_( ) to
> demostrate that the time savings are very much concrete and possible, not
> simply theoretical. The test file, being 70MB+, is a big large for email, but
> basically any file using ASCII digits 0-9 to represent any integer over 5
> million digits will do. The speed difference isn't noticeable for smaller
> inputs, and for inputs fewer than 7 digits, most likely it would be slower
> than gawk -M.
> I tried maximizing portability of the proof-of-concept function by refraining
> from any gawk-specific extensions of awk - this same code has also been
> tested in mawk 1.3.4, mawk 1.9.9.6, and macos 12.1 awk/nawk.
> It's entirely self-contained, performs no recursion, has no external
> dependencies, no bit-wise ops, and doesn't include any advanced/fancy math -
> just straight up grade-school long-form addition, doubling them 15-digits per
> chunk, and 2 chunks per while loop. The carrying part is performed by gsub( )
> prior to the while( ) loop, and thus, eliminating the need to track them
> along the way. Performance scales linearly at the log-base-10 level.
> ( I didn't include any copyright notice or credits to a priori since I don't
> think grade school addition is something copyrightable)
> Obviously this is simply awk scripting code and can't be directly
> incorporated into the C code base - I'm merely raising awareness of the issue.
> Thanks for your time.
> Jason
> ===================================================================================
> ( I can reformat the code for readability if you prefer ) :
> function _2x_(__,_,_______,____,_________,
> ___________,__________,____________, ________,_____,______,___)
> { if(__!~/[1-9]/) { return +_ } ___=(__~/^[-]/)
> sub("^[+-]?["(-"")"]*","",__) if (length(__)<(_____=((_+=++_)^_^_-!!_))+_)
> { if (_________^((_____+(_~____))^(_____-_)<+__)) { return
> (-!-"")^___*(+__+__) } } ___=substr("-",_~"",___) if (__!~"[5-9]") {
> gsub(/4/,"8",__)-gsub(/3/,"6",__)
> gsub(/2/,"4",__)-gsub(/1/,"2",__) return (___)__ };
> _______=____=________="";
> __=(_____=substr(________=(_++^_--+_)^_____,_))(__) gsub(/./,".",_____)
> sub("("(_____)")+$","_&",__) __=substr(__,index(__,"_")+!+"")
> ________+=+________;_=""; if
> ((gsub(_____,"&_",__)+gsub(/[_][5-9]/,".5&",__)*(+""))%2) {
> _=(":")(________+__+__) __=substr(__,index(__,"_")+!+"") }
> for(_____ in ______) { } if (______[_____=split(__,______,/[_]/)]=="") {
> delete ______[_____--] };__=____~____; __________=____________*=\
> ____________=___________=_________=32; while(__<_____) {
> if(!--__________){ __________=____________;_______=(_______)_;_="";
> if(!--___________){
> ___________=_________;____=(____)_______;_______=""; } }
> _=(_)(":")(+______[__++]*2+________)\
> (":")(+______[__++]*2+________) } ____=(____)(_______)(_)\
> (__==_____?(":")(________+2*______[_____]):"") gsub(/[:]+[^:]/,"",____)
> sub(/^0*/,"",____); return (___)____ }
- Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/26
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs,
arnold <=
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, arnold, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Andrew J. Schorr, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Andrew J. Schorr, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Wolfgang Laun, 2022/01/27
- RE: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, pjfarley3, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, arnold, 2022/01/28
- RE: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, pjfarley3, 2022/01/28
- Use of !!, arnold, 2022/01/29