[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large
From: |
Jason C. Kwan |
Subject: |
Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs |
Date: |
Wed, 26 Jan 2022 22:35:22 +0000 (UTC) |
Hi GAWK team,
Not sure if I should be reporting this to the gawk team or the GnuMP team -
it's neither a bug or nor new feature per se, but merely a performance one -
I've noticed on that for extremely large inputs, say, integers with more than 5
million digits, the bignum gawk -M mode could be somewhat slow, with possible
room for speed improvement (proof-of-concept attached below). my gawk version
information :
GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
Darwin MacBook-Pro.local 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28
20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000 arm64
On my test-case of 71.5 million digits, it's possible to to save 63.7% of time
, even in regular gawk, when compared to gawk -M :
==================
gawk -be '{ print length($0) }' test.txt71591842
test command ::
echo; time ( pv -q < test.txt | LC_ALL=C gawkmx -b -e '{ print _2x_($0) }' ) |
xxh128sum ; echo; time (pv -q < test.txt | LC_ALL=C gawk -M -b -e '{ print
$0+$0 }' ) | xxh128sum ; echo
a56c2d2302d9ea8751f810b848e6f354 stdin( pv -q < test.txt | LC_ALL=C LC_ALL=C
gawk -e "${mfx}" -b) 8.16s user 0.60s system 99% cpu 8.789 totalxxh128sum
0.01s user 0.00s system 0% cpu 8.789 total
a56c2d2302d9ea8751f810b848e6f354 stdin( pv -q < test.txt | LC_ALL=C gawk -M -b
-e ; ) 23.26s user 0.96s system 99% cpu 24.240 totalxxh128sum 0.01s user
0.00s system 0% cpu 24.239 total
=====================
In another test case of slightly over 275 milion digits, the time savings are
71.7% :
fc2231bdff375b7870586d8dffc0841c stdin( pv -q < jwengowengonoewgnwoegn.txt |
LC_ALL=C gawk -b -e "${mfx}" -e ; ) 27.25s user 6.48s system 98% cpu 34.350
totalxxh128sum 0.04s user 0.02s system 0% cpu 34.349 total
fc2231bdff375b7870586d8dffc0841c stdin( pv -q < jwengowengonoewgnwoegn.txt |
LC_ALL=C gawk -M -b -e ; ) 116.58s user 4.78s system 99% cpu 2:01.42
totalxxh128sum 0.04s user 0.02s system 0% cpu 2:01.42 total
====================
Attached below is the full proof-of-concept code for function _2x_( ) to
demostrate that the time savings are very much concrete and possible, not
simply theoretical. The test file, being 70MB+, is a big large for email, but
basically any file using ASCII digits 0-9 to represent any integer over 5
million digits will do. The speed difference isn't noticeable for smaller
inputs, and for inputs fewer than 7 digits, most likely it would be slower than
gawk -M.
I tried maximizing portability of the proof-of-concept function by refraining
from any gawk-specific extensions of awk - this same code has also been tested
in mawk 1.3.4, mawk 1.9.9.6, and macos 12.1 awk/nawk.
It's entirely self-contained, performs no recursion, has no external
dependencies, no bit-wise ops, and doesn't include any advanced/fancy math -
just straight up grade-school long-form addition, doubling them 15-digits per
chunk, and 2 chunks per while loop. The carrying part is performed by gsub( )
prior to the while( ) loop, and thus, eliminating the need to track them along
the way. Performance scales linearly at the log-base-10 level.
( I didn't include any copyright notice or credits to a priori since I don't
think grade school addition is something copyrightable)
Obviously this is simply awk scripting code and can't be directly incorporated
into the C code base - I'm merely raising awareness of the issue.
Thanks for your time.
Jason
===================================================================================
( I can reformat the code for readability if you prefer ) :
function _2x_(__,_,_______,____,_________,
___________,__________,____________, ________,_____,______,___) {
if(__!~/[1-9]/) { return +_ } ___=(__~/^[-]/)
sub("^[+-]?["(-"")"]*","",__) if (length(__)<(_____=((_+=++_)^_^_-!!_))+_) {
if (_________^((_____+(_~____))^(_____-_)<+__)) { return
(-!-"")^___*(+__+__) } } ___=substr("-",_~"",___) if (__!~"[5-9]") {
gsub(/4/,"8",__)-gsub(/3/,"6",__) gsub(/2/,"4",__)-gsub(/1/,"2",__)
return (___)__ }; _______=____=________="";
__=(_____=substr(________=(_++^_--+_)^_____,_))(__) gsub(/./,".",_____)
sub("("(_____)")+$","_&",__) __=substr(__,index(__,"_")+!+"")
________+=+________;_=""; if
((gsub(_____,"&_",__)+gsub(/[_][5-9]/,".5&",__)*(+""))%2) {
_=(":")(________+__+__) __=substr(__,index(__,"_")+!+"") }
for(_____ in ______) { } if (______[_____=split(__,______,/[_]/)]=="") {
delete ______[_____--] };__=____~____; __________=____________*=\
____________=___________=_________=32; while(__<_____) { if(!--__________){
__________=____________;_______=(_______)_;_="";
if(!--___________){ ___________=_________;____=(____)_______;_______="";
} } _=(_)(":")(+______[__++]*2+________)\
(":")(+______[__++]*2+________) } ____=(____)(_______)(_)\
(__==_____?(":")(________+2*______[_____]):"") gsub(/[:]+[^:]/,"",____)
sub(/^0*/,"",____); return (___)____ }
- Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs,
Jason C. Kwan <=
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, arnold, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, arnold, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Andrew J. Schorr, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Jason C. Kwan, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Andrew J. Schorr, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, Wolfgang Laun, 2022/01/27
- RE: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, pjfarley3, 2022/01/27
- Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, arnold, 2022/01/28
- RE: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs, pjfarley3, 2022/01/28