Re: Help-gawk Digest, Vol 1, Issue 3
From: J Naman
Subject: Re: Help-gawk Digest, Vol 1, Issue 3
Date: Mon, 19 Jul 2021 13:58:38 -0400
BTW: I benchmarked sprintf()+gsub() at 16.15 secs for 1,000 loops. Per loop,
that is more than one HUNDRED times slower than doubling + substr() (in a
loop!), which took 12.36 secs for 100,000 loops (unless my benchmark code
had a bug ...).
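For anyone who wants to repeat the comparison, here is a minimal sketch of a timing harness. It is not the benchmark code behind the numbers above. The names srep_bench, srep_sub and srep_dbl are made up for illustration, the string is assumed to consist of a single repeated character, and the format string is built as "%" n "s" rather than "%*s" so that it also runs on awks without "*" width support.

```shell
# srep_bench METHOD LOOPS N
# Build an N-character string of "x" LOOPS times using METHOD
# ("sub" = sprintf()+gsub(), anything else = doubling + substr())
# and print the method name and the length of the last result.
# All names here are hypothetical, chosen for this sketch only.
srep_bench() {
  awk -v method="$1" -v loops="$2" -v n="$3" '
    # Pad to n spaces, then replace every space with s.
    function srep_sub(n, s,   res) {
      if (n < 1) return ""
      res = sprintf("%" n "s", "")
      gsub(/ /, s, res)
      return res
    }
    # Double while the result still fits, then top up the remainder.
    # Assumes length(s) <= n (true for a single character and n >= 1).
    function srep_dbl(n, s,   res) {
      if (n < 1) return ""
      res = s
      while (length(res) * 2 <= n) res = res res
      return res substr(res, 1, n - length(res))
    }
    BEGIN {
      for (i = 1; i <= loops; i++)
        r = (method == "sub") ? srep_sub(n, "x") : srep_dbl(n, "x")
      print method, length(r)
    }'
}

# Small sanity run; use larger LOOPS and N for real measurements.
srep_bench sub 100 1000
srep_bench dbl 100 1000
```

To get timings, each call can be run under the shell's time keyword (in bash, for example, time srep_bench sub 1000 100000). The absolute numbers will depend on the machine and on the awk implementation.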
Someone mentioned a "bug" in gsub() ...
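The behaviour referred to is probably the "&" handling described in Note 1 of Message 7 below: in the replacement string of gsub(), an unescaped "&" stands for the matched text. The following sketch (amp_demo is a hypothetical name) shows how that affects the sprintf()+gsub() approach when the character to repeat is "&", compared with doubling + substr(). It only demonstrates the effect and does not attempt to fix the gsub() version.

```shell
# amp_demo: repeat "&" three times with both approaches and show the results.
amp_demo() {
  awk 'BEGIN {
    n = 3; s = "&"

    # sprintf()+gsub(): each space is "replaced" by the matched text,
    # that is, by a space, so the padding comes back unchanged.
    res = sprintf("%" n "s", "")
    gsub(/ /, s, res)

    # Doubling + substr() does not interpret the string at all.
    dbl = s
    while (length(dbl) < n) dbl = dbl dbl
    dbl = substr(dbl, 1, n)

    printf "sub=[%s] dbl=[%s]\n", res, dbl
  }'
}

amp_demo
# prints: sub=[   ] dbl=[&&&]
```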
On Mon, Jul 19, 2021 at 4:24 AM <help-gawk-request@gnu.org> wrote:
> Today's Topics:
>
> 1. Why string can be added with 0? (Peng Yu)
> 2. Re: Why string can be added with 0? (Neil R. Ormos)
> 3. Re: Why string can be added with 0? (Bob Proulx)
> 4. Re: How to Generate a Long String of the Same Character (Bob Proulx)
> 5. Re: How to Generate a Long String of the Same Character (Neil R. Ormos)
> 6. Re: Why string can be added with 0? (Wolfgang Laun)
> 7. Re: How to Generate a Long String of the Same Character (Wolfgang Laun)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 18 Jul 2021 21:41:16 -0500
> From: Peng Yu <pengyu.ut@gmail.com>
> To: help-gawk@gnu.org
> Subject: Why string can be added with 0?
> Message-ID:
> <CABrM6w=xSPGzqU=bExg8_ujO7ycDtuY8T6jKcGk4S=
> bAvdcUwA@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> I see this. I don't find anything about it in 6.2.1 Arithmetic Operators.
>
> $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> string 0
>
> But it seems that there should be an error to add a string to 0? Is it
> better to show some error instead of assuming a string as 0 in the
> context of arithmetic operations? Thanks.
>
> --
> Regards,
> Peng
>
>
>
> ------------------------------
>
> Message: 2
> Date: Sun, 18 Jul 2021 23:28:55 -0500 (CDT)
> From: "Neil R. Ormos" <ormos-gnulists17@ormos.org>
> To: Help Gawk List <help-gawk@gnu.org>
> Subject: Re: Why string can be added with 0?
> Message-ID: <Pine.GSO.4.64.2107182252020.27912@shell3.ripco.com>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
>
> Peng Yu wrote:
>
> > I see this. I don't find anything about it in
> > 6.2.1 Arithmetic Operators.
>
> > $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> > string 0
>
> > But it seems that there should be an error to
> > add a string to 0?
>
> It is not an error.
>
> | 6.1.4.1 How awk Converts Between Strings and Numbers
>
> | Strings are converted to numbers and numbers are
> | converted to strings, if the context of the awk
> | program demands it. For example, if the value of
> | either foo or bar in the expression 'foo + bar'
> | happens to be a string, it is converted to a
> | number before the addition is performed. [...]
>
> | [...] To force a string to be converted to a
> | number, add zero to that string. A string is
> | converted to a number by interpreting any
> | numeric prefix of the string as numerals: [...]
> | Strings that can't be interpreted as valid
> | numbers convert to zero.
>
> > Is it better to show some error instead of
> > assuming a string as 0 in the context of
> > arithmetic operations?
>
> No. Awk's behavior, when an arithmetic operation
> involving a string is attempted, of interpreting
> as numeric however much of the string appears to
> be numeric, without noisy error messages, is a
> feature that makes it easier to write concise
> programs that handle mixed string and numeric
> input.
>
> Besides, it has worked that way for eons, and
> programmers rely on it. Changing it now would
> break zillions of working programs.
>
> If you prefer a programming language that does
> intrusive type checking or routinely changes out
> from under you so existing programs are rendered
> useless, there are plenty of choices.
>
>
>
> ------------------------------
>
> Message: 3
> Date: Sun, 18 Jul 2021 22:35:06 -0600
> From: Bob Proulx <bob@proulx.com>
> To: help-gawk@gnu.org
> Subject: Re: Why string can be added with 0?
> Message-ID: <20210718221845320319290@bob.proulx.com>
> Content-Type: text/plain; charset=us-ascii
>
> Peng Yu wrote:
> > I see this. I don't find anything about it in 6.2.1 Arithmetic Operators.
>
> You were very close to the best section of the manual. In the section
> before that one is where the gawk manual talks about strings and
> numbers. In my manual it is section 6.1.4.1 "How 'awk' Converts
> Between Strings and Numbers". The answer you seek is there.
>
>
> https://www.gnu.org/software/gawk/manual/html_node/Strings-And-Numbers.html
>
> 6.1.4.1 How 'awk' Converts Between Strings and Numbers
> ......................................................
>
> Strings are converted to numbers and numbers are converted to strings,
> if the context of the 'awk' program demands it. For example, if the
> value of either 'foo' or 'bar' in the expression 'foo + bar' happens to
> be a string, it is converted to a number before the addition is
> performed. If numeric values appear in string concatenation, they are
> converted to strings. Consider the following:
>
> two = 2; three = 3
> print (two three) + 4
>
> This prints the (numeric) value 27. The numeric values of the variables
> 'two' and 'three' are converted to strings and concatenated together.
> The resulting string is converted back to the number 23, to which 4 is
> then added.
>
> If, for some reason, you need to force a number to be converted to a
> string, concatenate that number with the empty string, '""'. To force a
> string to be converted to a number, add zero to that string. A string
> is converted to a number by interpreting any numeric prefix of the
> string as numerals: '"2.5"' converts to 2.5, '"1e3"' converts to 1,000,
> and '"25fix"' has a numeric value of 25. Strings that can't be
> interpreted as valid numbers convert to zero.
>
> I abbreviated the information here. See the manual for the full
> section with more detail than I included here.
>
> > $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> > string 0
>
> That's correct. It's a string but then adding 0 to the string forces
> it to be a number.
>
> > But it seems that there should be an error to add a string to 0? Is it
> > better to show some error instead of assuming a string as 0 in the
> > context of arithmetic operations? Thanks.
>
> AWK was one of the first of the little languages to try to dynamically
> do the right thing to simplify the programmer's task of writing a
> program. Perl follows in the tradition of AWK here; other dynamic
> languages such as Python and Ruby share the productivity goal but raise
> an error instead of silently converting a string in arithmetic.
> It's a design paradigm used to enhance programmer productivity. In AWK,
> if a value is used like a string then it is converted to a string. If it
> is used like a number then it is converted to a number.
>
> Bob
>
>
>
> ------------------------------
>
> Message: 4
> Date: Sun, 18 Jul 2021 22:59:53 -0600
> From: Bob Proulx <bob@proulx.com>
> To: help-gawk@gnu.org
> Subject: Re: How to Generate a Long String of the Same Character
> Message-ID: <20210718224110340130477@bob.proulx.com>
> Content-Type: text/plain; charset=us-ascii
>
> Neil R. Ormos wrote:
> > In a message on the bug-gawk list, Ed Mortin wrote:
> > That should have been "Ed Morton".
> > > On an online forum someone asked how to generate a
> > > string of 100,000,000 "x"s. They had tried this in
> > > a BEGIN section:
> > >
> > > for(i=1;i<=100000000;i++) s = s "x"
> >...
> > Building a big string by iterating in tiny chunks
> > would seem to invite poor performance.
>
> Agreed. Growing by one character at a time definitely seems
> inefficient.
>
> > Instead, why not append the string to itself,
> > doubling its size with each iteration? For
> > example:
> >
> > time ~/.local/bin/gawk-5.1.0 \
> > 'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a};
> > a=substr(a, 1, sizelim); print length(a);}'
>
> I think that is probably one of the best ways with awk.
>
> My mind first thought that it would be better to produce a file that
> contained 100 million "x"s and then read it into awk.
>
> awk '{print length($0)}' < bigfileofx
>
> Of course that simply changes the problem around to creating that
> file! This is rather a silly response but it's fun just the same.
>
> Well... There are certainly many ways to do it. I would use dd for
> creating the byte stream of the right size. But there seems no way to
> use dd to produce "x" characters. But it can read /dev/zero okay.
> And tr can translate zeros to other characters such as an "x".
>
> $ dd status=none if=/dev/zero bs=1 count=10 | tr "\0" "x"; echo
> xxxxxxxxxx
>
> $ dd status=none if=/dev/zero bs=1 count=10 | tr "\0" "x" | wc -c
> 10
>
> That looks promising. Let's fire it up for the requested 100 million
> size.
>
> $ time dd status=none if=/dev/zero bs=1M count=100 | tr "\0" "x" | wc -c
>
> 104857600
>
> real 0m0.179s
> user 0m0.126s
> sys 0m0.167s
>
> Looks like the right size. Let's get it into awk.
>
> $ time dd status=none if=/dev/zero bs=1M count=100 | tr "\0" "x" |
> awk '{print length($0)}'
> 104857600
>
> real 0m0.624s
> user 0m0.451s
> sys 0m0.398s
>
> That's looking pretty good. Let's compare it against the reference
> above so one can see how slow my machine is about such things.
>
> $ time awk 'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a}; a=substr(a, 1, sizelim); print length(a);}'
>
> 100000000
>
> real 0m1.469s
> user 0m0.815s
> sys 0m0.654s
>
> I am running this on an older Intel Core i5 CPU 750 2.67GHz.
>
> > On my not-very-fast machine, according to the time
> > built-in, that takes 0.17 seconds of elapsed time.
>
> Faster than my daily driving desktop! :-)
>
> > Yes, worst-case, if the intended string has length
> > (2^N)+1, you wastefully build a string of size
> > 2^(N+1) and trim off almost half. So maybe on
> > some machines, building the string in
> > single-character units would work but the doubling
> > would not.
>
> Fun stuff! And illustrates the usefulness of benchmarking to collect
> data.
>
> Bob
>
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 19 Jul 2021 01:46:40 -0500 (CDT)
> From: "Neil R. Ormos" <ormos-gnulists17@ormos.org>
> To: Help Gawk List <help-gawk@gnu.org>
> Subject: Re: How to Generate a Long String of the Same Character
> Message-ID: <Pine.GSO.4.64.2107190129410.3912@shell3.ripco.com>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
>
> Bob Proulx wrote:
>
> > That's looking pretty good. Let's compare it against the reference
> > above so one can see how slow my machine is about such things.
> >
> > $ time awk 'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a}; a=substr(a, 1, sizelim); print length(a);}'
>
> > 100000000
> >
> > real 0m1.469s
> > user 0m0.815s
> > sys 0m0.654s
> >
> > I am running this on an older Intel Core i5 CPU 750 2.67GHz.
>
> That seems really odd. It takes under 0.5 seconds
> of elapsed time on a machine with a 25-watt mobile
> Core 2 Duo CPU that maxes out at 2.26 GHz.
>
> I tried your dd | tr | gawk solution and found the
> times vary bizarrely on machines where the pure
> gawk solution has run-times roughly in-line with
> what I'd expect. Even the elapsed times of
> consecutive individual runs of the dd | tr | gawk
> solution vary strangely.
>
> Also, I think the blocksize parameter should be
> bs=1MB to get blocks of 10^6 bytes and not 2^20
> bytes.
>
>
>
> ------------------------------
>
> Message: 6
> Date: Mon, 19 Jul 2021 06:07:44 +0200
> From: Wolfgang Laun <wolfgang.laun@gmail.com>
> To: Peng Yu <pengyu.ut@gmail.com>
> Cc: help-gawk@gnu.org
> Subject: Re: Why string can be added with 0?
> Message-ID: <CANaj1LfpCAA9k_KomStu0mB9O2AZ72yzJP7oja7sZrv12spEaQ@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> See 6.1.4.1 How awk Converts Between Strings and Numbers.
> There are languages where the operator defines the kind of operation, and
> languages where the type of the argument decides what to do.
> If there can be doubts as to the correctness of the data, check.
>
> -W
>
>
>
> On Mon, 19 Jul 2021 at 05:36, Peng Yu <pengyu.ut@gmail.com> wrote:
>
> > I see this. I don't find anything about it in 6.2.1 Arithmetic Operators.
> >
> > $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> > string 0
> >
> > But it seems that there should be an error to add a string to 0? Is it
> > better to show some error instead of assuming a string as 0 in the
> > context of arithmetic operations? Thanks.
> >
> > --
> > Regards,
> > Peng
> >
> >
>
> --
> Wolfgang Laun
>
>
> ------------------------------
>
> Message: 7
> Date: Mon, 19 Jul 2021 08:51:24 +0200
> From: Wolfgang Laun <wolfgang.laun@gmail.com>
> To: "Neil R. Ormos" <ormos-gnulists17@ormos.org>, help-gawk@gnu.org
> Subject: Re: How to Generate a Long String of the Same Character
> Message-ID: <CANaj1Ldm_NOWCreZKSiAcehW0Z-kz6jkRYkMCkzAtrK_fbgV2Q@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Neil R. Ormos suggests the following code, which I put here as a function:
> function srep(n, s){
>     while( length(s) < n )
>         s = s s;
>     return substr( s, 1, n );
> }
> Neil points out that doubling in the while loop may overshoot the desired
> length by almost 100%, potentially causing the algorithm to fail. However,
> it is quite simple to avoid this:
> function srep(n, s){ # *dbl*
>     while( length(s)*2 <= n )
>         s = s s;
>     return s substr( s, 1, n - length(s) );
> }
>
> I have tried to keep track of all the solutions to the simple original
> problem, extending the functionality to string repetition (because this
> makes it more useful), and done some performance testing.
>
> The original question was whether this:
> function srep(n, s, res){ # *rpt*
>     for( i = 1; i <= n; ++i )
>         res = res s
>     return res;
> }
> could be improved. This was proposed as an improvement over the *rpt*
> version:
> function srep(n, s, res){ # *sub*
>     res = sprintf("%*s", n, "");
>     gsub( / /, s, res );
>     return res;
> }
> and I contributed:
> function srep(n, s, h){ # *rec*
>     if( n == 0 ) return "";
>     h = srep( int(n/2), s )
>     return n % 2 == 1 ? h h s : h h;
> }
>
> I have used this code together with /usr/bin/time:
> BEGIN {
>     for( j = 1; j <= 300000; ++j ){
>         srep( j%1000, "a" );
>         srep( j%1000, "abcde" );
>     }
> }
> The results for the four versions:
> *rec*  0m1.436s
> *dbl*  0m2.322s
> *rpt*  0m13.543s
> *sub*  0m27.290s
>
> Note 1: Version *sub* has a defect: an "&", or some combination with "\",
> in the replacement string is not handled correctly. I have read section
> 9.1.3.1, *More about ‘\’ and ‘&’ with sub(), gsub(), and gensub()*, of the
> GUM and, although it didn't cause me a headache, it made me gawk. I did not
> try to cook *sub*.
>
> Note 2: I have provoked the aforementioned failure in *dbl*, resulting in
> the somewhat laconic error message:
> $ gawk -f srepDoubl.awk
> Killed
> See the bug list for my comment on this message.
>
> Cheers
> Wolfgang
>
>