avr-libc-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [avr-libc-dev] Printing octal (brain dump)


From: George Spelvin
Subject: Re: [avr-libc-dev] Printing octal (brain dump)
Date: 21 Dec 2016 18:46:17 -0500

Georg-Johann Lay wrote:
> oh, I was just misreading this table and thought that it means yet
> another 200 bytes atop of ultoa_invert (just demonstrating that
> it isn't worse than ultoa_invert).
>
> But it appears you are intending to drop ultoa_invert which is great!

*Whew*!  No, I'm comparing it because it's a *replacement* for
ultoa_invert.o.  I wanted smaller *and* faster *and* arbitrary-length.

I was feeling very deflated by your complaints.

> Really funny results (just noticed you are using TABs and therefore
> the whole table is rigged ;-))

Here is the updated table (which consistently omits the string reverse)
in a form that can take one level of quoting:

Input   Decimal                         Hex
bits    mem_toa mem_tod itoa    nibbles mem_toa itoa
 8       269     220     217     141     194      98
16       664     462     451     321     527     187
24      1294     783     838     608    1008     276
32      2059    1219    1167     948    1637     365
40      3059    1775    1649    1395    2414     454
48      4194    2373    2127    1895    3339     543
56      5477    3069    2733    2459    4412     632
64      6995    3822    3420    3130    5633     721

For binary bases, yu're seeing here the difference between an O(n)
algorithm (89n+9 cycles) and an O(n^2) one (74*n^2 + 111*n + 9).
I'm really not happy with that.

For decimal, however mem_tod without a multiplier is almost fast as my
decimal code *with* one, and is 80 bytes long.  That's the code I'd like
to use on multiplierless machines.  mem_toa is 64 bytes, but the almost
2x speed difference is worth the 16 bytes, IMHO.

Your u64toa_nibbles code (after I tweaked it a bit) is 90 bytes and the
fastest of all.  There, it's only about 75% the time of the multiplierless
code, so whether the speed is worth it is more of a question.

One thing that justfies it to me is that the enhanced cores tend
to come with more flash, so an additional 10 bytes is affordable.
Also contributing to eh afforadility is that the enhanced cores tend to
generate smaller code overall.

On the other hand, maintaining two completely different code paths is
a bother.  There's a lot to be said for just one.


One alternative I mentioned earlier, which I'm thinking seriously about,
is to reorganize the code into two phases:

1) Convert decimal and octal to little-endian BCD.
   (%x would just find the length.)
2) Print little-endian hex as ASCII.

That would enlarge the code somewhat, but reduce stack usage by 11 bytes.
As I noted, ROM:RAM is usually 16:1 so I could argue that those 16 bytes
of RAM are "worth" 176 bytes of code, which is far more than code size
increase.

Note that mem_tod produces digits two at a time anyway, so it's a
natural fit.  I have some ideas for how to adapt the u64toa_nibbles code,
and octal wouldn't be too hard.

I'd really appreciate your opinion of the idea.

> Often programmers are bitten by their smartness when they observe
> that avr-gcc generates "some" extra instructions around the core
> 64-bit arithmetic.  But that's a different story...

I don't quite know what you're alluding to.

What frustrates me abut avr-gcc is code like time/gm_sidereal.c,
where it's doing a 32x32->64 bit multiply, but if !__AVR_HAVE_MUL__,
then gcc "knows" that there's no __umulsidi3 function, and generates an
absolutely massive spill & fill sequence to call __muldi3, which ends
up being a lot bigger & slower than a call to the __umulsidi3 wrapper
which totally exists.

>> That saves 90 cycles, taking it to 7206.
>
> Just a small speed-up, but really cool idea :-)

You had some awfully cool ideas yourself, particularly integrating the
length-finding into the main loop.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]