Re: [avr-libc-dev] Printing octal (brain dump)
From: Georg-Johann Lay
Subject: Re: [avr-libc-dev] Printing octal (brain dump)
Date: Wed, 21 Dec 2016 15:17:59 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
On 19.12.2016 23:05, George Spelvin wrote:
Georg-Johann Lay wrote:
George Spelvin schrieb:
It is unfortunately one instruction longer than ultoa_invert:
   text    data     bss     dec     hex  filename
    188       0       0     188      bc  ultoa_invert.o
    190       0       0     190      be  itoa_engine.o
    208       0       0     208      d0  itoa_engine.o  (+OPTIMIZE_SPEED)
    212       0       0     212      d4  itoa_engine.o  (without HAVE_MUL)
hmmm, that's quite some increase in code size...
Um... just how serious are you being? I didn't think two bytes was
"quite some increase". But compared to your alternatives... sigh,
I may have been wasting my time. :-(
oh, I was just misreading the table and thought it meant yet
another 200 bytes on top of ultoa_invert (when it just demonstrates
that it isn't worse than ultoa_invert).
But it appears you are intending to drop ultoa_invert which is great!
IMO octal conversion is so remote from any practical purposes that we
should use an algorithm that works for all radices.
Well, I sort of did that. I did one that works for all power-of-two
sizes, and used it for both hex and octal.
Hex alone can be done much faster, but it isn't bad with the more
generic code.
...but you don't mean limited to powers of 2. Just plain old binary long
division. As I said when I started this, it's hard to escape certain
optimization habits.
This can be done in less than 70 bytes (variable length / radix
algorithm that operates on memory, no need for MUL, needs strrev).
H'm... let me have a poke at that.
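For reference, here is roughly what that variable-radix, bit-at-a-time long division looks like when modeled in C. This is my own sketch with hypothetical names (`div_inplace`, `mem_toa_c`), not the attached assembly; it divides a little-endian multi-byte buffer in place, one digit per pass, with no multiply.

```c
#include <stdint.h>
#include <stddef.h>

/* Divide an n-byte little-endian value in place by radix, returning
 * the remainder.  Shift-and-trial-subtract; no MUL needed. */
static uint8_t div_inplace(uint8_t *num, size_t n, uint8_t radix)
{
    uint16_t rem = 0;
    for (size_t i = n; i-- > 0; ) {          /* most significant byte first */
        uint8_t q = 0;
        for (int bit = 7; bit >= 0; bit--) {
            rem = (uint16_t)((rem << 1) | ((num[i] >> bit) & 1));
            q <<= 1;
            if (rem >= radix) {              /* trial subtraction */
                rem -= radix;
                q |= 1;
            }
        }
        num[i] = q;                          /* quotient replaces input */
    }
    return (uint8_t)rem;
}

/* Full conversion: digits come out least significant first, so the
 * result is reversed at the end (the strrev step). */
static void mem_toa_c(uint8_t *num, size_t n, uint8_t radix, char *buf)
{
    char tmp[72];                            /* enough for 64-bit binary */
    size_t len = 0;
    do {
        uint8_t d = div_inplace(num, n, radix);
        tmp[len++] = (char)(d < 10 ? '0' + d : 'a' + d - 10);
        while (n > 1 && num[n - 1] == 0)     /* shrink as bytes reach 0 */
            n--;
    } while (n > 1 || num[0] != 0);
    for (size_t i = 0; i < len; i++)         /* reverse into buf */
        buf[i] = tmp[len - 1 - i];
    buf[len] = '\0';
}
```

The shrinking of `n` as high bytes become zero mirrors the variable-length bookkeeping (`nBytes`/`Count`) in the assembly.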
The actual code changes are at the end, but I found 90 cycles
in one place, and 91 cycles + 1 instruction in another.
With those changes, here's a timing table, for values of the form 2^k - 1:
                 Decimal            Hex           mod100
Input      mem_toa   itoa    mem_toa   itoa     mem_toa
 8 bit         293    217        194     98         194
16 bit         700    451        527    187         440
24 bit        1342    838       1008    276         748
32 bit        2119   1167       1637    365        1142
40 bit        3143   1649       2414    454        1697
48 bit        4290   2127       3339    543        2301
56 bit        5585   2733       4412    632        2991
64 bit        7115   3420       5633    721        3743
Really funny results (just noticed you are using TABs and therefore
the whole table is rigged ;-))
That's really attractive for decimal. Also, for not much extra code,
we could do the long division in base 100 (timing for only this part is
above), and then split that into 2 digits.
For hex, it's a bit painful.
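The base-100 idea could be sketched in C like this. This is my own illustration on a plain `uint32_t` (the real code would run the long division over the multi-byte buffer instead): each division pass yields a remainder in 0..99, which splits into two decimal digits, roughly halving the number of expensive passes.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: convert v to decimal via base-100 division. */
static void utoa_base100(uint32_t v, char *buf)
{
    char tmp[12];
    size_t len = 0;
    do {
        uint8_t r = (uint8_t)(v % 100);      /* one division ...      */
        v /= 100;
        tmp[len++] = (char)('0' + r % 10);   /* ... two digits, least */
        tmp[len++] = (char)('0' + r / 10);   /* significant first     */
    } while (v);
    if (len > 1 && tmp[len - 1] == '0')      /* drop a leading zero   */
        len--;
    for (size_t i = 0; i < len; i++)         /* reverse into buf      */
        buf[i] = tmp[len - 1 - i];
    buf[len] = '\0';
}
```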
So it all boils down to the question of what execution time is
acceptable:
Is 8000 ticks too slow?
Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
because we have an algorithm that performs in 3000 ticks?
The answer, of course, is the same as to "Is 190 bytes too big?":
It Depends On The Application.
And as library writers, we don't *know* the application. For many
applications, an AVR is too slow and they need an ARM or a DSP.
"Best" is somewhere on the convex hull of the space/time plot, but I
can't say where.
I was designing a helper function for printf, and that's a certain size:
   text    data     bss     dec     hex  filename
   1648       0       0    1648     670  ./avr/lib/avr4/vfprintf_flt.o
    948       0       0     948     3b4  ./avr/lib/avr4/vfprintf_std.o
    505       0       0     505     1f9  ./avr/lib/avr4/vfprintf_min.o
That, and the initial size of __ultoa_invert, kind of set the size scale
of what I was aiming for.
I have a faster algorithm that uses a 100-byte binary-to-BCD lookup table
to do everything two digits at a time. Some applications might want that.
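One plausible shape for such a table (an assumption on my part, not the code in question): `bcd_tab[i]` stores i in packed BCD, so a single lookup turns a 0..99 remainder into two digits at once. On an AVR the table would live in flash (`const ... PROGMEM`); here it is built at run time to keep the sketch portable.

```c
#include <stdint.h>

/* Hypothetical 100-byte binary-to-BCD table: bcd_tab[i] = i as
 * packed BCD (tens nibble high, ones nibble low). */
static uint8_t bcd_tab[100];

static void bcd_init(void)
{
    for (int i = 0; i < 100; i++)
        bcd_tab[i] = (uint8_t)(((i / 10) << 4) | (i % 10));
}

/* Emit the two ASCII digits for a remainder r in 0..99. */
static void put2(uint8_t r, char out[2])
{
    uint8_t b = bcd_tab[r];
    out[0] = (char)('0' + (b >> 4));    /* tens */
    out[1] = (char)('0' + (b & 0x0F));  /* ones */
}
```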
My strong preference is still to have a one-fits-all algorithm that
might very well be slower than an optimal one. But hey, an ordinary
division of a 64-bit value by 10 already costs 2300 cycles, so why
should we hunt cycles just for printf...?
There is that. But an intelligent programmer might only use 64 bits
for a final accumulation (of, say, 32x32-bit products), avoiding
even 64-bit multiplies.
Often programmers are bitten by their smartness when they observe
that avr-gcc generates "some" extra instructions around the core
64-bit arithmetic. But that's a different story...
I played around some more; attached is a version that operates on
memory, variable length and radices, and doesn't need MUL. It's
called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
The signed version adds yet another 40 bytes:
I made the following changes:
Since you already have pointers to the beginning and end of the string,
which is what the first half of strrev finds, a slight rearrangement of
that code (which saves space and 1 cycle per byte at the expense of one
extra loop iteration for odd-length inputs) would let you jump to the
middle of it.
Copying that code to the end of this to make my benchmarking
simpler, I get 7296 cycles to print 2^64-1.
IMO including the time of strrev doesn't add much to the execution
times; and if strrev is invoked it actually adds to the conversion
times anyway.
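A C model of the rearranged reversal: since the converter already holds pointers to both ends of the digit string, the swap loop can be entered directly, skipping strrev's find-the-end scan. The do-while matches that structure: an odd-length string swaps its middle byte with itself once, the extra iteration mentioned above. (My own sketch; assumes a non-empty string, which a conversion always produces.)

```c
/* In-place reversal given both endpoints; hi = one past the end. */
static void reverse_span(char *lo, char *hi)
{
    do {
        char t = *lo;
        *lo++ = *--hi;
        *hi = t;
    } while (lo < hi);
}
```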
One speedup that leaps to mind is that you can negate the radix and
do the trial subtraction using an add. That will produce the quotient
with the right polarity. Moves one instruction out of the middle loop
to the setup code.
That saves 90 cycles, taking it to 7206.
Just a small speed-up, but really cool idea :-)
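In C terms the negated-radix trick looks like this (my own model with hypothetical names): negate the divisor once in setup, and the per-bit trial subtraction becomes an add whose carry-out is already the quotient bit with the right polarity, so one instruction moves out of the inner loop.

```c
#include <stdint.h>

/* Divide one byte (with incoming remainder) by radix, bit-serially.
 * nrad = -radix is computed once; "r + nrad >= 0x100" models the AVR
 * carry flag after "add Rem, Radix" with Radix pre-negated. */
static uint8_t div_byte_negradix(uint8_t num, uint8_t radix,
                                 uint8_t rem_in, uint8_t *quot)
{
    uint8_t nrad = (uint8_t)-radix;        /* setup: negate once */
    uint8_t r = rem_in, q = 0;
    for (int bit = 7; bit >= 0; bit--) {
        r = (uint8_t)((r << 1) | ((num >> bit) & 1));
        unsigned sum = (unsigned)r + nrad; /* "add Rem, Radix" */
        q <<= 1;
        if (sum & 0x100) {                 /* carry set: keep reduction */
            r = (uint8_t)sum;              /* i.e. r - radix */
            q |= 1;                        /* carry = quotient bit */
        }                                  /* else: asm restores via sub */
    }
    *quot = q;
    return r;
}
```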
Another possibility is some bit-shifting tricks. It's of limited use here
because it messes up your trick of using Num for both the input number and
the output quotient. But consider the following, with "nBits" renamed to
"Quot". It takes another cycle out of the second loop, saving
another 91 cycles (7115).
It starts "Quot" out equal to 1, and detects the completion of 8 bit
shifts when that bit ends up in the carry flag. So "rol Quot" replaces
"dec nBits", and thereby avoids the need for the second "rol Num" just
before the store.
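The sentinel-bit idea, modeled in C (my sketch, with a 16-bit temporary standing in for the carry flag): Quot starts at 1; each "rol" shifts a quotient bit in at the bottom, and after exactly eight shifts the seed bit falls off the top, ending the loop without a separate "dec nBits" counter.

```c
#include <stdint.h>

/* Bit-serial division of one byte, loop terminated by the sentinel
 * bit seeded into quot rather than a down-counter. */
static uint8_t quot_with_sentinel(uint8_t num, uint8_t radix, uint8_t *rem)
{
    uint8_t r = *rem;
    uint16_t quot = 1;                          /* the sentinel bit */
    do {
        r = (uint8_t)((r << 1) | (num >> 7));   /* lsl Num; rol Rem */
        num = (uint8_t)(num << 1);
        uint8_t bit = (r >= radix);             /* trial reduction */
        if (bit)
            r = (uint8_t)(r - radix);
        quot = (uint16_t)((quot << 1) | bit);   /* rol Quot */
    } while (!(quot & 0x100));                  /* sentinel in "carry"? */
    *rem = r;
    return (uint8_t)quot;                       /* low 8 quotient bits */
}
```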
DEFUN mem_toa
    wmov    pBuf, 24
    wmov    pNum, 22
    neg     Radix
.LnextDigit:
    add     pNum, nBytes
    adc     pNum+1, __zero_reg__
    mov     Count, nBytes
    ;; nBytes < 0 <==> nBytes is unknown
    ldi     nBytes, -1
    clr     Rem
.LnextByte:
    ld      Num, -X
    ldi     Quot, 1
.LnextBit:
    ;; Bit-wise construct quotient into Quot.
    ;; Bit-wise reconstruct Num into Rem.
    lsl     Num
    rol     Rem
    ;; Reduce Rem mod Radix; C = 1 <==> reduction applied
    add     Rem, Radix
    brcs    1f
    sub     Rem, Radix
1:
    rol     Quot
    brcc    .LnextBit
    st      X, Quot
    breq    2f
    sbrc    nBytes, 7
    mov     nBytes, Count
2:
    dec     Count
    brne    .LnextByte
    subi    Rem, -'0'
    cpi     Rem, '9' + 1
    brlt    3f
    subi    Rem, '9' + 1 - 'a'
3:
    st      Z+, Rem
    sbrs    nBytes, 7
    rjmp    .LnextDigit
    st      Z, __zero_reg__
    ;; pBuf survived in R25:R24
    # XJMP  strrev
    wmov    pNum, 24
2:  ld      Num, X
    ld      Count, -Z
    st      X+, Count
    st      Z, Num
    /* Loop while X < Z */
3:  cp      XL, ZL
    cpc     XH, ZH
    brlo    2b
    ret
ENDF mem_toa