[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [avr-libc-dev] Even faster decimal code
From: |
Georg-Johann Lay |
Subject: |
Re: [avr-libc-dev] Even faster decimal code |
Date: |
Sat, 24 Dec 2016 12:42:42 +0100 |
User-agent: |
Thunderbird 2.0.0.24 (Windows/20100228) |
George Spelvin schrieb:
I mentioned last time trying doing something like this, and now
I have it coded up (as utoa_bytes()).
Input Decimal
bits mem_toa mem_tod itoa nibbles bytes
8 269 220 217 141 143
16 664 462 451 321 273
24 1294 783 838 608 432
32 2059 1219 1167 948 666
40 3059 1775 1649 1395 941
48 4194 2373 2127 1895 1217
56 5477 3069 2733 2459 1551
64 6995 3822 3420 3130 1902
It's larger (120 bytes, vs, 90 for the nibbles code), and has a bit more
startup overhead, but is quite fast. I also have a 122-byte variant
which saves 1 cycle per pair of digits. (1895 cycles for 2^64-1.)
I'm interested in this both for speed, and because it's a good fit to my
suggestion of saving RAM space buffering the converted digits in BCD.
So now that we have several good candidates, how to proceed?
What size/speed tradeoff should be the final choice?
After all it's you who will provide the final implementation and
testing, hence the final decision of what's appropriate, how much
effort will be put into the implementation, and what the final code
will look like is your decision, IMO.
Personally, I'm not worried about 30 bytes of code on enhanced AVRs with
a multiplier, and although I haven't coded up the combined algorithm, I
believe the whole thing is smaller than the ultoa_invert code it replaces.
We only have multilib granularity, and there are not so many features
that are related to the flash size. One is __AVR_HAVE_JMP_CALL__ which
applies to devices with >= 16 KiB flash. The next size milestone is
__AVR_HAVE_ELPM__ which means >= 128 KiB. The JMP + CALL looks
reasonable to me; I used it for 64-bit divisions in libgcc (which leads
to the funny situation that 64-bit division might run faster than a
32-bit division for the same values).
But this is a judgement call, and I'd appreciate some other voices.
For smaller devices (e.g. no CALL but MUL some bytes can be squeezed
out by moving LDI of the constants to the inner loop which saves
PUSH + POP. But on smaller devices, where xprintf is already a
major code consumer, a programmer might prefer something like ulltoa
over the bloat from xprintf.
#define __zero_reg__ r1
#define __tmp_reg__ r0
.macro DEFUN name
.section .text.\name,"ax",@progbits
.global \name
.func \name
\name:
.endm
.macro ENDF name
.size \name, .-\name
.endfunc
.endm
#if defined (__AVR_HAVE_JMP_CALL__)
#define XCALL call
#define XJMP jmp
#else
#define XCALL rcall
#define XJMP rjmp
#endif
.macro wmov r_dest, r_src
#if defined (__AVR_HAVE_MOVW__)
movw \r_dest, \r_src
#else
mov \r_dest, \r_src
mov \r_dest+1, \r_src+1
#endif
.endm
.text
[...u64toa_nibbles...]
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; This works simularly, but in base 100. Every step, we take a
;; remainder <= 99, multiply by 256, and add the next byte (<=255).
;; The result is in the range 0 <= x < 25600.
;;
;; To divide by 100 with multiply:
;; x * 0x29 >> 12 == x/100 for 0 <= x < 1099
;; x * 0x51f >> 17 == x/100 for 0 <= x < 4699
;; x * 0x147b >> 19 == x/100 for 0 <= x < 43699
;; x * 0x28f6 >> 20 == x/100 for 0 <= x < 43699
;;
;; Using a multiplier one bit larger than strictly necessary gives us
;; a more convenient shift amount. This is still small enough that we
;; can compute the high bits of the product we need in only two registers.
;;
;; The largest possible x = 0x63ff also has the largest possible
;; lsbyte, so it will give us the largest possible partial products.
;; Multiplying that by 0x28f6 requires four partial products:
;;
;; FF * F6 = F50A
;; FF * 28-- = 27D8--
;; 63-- * F6 = 5F22--
;; 63-- * 28-- = F78----
;;
;; The important thing is that the sum of the first three partial products
;; is 0x87ef0a, a 24-bit number. So there is no need to propagate
;; carries into the msbyte. We immediately discard the lsbyte of the
;; first partial product, and sum the others into a 2-byte register.
;; Then we throw away the lsbyte of that register and use it for the msbyte
;; of the final sum.
;; extern void utoa_bytes(char *pBuf, void *pNum, uint8_t Length)
;; Length in [1, 127]
#define ZH r31
#define ZL r30
#define XH r27
#define XL r26
#define Q2 r23 /* Byte 2 of 4-byte product */
> #define Q1 r22 /* Byte 1 (and 3) of 4-byte product */
Maybe it's a bit easier to grasp if
#define Q3 Q1
and then use Q3 where byte #3 is computed (starting with "clr Q1")
#define Count r21
#define Length r20 /* Argument */
#define Rem r19
#define Hundred r18 /* Constant 0x64 = 100 */
#define Khi r17 /* Constant 0x28 = 40 */
#define Klo r16 /* Constant 0xf6 = 246 */
#define Num r15
DEFUN utoa_bytes
movw XL, r24 ; pBuf (output) to X
movw ZL, r22 ; pNum (input) to Z
ldi Hundred, 100
push Khi
ldi Khi, 0x28
push Klo
ldi Klo, 0xf6
push Num
clr Count
.LoopDigits2:
add ZL, Length
adc ZH, Count
mov Count, Length
ldi Length, -1
clr Rem
.LoopBytes2:
ld Num, -Z
/* Multiply Rem:Num by Khi:Klo */
mul Num, Khi
mov Q1, r0
mov Q2, r1
Can use "wmov Q1, r0"
mul Rem, Klo
add Q1, r0
adc Q2, r1 ; Cannot overflow
mul Num, Klo
add Q1, r1
clr Q1 ; We only need the carry bit out
adc Q2, Q1 ; This cannot overflow, either
mul Rem, Khi
add Q2, r0
adc Q1, r1
; We now have the high 12 bits of the 28-bit product in Q1:Q2.
; Shift down by 4 bits
andi Q2, 0xf0
or Q1, Q2
swap Q1
;; We now have the new quotient in "Q1".
st Z, Q1
;; If Q1 == 0, don't set length (and multiply is redundant, too)
breq 1f
sbrc Length, 7 ;; Do we have a length yet?
mov Length, Count ;; Remember position of highest non-zero Q
mul Q1, Hundred
sub Num, r0 ; New remainder
1:
mov Rem, Num
dec Count
brne .LoopBytes2
;; We now have the final base-100 remiander in Rem.
;; Break it apart into two base-10 digits in Q1:Rem
;; (We could use the multiplier again, but that woud require
;; even more registers for the constants and more instructions,
;; and this isn't the inner loop.)
ldi Q1, '0'
3: subi Q1, -3
subi Rem, 30
brcc 3b
4: dec Q1
subi Rem, -10
brcs 4b
subi Rem, -'0'
st X+, Rem
st X+, Q1
;; Only continue if we determined Length (Length > 0)
;; Length < 0 means all bytes are zero, hence we are done.
sbrs Length, 7
rjmp .LoopDigits2
;; Chop off redundant trailing zero
clr __zero_reg__
cpi Q1, '0'
brne 5f
st -X, __zero_reg__ ; Could sbiw, but this is same speed
5: st X, __zero_reg__
; Restore registers
pop Num
pop Klo
pop Khi
movw r24, r26 ; Return end of output
ret
ENDF utoa_bytes
#undef Num
#undef Klo
#undef Khi
#undef Hundred
#undef Rem
#undef Length
#undef Count
#undef Q1
#undef Q2
#undef XL
#undef XH
#undef ZL
#undef ZH
- Re: [avr-libc-dev] Working octal code (FYI), (continued)
- Re: [avr-libc-dev] Printing octal (brain dump), George Spelvin, 2016/12/19
- Re: [avr-libc-dev] Printing octal (brain dump), Georg-Johann Lay, 2016/12/19
- Re: [avr-libc-dev] Printing octal (brain dump), Georg-Johann Lay, 2016/12/21
- Re: [avr-libc-dev] Printing octal (brain dump), George Spelvin, 2016/12/21
- [avr-libc-dev] Even faster decimal code, George Spelvin, 2016/12/23
- Re: [avr-libc-dev] Even faster decimal code,
Georg-Johann Lay <=
- Re: [avr-libc-dev] Even faster decimal code, George Spelvin, 2016/12/24
- Re: [avr-libc-dev] Even faster decimal code, Martin McKee, 2016/12/28
- Re: [avr-libc-dev] Even faster decimal code, George Spelvin, 2016/12/28
- Re: [avr-libc-dev] Even faster decimal code, Martin McKee, 2016/12/29
- Re: [avr-libc-dev] Even faster decimal code, George Spelvin, 2016/12/29
- Re: [avr-libc-dev] Printing octal (brain dump), George Spelvin, 2016/12/19
- Re: [avr-libc-dev] Printing octal (brain dump), Georg-Johann Lay, 2016/12/21
- Re: [avr-libc-dev] Printing octal (brain dump), George Spelvin, 2016/12/21