
Re: [avr-libc-dev] Even faster decimal code


From: George Spelvin
Subject: Re: [avr-libc-dev] Even faster decimal code
Date: 24 Dec 2016 12:59:34 -0500

Georg-Johann Lay wrote:
> George Spelvin schrieb:
>> So now that we have several good candidates, how to proceed?
>> What size/speed tradeoff should be the final choice?

> After all it's you who will provide the final implementation and
> testing, hence the final decision of what's appropriate, how much
> effort will be put into the implementation, and what the final code
> will look like is your decision, IMO.

Well, thank you very much, but after your "that's quite some size
increase" e-mail (and showing me better code than I'd been working on
for a couple of weeks), I'm feeling rather less confident.

(And, despite my asking, nobody's expressed any opinion at all about
my "save RAM by using BCD" suggestion.  Brilliant or crackpot?)

> We only have multilib granularity, and there are not so many features
> that are related to the flash size.  One is __AVR_HAVE_JMP_CALL__ which
> applies to devices with >= 16 KiB flash.  The next size milestone is
> __AVR_HAVE_ELPM__ which means >= 128 KiB.  The JMP + CALL looks
> reasonable to me; I used it for 64-bit divisions in libgcc (which leads
> to the funny situation that 64-bit division might run faster than a
> 32-bit division for the same values).

Interesting suggestion.  I could just use the multiplierless base-100 code,
which is smaller and still reasonably fast.
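
(For the record, the base-100 idea in C terms is roughly the sketch
below; just the digit-pair extraction, not the assembly, and on AVR the
/100 and %100 would be the multiplierless shift-and-subtract step.)

#include <stdint.h>

/* Sketch only: peel off two decimal digits per step, so half as many
 * division steps as base 10.  Digits come out low-order first; the
 * caller reverses buf..p-1 afterwards (strrev-style). */
static char *u16toa_base100(uint16_t n, char *buf)
{
    char *p = buf;
    do {
        uint8_t r = n % 100;        /* two digits at once */
        n /= 100;
        *p++ = '0' + r % 10;
        *p++ = '0' + r / 10;
    } while (n);
    if (p[-1] == '0')               /* drop a leading zero from the */
        p--;                        /* topmost digit pair           */
    *p = '\0';
    return p;                       /* end pointer */
}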

And thank you very much!  I knew that __AVR_HAVE_JMP_CALL__ meant that
RJMP/RCALL range wasn't enough, i.e. more than 12 bits of word PC (more
than 2^13 bytes of flash), but it had gotten lost in the forest of
confusion.
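
So the selection could simply hang off that macro; a minimal sketch,
with ultoa_fast/ultoa_small as made-up names for the two candidate
routines:

#ifdef __AVR_HAVE_JMP_CALL__
/* >= 16 KiB flash: the larger, faster conversion code is affordable. */
# define ultoa_engine ultoa_fast
#else
/* Smaller devices: the compact multiplierless base-100 code. */
# define ultoa_engine ultoa_small
#endif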

I'm befuddled by all of the different architecture options and don't
understand the difference between most of them.  I've been slowly
downloading data sheets for different examples from gcc's list and
looking for differences, but it's a laborious process.  (That document
on avr-tiny started out with me documenting my realization that avr1
was something else.)

For example, does MUL support imply MOVW support?  (I've been assuming
so, but that's an easy edit.)

And what's the relationship between MOVW support and ADIW/SBIW?  Are they
the same feature, or are there processors with one and not the other?


(For aggressive size squeezing, I've realized that a lot of code is wasted
copying pointer return values from X or Z to r24:r25, only to have the
caller copy them right back to use the pointers.  It would be lovely if
there were a way to tell gcc "expect the return value for this
function in r30:r31".  And, sometimes, "this function preserves r18-r20".)
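
A made-up example of that pattern, with __ultoa_rev standing in for a
routine that builds the digits in reverse and hands back the end
pointer (hypothetical name, not current code):

#include <stdint.h>

extern char *__ultoa_rev(uint32_t val, char *buf);  /* hypothetical */

void demo(uint32_t val, char *buf)
{
    /* The callee builds this pointer in X or Z, copies it to r25:r24
     * to return it, and the very first thing the caller does is load
     * it back into a pointer register to walk the string. */
    char *end = __ultoa_rev(val, buf);

    while (buf < --end) {           /* reverse buf..end-1 in place */
        char t = *buf;
        *buf++ = *end;
        *end = t;
    }
}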

> For smaller devices (e.g. no CALL but MUL some bytes can be squeezed
> out by moving LDI of the constants to the inner loop which saves
> PUSH + POP.  But on smaller devices, where xprintf is already a
> major code consumer, a programmer might prefer something like ulltoa
> over the bloat from xprintf.

Um... I see how I can swap the Hundred constant around, but Khi/Klo are
both used twice each, so loading them twice would not save as much.
(If I return the end pointer, for a call to strrev that's not in the
current code, rather than carrying r24:r25 through to the end, that
avoids two more push/pops anyway.)

The way I have the multiply organized, I have to do the two middle
partial products first, then the low, then the high.  I can swap the
middle ones around, but I can't make both constants' uses adjacent.
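
In C notation, the order I'm describing comes out roughly as below; it
assumes, like the assembly, that the middle-product sum cannot overflow
16 bits for the constants in use:

#include <stdint.h>

/* High 16 bits (bytes 2 and 3) of (Rem:Num) * (Khi:Klo). */
static uint16_t mul_high(uint8_t num, uint8_t rem, uint8_t klo, uint8_t khi)
{
    uint16_t mid;                       /* bytes 1 and 2, i.e. Q1:Q2 */

    mid  = (uint16_t)num * khi;         /* first middle product      */
    mid += (uint16_t)rem * klo;         /* second middle product     */
    mid += ((uint16_t)num * klo) >> 8;  /* carry out of the low one  */

    /* high product last, plus whatever has carried up so far */
    return (uint16_t)rem * khi + (mid >> 8);
}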

>> #define Q2   r23     /* Byte 2 of 4-byte product */
>> #define Q1   r22     /* Byte 1 (and 3) of 4-byte product */

> Maybe it's a bit easier to grasp if
>
> #define Q3    Q1
>
> and then use Q3 where byte #3 is computed (starting with "clr Q1")

I thought about that, but remembering that two names refer to the same
register (and thus you may not rearrange code to overlap their usage)
is also a pain.  I originally called them "Qeven" and "Qodd".

Maybe I can just improve the comments...

>>      /* Multiply Rem:Num by Khi:Klo */
>>      mul     Num, Khi        ; First middle partial product
>>      mov     Q1, r0
>>      mov     Q2, r1
>
> Can use "wmov Q1, r0"

Ooh, nice!  I forgot that movw isn't limited to high registers.

Let me try to improve the comments... do you still think this would
be better with Q3?  (It dawned on me that even if the product *isn't*
guaranteed to not overflow, the structure can compute the high half of
a 32-bit product in only two accumulator registers if we add one more
"ADC Q1,Q1".)

>>      mul     Rem, Klo        ; Second middle partial product
>>      add     Q1, r0
>>      adc     Q2, r1          ; Cannot overflow
>>      mul     Num, Klo        ; Low partial product
>>      add     Q1, r1          ; Only its high byte is needed
        clr     Q1              ; No longer need Q1; re-use register for Q3
        adc     Q2, Q1          ; Propagate carry to Q2
        ;adc    Q1, Q1          ; (Omit: no carry possible due to input range)
>>      mul     Rem, Khi
        add     Q2, r0          ; Now byte 2 (hlo, Q2) of 32-bit product
        adc     Q1, r1          ; Now msbyte (hhi, Q3) of 32-bit product
>> 
>>      ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
>>      ; Shift down by 4 bits
>>      andi    Q2, 0xf0
>>      or      Q1, Q2
>>      swap    Q1
>>      ;; We now have the new quotient in "Q1".
>>      st      Z, Q1


