Re: [avr-libc-dev] Even faster decimal code


From: Martin McKee
Subject: Re: [avr-libc-dev] Even faster decimal code
Date: Wed, 28 Dec 2016 09:42:04 -0700

I find all this fascinating, but I'm really not the one to be commenting on
what the best approach here is.  I will say, however, that in many of my
applications, I would be more likely to choose a speed increase over reduced
memory.  I tend to live with mostly compute-bound, control applications
though.

I've not had enough time to look at the code to make fine-grained
suggestions (and I'm out of practice with AVR ASM), but I've not felt that
the comments were too bad.  I've been able to follow the code as written
easily enough.

Cheers,
Martin Jay McKee

On Sat, Dec 24, 2016 at 10:59 AM, George Spelvin <address@hidden>
wrote:

> Georg-Johann Lay wrote:
> > George Spelvin schrieb:
> >> So now that we have several good candidates, how to proceed?
> >> What size/speed tradeoff should be the final choice?
>
> > After all it's you who will provide the final implementation and
> > testing, hence the final decision of what's appropriate, how much
> > effort will be put into the implementation, and what the final code
> > will look like is your decision, IMO.
>
> Well, thank you very much, but after your "that's quite some size
> increase" e-mail (and showing me better code than I'd been working on
> for a couple of weeks), I'm feeling rather less confident.
>
> (And, despite my asking, nobody's expressed any opinion at all about
> my "save RAM by using BCD" suggestion.  Brilliant or crackpot?)
>
> > We only have multilib granularity, and there are not so many features
> > that are related to the flash size.  One is __AVR_HAVE_JMP_CALL__ which
> > applies to devices with >= 16 KiB flash.  The next size milestone is
> > __AVR_HAVE_ELPM__ which means >= 128 KiB.  The JMP + CALL looks
> > reasonable to me; I used it for 64-bit divisions in libgcc (which leads
> > to the funny situation that 64-bit division might run faster than a
> > 32-bit division for the same values).
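
As a rough illustration of selecting an implementation by the feature macros mentioned above, the following C sketch picks a variant at compile time. __AVR_HAVE_JMP_CALL__ and __AVR_HAVE_MUL__ are built-in avr-gcc macros; the function name and the variant labels are hypothetical and are not taken from avr-libc.

```c
/* Hypothetical helper: reports which decimal-conversion variant a
 * build would select.  The labels are illustrative only. */
const char *decimal_variant(void)
{
#if defined(__AVR_HAVE_JMP_CALL__) && defined(__AVR_HAVE_MUL__)
    /* Larger flash and a hardware multiplier: spend bytes on speed. */
    return "fast, MUL-based";
#elif defined(__AVR_HAVE_JMP_CALL__)
    /* Larger flash but no MUL: a faster multiplierless variant. */
    return "fast, multiplierless";
#else
    /* Small devices: prefer the smallest code. */
    return "small";
#endif
}
```

When compiled for a target that defines neither macro, the last branch is taken.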
>
> Interesting suggestion.  I could just use the multiplierless base-100 code,
> which is smaller and still reasonably fast.
>
> And thank you very much!  I knew that HAVE_JMP_CALL meant that RJMP/RCALL
> range wasn't enough, which means more than 12 bits of PC (2^13 bytes of
> flash), but it had gotten lost in the forest of confusion.
>
> I'm befuddled by all of the different architecture options and don't
> understand the difference between most of them.  I've been slowly
> downloading data sheets for different examples from gcc's list and
> looking for differences, but it's a laborious process.  (That document
> on avr-tiny started out with me documenting my realization that avr1
> was something else.)
>
> For example, does MUL support imply MOVW support?  (I've been assuming
> so, but that's an easy edit.)
>
> And what's the relationship between MOVW support and ADIW/SBIW?  Are they
> the same feature, or are there processors with one and not the other?
>
>
> (For aggressive size squeezing, I've realized that a lot of code is wasted
> copying pointer return values from X or Z to r24:r25, only to have the
> caller copy them right back to use the pointers.  It would be lovely
> if there were a way to tell gcc "expect the return value for this
> function in r30:r31".  And, sometimes, "this function preserves r18-r20".)
>
> > For smaller devices (e.g. no CALL but MUL) some bytes can be squeezed
> > out by moving LDI of the constants to the inner loop which saves
> > PUSH + POP.  But on smaller devices, where xprintf is already a
> > major code consumer, a programmer might prefer something like ulltoa
> > over the bloat from xprintf.
>
> Um...I see how I can swap the Hundred constant around, but Khi/Klo are
> both used twice each, so loading them twice would not save as much.
> (If I return the end pointer rather than returning r24:r25 to the end
> for a call to strrev that's not in the current code, that avoids two
> more push/pop anyway.)
>
> The way I have the multiply organized, I have to do the two middle
> partial products first, then the low, then the high.  I can swap the
> middle ones around, but I can't make both constants' uses adjacent.
>
> >> #define Q2   r23     /* Byte 2 of 4-byte product */
> >> #define Q1   r22     /* Byte 1 (and 3) of 4-byte product */
>
> > Maybe it's a bit easier to grasp if
> >
> > #define Q3    Q1
> >
> > and then use Q3 where byte #3 is computed (starting with "clr Q1")
>
> I thought about that, but remembering that two names refer to the same
> register (and thus you may not rearrange code to overlap their usage)
> is also a pain.  I originally called them "Qeven" and "Qodd".
>
> Maybe I can just improve the comments...
>
> >>      /* Multiply Rem:Num by Khi:Klo */
> >>      mul     Num, Khi
> >>      mov     Q1, r0
> >>      mov     Q2, r1
> >
> > Can use "movw Q1, r0"
>
> Ooh, nice!  I forgot that movw isn't limited to high registers.
>
> Let me try to improve the comments... do you still think this would
> be better with Q3?  (It dawned on me that even if the product *isn't*
> guaranteed to not overflow, the structure can compute the high half of
> a 32-bit product in only two accumulator registers if we add one more
> "ADC Q1,Q1".)
>
> >>      mul     Rem, Klo
> >>      add     Q1, r0
> >>      adc     Q2, r1          ; Cannot overflow
> >>      mul     Num, Klo
> >>      add     Q1, r1
>         clr     Q1              ; No longer need Q1; re-use register for Q3
>         adc     Q2, Q1          ; Propagate carry to Q2
>         ;adc    Q1, Q1          ; (Omit: no carry possible due to input range)
> >>      mul     Rem, Khi
>         add     Q2, r0          ; Now byte 2 (hlo, Q2) of 32-bit product
>         adc     Q1, r1          ; Now msbyte (hhi, Q3) of 32-bit product
> >>
> >>      ; We now have the high 12 bits of the 28-bit product in Q1:Q2.
> >>      ; Shift down by 4 bits
> >>      andi    Q2, 0xf0
> >>      or      Q1, Q2
> >>      swap    Q1
> >>      ;; We now have the new quotient in "Q1".
> >>      st      Z, Q1
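
The multiply sequence quoted above can be modelled step by step in C, which makes it easier to check the carry handling and the final nibble shuffle. This is only a sketch: the thread does not show the value of Khi:Klo, so the model assumes the constant ceil(2^20 / 100) = 10486 (0x28F6) and a remainder byte below 100. The function name and the KHI/KLO macros are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: Khi:Klo = 0x28F6 = ceil(2^20 / 100); not stated in the thread. */
#define KHI 0x28u
#define KLO 0xF6u

/* Model of the quoted sequence: returns (Rem:Num) / 100 for rem < 100. */
uint8_t div100_model(uint8_t rem, uint8_t num)
{
    uint16_t p, s;
    uint8_t q1, q2, c;

    p = (uint16_t)(num * KHI);                  /* mul  Num, Khi          */
    q1 = (uint8_t)p;                            /* movw Q1, r0            */
    q2 = (uint8_t)(p >> 8);

    p = (uint16_t)(rem * KLO);                  /* mul  Rem, Klo          */
    s = (uint16_t)(q1 + (p & 0xff));            /* add  Q1, r0            */
    q1 = (uint8_t)s;
    c = (uint8_t)(s >> 8);
    q2 = (uint8_t)(q2 + (p >> 8) + c);          /* adc  Q2, r1 (no carry) */

    p = (uint16_t)(num * KLO);                  /* mul  Num, Klo          */
    s = (uint16_t)(q1 + (p >> 8));              /* add  Q1, r1            */
    c = (uint8_t)(s >> 8);                      /* only the carry is kept */
    q1 = 0;                                     /* clr  Q1 (C unchanged)  */
    q2 = (uint8_t)(q2 + q1 + c);                /* adc  Q2, Q1            */

    p = (uint16_t)(rem * KHI);                  /* mul  Rem, Khi          */
    s = (uint16_t)(q2 + (p & 0xff));            /* add  Q2, r0            */
    q2 = (uint8_t)s;
    c = (uint8_t)(s >> 8);
    q1 = (uint8_t)(q1 + (p >> 8) + c);          /* adc  Q1, r1            */

    q2 &= 0xf0;                                 /* andi Q2, 0xf0          */
    q1 |= q2;                                   /* or   Q1, Q2            */
    q1 = (uint8_t)((q1 << 4) | (q1 >> 4));      /* swap Q1                */
    return q1;                                  /* product bits 20..27    */
}
```

Under the assumed constant, the two middle partial products sum to at most 255*40 + 99*246 = 34554, which fits in 16 bits, consistent with the "cannot overflow" comments in the assembly.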
>