avr-libc-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [avr-libc-dev] Printing octal (brain dump)


From: Georg-Johann Lay
Subject: Re: [avr-libc-dev] Printing octal (brain dump)
Date: Wed, 21 Dec 2016 15:17:59 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

On 19.12.2016 23:05, George Spelvin wrote:
Georg-Johann Lay wrote:
George Spelvin schrieb:
It is unfortunately one instruction longer than ultoa_invert:

   text    data     bss     dec     hex filename
    188       0       0     188      bc ultoa_invert.o
    190       0       0     190      be itoa_engine.o
    208       0       0     208      d0 itoa_engine.o (+OPTIMIZE_SPEED)
    212       0       0     212      d4 itoa_engine.o (without HAVE_MUL)

hmmm, that's quite some increase in code size...

Um... just how serious are you being?  I didn't think two bytes was
"quite some increase".  But compared to your alternatives... sigh,
I may have been wasting my time. :-(

oh, I was just misreading this table and thought that it means yet
another 200 bytes atop of ultoa_invert (just demonstrating that
it isn't worse than ultoa_invert).

But it appears you are intending to drop ultoa_invert which is great!

IMO octal conversion is so remote from any practical purposes that we
should use an algorithm that works for all radices.

Well, I sort of did that.  I did one that works for all power-of-two
sizes, and used it for both hex and octal.

Hex alone can be done much faster, but it isn't bad with the more
generic code.

...but you don't mean limited to powers of 2.  Just plain old binary long
division.  As I said when I started this, it's hard to escape certain
optimization habits.

This can be done in less than 70 bytes (variable length / radix
algorithm that operates on memory, no need for MUL, needs strrev.

H'm... let me have a poke at that.

The actual code changes are at the end, but I found 90 cycles
in one place, and 91 cycles + 1 instruction in another.

With those changes, here's a timing table, for values of the from 2^k - 1:

        Decimal         Hex             mod100
Input   mem_toa itoa    mem_toa itoa    mem_toa
 8 bit   293     217     194      98     194
16 bit   700     451     527     187     440
24 bit  1342     838    1008     276     748
32 bit  2119    1167    1637     365    1142
40 bit  3143    1649    2414     454    1697
48 bit  4290    2127    3339     543    2301
56 bit  5585    2733    4412     632    2991
64 bit  7115    3420    5633     721    3743

Really funny results (just noticed you are using TABs and therefore
the whole table is rigged ;-))

That's really attractive for decimal.  Also, for not much extra code,
we could do the long division in base 100 (timing for only this part is
above), and then split that into 2 digits.

For hex, it's a bit painful.

So all is boiling down to the question what execution time is
acceptable or not:

Is 8000 ticks too slow?

Is 3000 ticks acceptable? And for what reason? Are 3000 acceptable just
because we have an algorithm that performs in 3000 ticks?

The answer, of course, is the same as to "Is 190 bytes too big?":
It Depends On The Application.

And as library writers, we don't *know* the application.  For many
applications, an AVR is too slow and they need an ARM or a DSP.

"Best" is somewhere on the convex hull of the space/time plot, but I
can't say where.

I was designing a helper function for printf, and that's a certain size:

   text    data     bss     dec     hex filename
   1648       0       0    1648     670 ./avr/lib/avr4/vfprintf_flt.o
    948       0       0     948     3b4 ./avr/lib/avr4/vfprintf_std.o
    505       0       0     505     1f9 ./avr/lib/avr4/vfprintf_min.o

That, and the initial size of __ultoa_invert, kind of set the size scale
of what I was aiming for.

I have a faster algorithm that uses a 100-byte binary-to-BCD lookup table
to do everything two digits at a time.  Some applications might want that.

My strong preference is still to have a one-fits-all algorithm that
might very well be slower than an optimal one.  But hey, an ordinary
division of a 64-bit value by 10 already costs 2300 cycles, so why
should we hunt cycles just for printf...?

There is that.  But an intelligent programmer might only use 64 bits
for a final accumulation (of, say, 32x32-bit products), avoiding
even 64-bit multiplies.

Often programmers are bitten by their smartness when they observe
that avr-gcc generates "some" extra instructions around the core
64-bit arithmetic.  But that's a different story...

I played around some more; attached is a version that operates on
memory, variable length and radices, and doesn't need MUL. It's
called mem_toa and needs < 70 bytes / 8000 ticks for decimal.
The signed version adds yet another 40 bytes:

I made the following changes:

Since you already have pointers to the beginning and end of the string,
which is what the first half of strrev finds, a slight rearrangement of
that code (which saves space and 1 cycle per byte at the expense of one
extra loop iteration for odd-length inputs) would let you jump to the
middle of it.

Copying that code to the end of this to make my benchmarking
simpler, I get 7296 cycles to print 2^64-1.

IMO including the time of strrev doesn't add much to the execution
times; and if strrev is invoked it actually adds to the conversion
times anyway.


One speedup that leaps to mind is that you can negate the radix and
do the trial subtraction using an add.  That will produce the quotient
with the right polarity.  Moves one instruction out of the middle loop
to the setup code.

That saves 90 cycles, taking it to 7206.

Just a small speed-up, but really cool idea :-)

Another possibility is some bit-shifting tricks.  It's of limited use here
because it messes up your trick of using Num for both the input number and
the output quotient.  But consider the following, with "nBits" renamed to
"Quot".  It takes another cycle out of the second loop, saving
another 91 cycles (7115).

It starts "Quot" out equal to 1, and detect the completion of 8 bit
shifts when that bit ends up in the carry flag.  So "rol Quot" replaces
"dec nBits", and thereby avoids the need for the second "rol Num" just
before the store.

DEFUN mem_toa
        wmov    pBuf, 24
        wmov    pNum, 22
        neg     Radix

.LnextDigit:
        add     pNum, nBytes
        adc     pNum+1, __zero_reg__
        mov     Count, nBytes
        ;; nBytes < 0  <==>  nBytes is unknown
        ldi     nBytes, -1
        clr     Rem

.LnextByte:
        ld      Num, -X
        ldi     Quot, 1

.LnextBit:
        ;; Bit-wise construct quotient into Quot.
        ;; Bit-wise reconstruct Num into Rem.
        lsl     Num
        rol     Rem

        ;; Reducing Rem mod Radix;  C = 1  <==>  reduction applied
        add     Rem, Radix
        brcs    1f
        sub     Rem, Radix
1:
        rol     Quot
        brcc    .LnextBit

        st      X, Quot

        breq    2f
        sbrc    nBytes, 7
        mov     nBytes, Count
2:
        dec     Count
        brne    .LnextByte

        subi    Rem, -'0'
        cpi     Rem, '9' + 1
        brlt    3f
        subi    Rem, '9' + 1 - 'a'
3:
        st      Z+, Rem

        sbrs    nBytes, 7
        rjmp    .LnextDigit

        st      Z, __zero_reg__

        ;; pBuf survived in R25:R24
        # XJMP  strrev
        wmov    pNum, 24
2:      ld      Num, X
        ld      Count, -Z
        st      X+, Count
        st      Z, Num
        /* Loop while X < Z */
3:      cp      XL, ZL
        cpc     XH, ZH
        brlo    2b

        ret
ENDF mem_toa





reply via email to

[Prev in Thread] Current Thread [Next in Thread]