avr-libc-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [avr-libc-dev] Faster integer division


From: Ambroz Bizjak
Subject: Re: [avr-libc-dev] Faster integer division
Date: Mon, 10 Jun 2013 02:22:14 +0200

This code is now available in a github repo:
https://github.com/ambrop72/avr-asm-ops

I've implemented the partially unrolled variants with four loops. The
32bit/32bit division done this way (div_32_32_small) takes 510 cycles
in the worst case (div_32_32_large takes 384, gcc takes 653). I think
the _small variants would be appropriate for gcc to use in the absence
of -Os. I don't know how to measure code size properly, but
instruction-wise, div_32_32_small is 62 versus gcc'c 24 instructions.

On Sat, Jun 8, 2013 at 6:25 PM, Ambroz Bizjak <address@hidden> wrote:
> Oh, sorry, a small mistake. Compacting it to a *single* loop while
> still being faster than gcc may be harder than it looks. This is
> because the part where a bit of the result is written (a single "ori"
> in my code) may explode into a full 32-bit operation (e.g. increment
> "q" if bit needs to be set, then shift left the whole "q"). But
> perhaps it could be done faster by involving memory access (keeping a
> pointer to the current byte in "q").
>
> On Sat, Jun 8, 2013 at 6:15 PM, Ambroz Bizjak <address@hidden> wrote:
>> I've done some further optimization, now it's even faster and uses 4
>> registers less (but needs >=r16 for the result due to the use of
>> immediate operands). Even better, there are now only 4 kinds of
>> iterations (4 macros). The code is here http://ideone.com/EGWcuI and
>> pasted below.
>>
>> I've calculated that my division takes 384 instructions in the worst
>> case. I've also disassembled a compiled program where gcc did the
>> division (-O4, the attached program, just changed the main func), see
>> here http://ideone.com/jZNxkF . I see that the call to __udivmodsi4
>> really hasn't been inlined, but the extra instructions are almost
>> negligible. In my calculations the gcc division takes 653 instructions
>> in the worst case (may be a bit off), excluding the call entry and
>> exit. Put simply, my code takes 58% time compared to gcc's division.
>> When I said 2/3 previously, I was referring to relative times, not
>> probabilities, but this was in a benchmark so it appeared slower due
>> to the testing code.
>>
>> The sources of speedup in my algorithm seem to be:
>> 1. Algorithm on the high level. In each iteration, gcc does four
>> 32-bit operations (add, compare, subtract, add) in the worst case. I
>> only do three (shift, compare, subtract).
>> 2. Optimization of groups of 8 iterations. This allows removing some
>> instructions where some operands are known to be zero.
>> 3. Unrolling individual iterations.
>>
>> About code size, yes, the algorithm I attached here is indeed much
>> larger. However, I believe we can compact it into four loops and still
>> be faster than gcc (points 1 and 2 above would still apply). I think
>> the result would be about twice the size of gcc algorithm. Even
>> compacting it to one loop should still be faster than gcc due to point
>> 1.
>>
>> I don't really know when a faster but larger algorithm should be used
>> by gcc, this is part of the reason why I'm posting here. The -Os flag
>> is probably one thing to look for, but we need to consider than people
>> may not want a giant division code even at other optimization levels.
>> Perhaps more specific flags could be used to control the division
>> algorithm? On the other hand, if the large algorithm goes into an
>> inline function in avr-libc, how do we make sure people can find it?
>>
>>
>> #include <stdint.h>
>>
>> #define DIVIDE_ITER_0_7(i) \
>> "    lsl %D[n]\n" \
>> "    rol __tmp_reg__\n" \
>> "    cp __tmp_reg__,%A[d]\n" \
>> "    cpc __zero_reg__,%B[d]\n" \
>> "    cpc __zero_reg__,%C[d]\n" \
>> "    cpc __zero_reg__,%D[d]\n" \
>> "    brcs zero_bit_" #i "__%=\n" \
>> "    sub __tmp_reg__,%A[d]\n" \
>> "    ori %D[q],1<<(7-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> #define DIVIDE_ITER_8_15(i) \
>> "    lsl %C[n]\n" \
>> "    rol __tmp_reg__\n" \
>> "    rol %D[n]\n" \
>> "    cp __tmp_reg__,%A[d]\n" \
>> "    cpc %D[n],%B[d]\n" \
>> "    cpc __zero_reg__,%C[d]\n" \
>> "    cpc __zero_reg__,%D[d]\n" \
>> "    brcs zero_bit_" #i "__%=\n" \
>> "    sub __tmp_reg__,%A[d]\n" \
>> "    sbc %D[n],%B[d]\n" \
>> "    ori %C[q],1<<(15-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> #define DIVIDE_ITER_16_23(i) \
>> "    lsl %B[n]\n" \
>> "    rol __tmp_reg__\n" \
>> "    rol %D[n]\n" \
>> "    rol %C[n]\n" \
>> "    cp __tmp_reg__,%A[d]\n" \
>> "    cpc %D[n],%B[d]\n" \
>> "    cpc %C[n],%C[d]\n" \
>> "    cpc __zero_reg__,%D[d]\n" \
>> "    brcs zero_bit_" #i "__%=\n" \
>> "    sub __tmp_reg__,%A[d]\n" \
>> "    sbc %D[n],%B[d]\n" \
>> "    sbc %C[n],%C[d]\n" \
>> "    ori %B[q],1<<(23-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> #define DIVIDE_ITER_24_30(i) \
>> "    lsl %A[n]\n" \
>> "    rol __tmp_reg__\n" \
>> "    rol %D[n]\n" \
>> "    rol %C[n]\n" \
>> "    rol %B[n]\n" \
>> "    cp __tmp_reg__,%A[d]\n" \
>> "    cpc %D[n],%B[d]\n" \
>> "    cpc %C[n],%C[d]\n" \
>> "    cpc %B[n],%D[d]\n" \
>> "    brcs zero_bit_" #i "__%=\n" \
>> "    sub __tmp_reg__,%A[d]\n" \
>> "    sbc %D[n],%B[d]\n" \
>> "    sbc %C[n],%C[d]\n" \
>> "    sbc %B[n],%D[d]\n" \
>> "    ori %A[q],1<<(31-" #i ")\n" \
>> "zero_bit_" #i "__%=:\n"
>>
>> static inline uint32_t divide (uint32_t n, uint32_t d)
>> {
>>     uint32_t q = 0;
>>
>>     asm(
>>         "clr __tmp_reg__\n"
>>         DIVIDE_ITER_0_7(0)
>>         DIVIDE_ITER_0_7(1)
>>         DIVIDE_ITER_0_7(2)
>>         DIVIDE_ITER_0_7(3)
>>         DIVIDE_ITER_0_7(4)
>>         DIVIDE_ITER_0_7(5)
>>         DIVIDE_ITER_0_7(6)
>>         DIVIDE_ITER_0_7(7)
>>         DIVIDE_ITER_8_15(8)
>>         DIVIDE_ITER_8_15(9)
>>         DIVIDE_ITER_8_15(10)
>>         DIVIDE_ITER_8_15(11)
>>         DIVIDE_ITER_8_15(12)
>>         DIVIDE_ITER_8_15(13)
>>         DIVIDE_ITER_8_15(14)
>>         DIVIDE_ITER_8_15(15)
>>         DIVIDE_ITER_16_23(16)
>>         DIVIDE_ITER_16_23(17)
>>         DIVIDE_ITER_16_23(18)
>>         DIVIDE_ITER_16_23(19)
>>         DIVIDE_ITER_16_23(20)
>>         DIVIDE_ITER_16_23(21)
>>         DIVIDE_ITER_16_23(22)
>>         DIVIDE_ITER_16_23(23)
>>         DIVIDE_ITER_24_30(24)
>>         DIVIDE_ITER_24_30(25)
>>         DIVIDE_ITER_24_30(26)
>>         DIVIDE_ITER_24_30(27)
>>         DIVIDE_ITER_24_30(28)
>>         DIVIDE_ITER_24_30(29)
>>         DIVIDE_ITER_24_30(30)
>>         "    lsl %A[n]\n"
>>         "    rol __tmp_reg__\n"
>>         "    rol %D[n]\n"
>>         "    rol %C[n]\n"
>>         "    rol %B[n]\n"
>>         "    cp __tmp_reg__,%A[d]\n"
>>         "    cpc %D[n],%B[d]\n"
>>         "    cpc %C[n],%C[d]\n"
>>         "    cpc %B[n],%D[d]\n"
>>         "    sbci %A[q],-1\n"
>>
>>         : [q] "=&a" (q),
>>           [n] "=&r" (n)
>>         : "[q]" (q),
>>           "[n]" (n),
>>           [d] "r" (d)
>>     );
>>
>>     return q;
>> }
>>
>> // instructions:
>> // 4 (init q) + 1 + 8 * 9 + 8 * 11 + 8 * 13 + 7 * 15 + 10 = 384
>>
>> volatile uint32_t test1;
>> volatile uint32_t test2;
>> volatile uint32_t test3;
>>
>> int main ()
>> {
>>     test3 = divide(test1, test2);
>>     //test3 = test1 / test2;
>> }
>>
>> On Sat, Jun 8, 2013 at 4:31 PM, Weddington, Eric
>> <address@hidden> wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: address@hidden
>>>> [mailto:address@hidden On
>>>> Behalf Of Ambroz Bizjak
>>>> Sent: Friday, June 07, 2013 8:34 PM
>>>> To: address@hidden
>>>> Subject: [avr-libc-dev] Faster integer division
>>>>
>>>> Hi!
>>>>
>>>> I've found 32bit integer division in gcc too slow, and I managed to
>>>> implement division in asm that's faster than what gcc 4.8 produces at
>>>> -O4, about 2/3 the time (but I'm not sure if any of this is due to
>>>> inlining). The algorithm is restoring division but unrolled and
>>>> heavily optimized. Could this get in gcc or avr-libc, wherever the
>>>> right place is?
>>>>
>>>
>>> Hi Ambroz,
>>>
>>> What's difficult is to balance different needs of different users. For the 
>>> most part, and there are always exceptions to this, most of the AVR GCC 
>>> users are interested in code size, rather than speed. They would rather see 
>>> the smallest way to do division, even if it is slower. But I also recognize 
>>> that there are definitely times where speed is the most desired.
>>>
>>> So given that, where do you think it best for your algorithm to generated? 
>>> Should it be generated at a specific optimization level in gcc? Should it 
>>> be a specialized inline-assembly function call in avr-libc that is 
>>> specifically called by the user?
>>>
>>> You said that it's faster than what gcc produces about 2/3 of the time, but 
>>> you're not sure if that is due to inlining. Can you do some further testing 
>>> to see if it's due to inlining? Or whether it's because your algorithm is 
>>> better? This is important to know, to see if it's worth the time and effort 
>>> to get it in gcc or avr-libc.
>>>
>>> Thanks,
>>> Eric



reply via email to

[Prev in Thread] Current Thread [Next in Thread]