[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [avr-gcc-list] Poor handling of non-power-of-2 multiplication on ATt
From: |
Georg-Johann Lay |
Subject: |
Re: [avr-gcc-list] Poor handling of non-power-of-2 multiplication on ATtiny |
Date: |
Mon, 06 Apr 2015 18:40:02 +0200 |
User-agent: |
Thunderbird 2.0.0.24 (Windows/20100228) |
Paul "LeoNerd" Evans schrieb:
I'm writing code for an ATtiny (tiny84 specifically).
I currently have an array and a function defined thus:
static struct {
uint8_t state;
uint8_t v1;
} leds[3];
uint8_t led_state(uint8_t i) { return leds[i].state; }
Elements of this array are 2 bytes long, and this function compiles to
some fairly sensible looking code:
000003b8 <led_state>:
3b8: e8 2f mov r30, r24
3ba: f0 e0 ldi r31, 0x00 ; 0
3bc: ee 0f add r30, r30
3be: ff 1f adc r31, r31
3c0: e1 57 subi r30, 0x71 ; 113
3c2: ff 4f sbci r31, 0xFF ; 255
3c4: 80 81 ld r24, Z
3c6: 08 95 ret
The instruction at 3bc adds r30 to itself, effectively causing it to
contain 2*i, which is the offset in bytes required to be added to the
base address of the `leds` array to find the required field.
If instead I define my array thus:
static struct {
uint8_t state;
uint16_t v1;
} leds[3];
I get the most horrible:
000003be <led_state>:
3be: 90 e0 ldi r25, 0x00 ; 0
3c0: 63 e0 ldi r22, 0x03 ; 3
3c2: 70 e0 ldi r23, 0x00 ; 0
3c4: 2d d3 rcall .+1626 ; 0xa20 <__mulhi3>
3c6: 81 57 subi r24,0x71 ; 113
3c8: 9f 4f sbci r25, 0xFF ; 255
3ca: fc 01 movw r30, r24
3cc: 80 81 ld r24, Z
3ce: 08 95 ret
Here, gcc has given up trying to perform a non power-of-two
multiplication and just inlined a call to the generic __mulhi3 function
instead.
The code of __mulhi3 is not inlined, it's a library call. GCC has
currently no means to determine the accumulated costs of library calls
like the one emit in your example. Presumably you compiled with -Os and
the sequence above is actually shorter than expanding *3 inline.
This is somewhat disappointing, as if I further expand my array with a
dummy field for padding, I now get:
000003be <led_state>:
3be: e8 2f mov r30, r24
3c0: f0 e0 ldi r31, 0x00 ; 0
3c2: ee 0f add r30, r30
3c4: ff 1f adc r31, r31
3c6: ee 0f add r30, r30
3c8: ff 1f adc r31, r31
3ca: e1 57 subi r30, 0x71 ; 113
3cc: ff 4f sbci r31, 0xFF ; 255
3ce: 80 81 ld r24, Z
3d0: 08 95 ret
because it has neatly managed to perform that power-of-two
multiplication, this time of 4*i.
I feel this is a situation that could be resolved better; for instance,
a multiplication by 3 in the middle case could have been performed by
mov r30, r24
ldi r31, 0x00 ; Z == i
add r30, r30
adc r31, r31 ; Z == 2*i
add r30, r24
adc r31, r1 ; Z == 2*i + i == 3*i
in just as many instructions as the 4*i case, without my having to
That sequence is not as short as calling __mulhi3. If you prefer faster
code as indicated, try optimizing for speed with -O2.
waste extra bytes of precious RAM just to provide that dummy padding.
Furthermore, I care a lot about instruction counts to access because
this LED management code runs inside a timer match interrupt to
implement software PWM, so it has to be fast and tight.
As a workaround for now, I can restructure my code into two separate
arrays of integers, instead of one array of two-int structs, and thus
ensure that all arrays contain only a power-of-two number of bytes, so
all accesses will be efficient. However, it would be nice if gcc could
manage these multiplications a little nicer.
-----
As another side note, consider the massive difference between
uint8_t mul10(uint8_t x) { return x * 10; }
3be: 6a e0 ldi r22, 0x0A ; 10
3c0: 29 d3 rcall .+1618 ; 0xa14 <__mulqi3>
3c2: 08 95 ret
vs. something like
mov r0, r24 ; r0 = x
add r24, r24 ; r24 = 2*x
add r24, r24 ; r24 = 4*x
add r24, r0 ; r24 = 5*x
add r24, r24 ; r24 = 10*x
ret
Again, try -O2.
Is there currently underway an attempt to provide better optimisation
of these sorts of cases? If not, where/how could I contribute such?
None that I know of. Maybe adding a small penalty to the rtx costs for
such shifts will do the trick. The avr backend currently uses the
shortest sequence for -Os, maybe that could be relaxed for not-so-tiny
devices even in presence of -Os so that saving 1 instruction does not
blow up the execution time to the x-fold.
The compiler has just a very coarse understanding of flash size: it is
only aware of the architecture like -mmcu=avr4 so that adding libcall
penalty could be added if, say, flash >= 16 KiB, i.e. if there is a JMP
instruction (AVR_HAVE_JMP_CALL from avr.h).
Johann