[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[avr-gcc-list] Missing compiler optimizations?
From: |
Erik Walthinsen |
Subject: |
[avr-gcc-list] Missing compiler optimizations? |
Date: |
Sat, 21 Jan 2006 22:58:08 -0800 |
User-agent: |
Debian Thunderbird 1.0.7 (X11/20051017) |
I'm going through and converting select functions in my program to
assembly, sometimes for code size, sometimes for speed. Usually I get
about 2x improvement in one category or the other, non infrequently both
at the same time.
The scary thing is that I'm looking at what GCC is producing (both
3.4.4[avr-patched] and 4.0.2[un-patched]) and some of it is missing
really *basic* optimization concepts. It's a little shocking to me that
4.0 is generally slightly worse than 3.4, which is not doing very great
to begin with.
For example:
#define PARTPINS 28
uint8_t pinconfig[PARTPINS];
void load_eeprom_defaults() {
uint8_t i;
uint8_t *ee = 0, *ram;
ram = (uint8_t *)pinconfig;
for (i=0;i<PARTPINS;i++) {
pinconfig[i] = bootloader_eeprom_read_byte(ee++);
}
}
With gcc 3.4.4 I get:
void load_eeprom_defaults() {
678: ef 92 push r14
67a: ff 92 push r15
67c: 1f 93 push r17
67e: cf 93 push r28
680: df 93 push r29
uint8_t i;
uint8_t *ee = 0, *ram;
682: c0 e0 ldi r28, 0x00 ; 0
684: d0 e0 ldi r29, 0x00 ; 0
/* pinconfig */
ram = (uint8_t *)pinconfig;
686: 8f e3 ldi r24, 0x3F ; 63
688: e8 2e mov r14, r24
68a: 81 e0 ldi r24, 0x01 ; 1
68c: f8 2e mov r15, r24
68e: 1b e1 ldi r17, 0x1B ; 27
for (i=0;i<PARTPINS;i++) {
*(ram++) = bootloader_eeprom_read_byte(ee++);
690: ce 01 movw r24, r28
692: 21 96 adiw r28, 0x01 ; 1
694: a6 d9 rcall .-3252 ; 0xfffff9e2
<__eeprom_end+0xff7ef9e2>
696: f7 01 movw r30, r14
698: 81 93 st Z+, r24
69a: 7f 01 movw r14, r30
69c: 11 50 subi r17, 0x01 ; 1
69e: 17 ff sbrs r17, 7
6a0: f7 cf rjmp .-18 ; 0x690
<load_eeprom_defaults+0x18>
6a2: df 91 pop r29
6a4: cf 91 pop r28
6a6: 1f 91 pop r17
6a8: ff 90 pop r15
6aa: ef 90 pop r14
6ac: 08 95 ret
}
GCC 4.0.2 is a bittle better, shaving two push/pops but gaining an
instruction somewhere else.
By hand, I get:
.global load_eeprom_defaults_asm
load_eeprom_defaults_asm:
/* Registers:
r31:r30 ram address
r27:r26 eeprom address
r25:r24 eeprom_read_byte address argument
r24 eeprom read byte
r23 i (port)
*/
ldi r26, lo8(0) /* load zero EEPROM address */
ldi r27, hi8(0)
/* load the pinconfig array */
ldi r30, lo8(pinconfig) /* load Z with pinconfig */
ldi r31, hi8(pinconfig)
ldi r23, PARTPINS /* start loop with PARTPINS */
pinconfigLoop:
movw r24, r26 /* copy EEPROM addr to argument */
rcall eeprom_read_byte
st Z+, r24 /* store the result in pinconfig[] */
adiw r26, 1 /* increment the EEPROM addr */
dec r23 /* decrement pin count */
brne pinconfigLoop /* ...and loop */
ret
That's a reduction from 27 instructions to 12, and *zero* stack space.
Run time (non-critical in this case, but...) should be ~15 cycles
instead of ~40.
Right off the bat the compiler is making use of mostly low registers, as
well as Y (r29:r28), all of which have to be saved. Why isn't it using
the high registers that are caller-saved (r18-r27, r30-r31) instead, and
saving more than a third of the code size and half the cycle count just
in the push/pops?
The use of low registers has a number of other major disadvantages.
First off, in loading the 'ram' pointer, it has to ldi into high
registers (ldi only works on r16-r31), then mov into the low registers.
Another one, not seen in this particular example unless you switch
from *(ram++) to pinconfig[i] as the lvalue, is the loss of adiw.
Because the compiler stores things in low registers, it has to use
subi+subc of 0xffff just to increment the pointer by one. adiw works if
you use the top 4 pairs of registers, 3 of which are conveniently
caller-saved.
In other cases I've seen all kinds of relatively common idioms get
translated very poorly. A classic one is constructing a word from two
bytes, which in my case happens as TWI interrupts feed them to me one at
a time:
w = b | (c<<8);
68: 99 27 eor r25, r25
6a: 33 27 eor r19, r19
6c: 32 2f mov r19, r18
6e: 22 27 eor r18, r18
70: 82 2b or r24, r18
72: 93 2b or r25, r19
w is 16 bits, b and c are both 8. This should be doable with just two
mov's, but instead it takes 6 instructions. I understand that it's
promoting the variables, but it's still inefficient.
I understand that the larger issues of register selection are rather
systemic and thus probably hard to solve, but a lot of these smaller
idioms seem like they're just crying out for a simple RTL match that can
generate efficient code. I know next to nothing about gcc specifically
and compilers in general, but I know gcc has a peephole optimizer, and
AFAIU that's what's supposed to solve these cases.
What would it take in order to collect and implement improvements to
idioms like this? I run into them fairly regularly, and if someone
knows how to convert these things into something the compiler can use,
I'll gladly write them up and document what I find. Heck, if it's easy
enough to add them to the compiler I'll do the whole thing.
TIA,
Omega
aka Erik Walthinsen
address@hidden
- [avr-gcc-list] Missing compiler optimizations?,
Erik Walthinsen <=