[avr-gcc-list] avr-gcc sub-optimal code with -ftree-loop-optimize

(this is gcc-optimize-bug.txt)

I have this relatively straightforward implementation of a couple of pins'
worth of software PWM:

void pwmcycle(void)
{
    unsigned char pwm1, pwm2, pwm3, pwm4, pwm5, level_delay;
    unsigned char pwm_delay;

    getbright();
    pwm1 = bright1;
    pwm2 = bright2;
    pwm3 = bright3;
    pwm4 = bright4;
    pwm5 = bright5;
    led_all_on();
    for (pwm_delay = 128; pwm_delay != 0; pwm_delay--) {
        /*
         * Rather standard software PWM loop.
         */
        if (--pwm1 == 0) {
            led1_off();
        }
        if (--pwm2 == 0) {
            led2_off();
        }
        if (--pwm3 == 0) {
            led3_off();
        }
        if (--pwm4 == 0) {
            led4_off();
        }
        if (--pwm5 == 0) {
            led5_off();
        }
    }
}

When compiled with avr-gcc 4.6.2, it produces rather strange (but correct) code
for the loop:

/usr/local/CrossPack-AVR-20121207/bin/avr-gcc -c -mmcu=atmega8 -g -Os \
          gcc-optimize-bug.c -save-temps=obj -o gcc-optimize-bug-Os.o

   c:    00 d0           rcall    .+0          ; 0xe <pwmcycle+0xe>
   e:    c0 91 00 00     lds    r28, 0x0000   ;;pwm1
12:    f0 90 00 00     lds    r15, 0x0000   ;;pwm2
16:    00 91 00 00     lds    r16, 0x0000   ;;pwm3
1a:    10 91 00 00     lds    r17, 0x0000   ;;pwm4
1e:    d0 91 00 00     lds    r29, 0x0000   ;;pwm5
22:    00 d0           rcall    .+0          ; 0x24 <pwmcycle+0x24>
24:    80 e8           ldi    r24, 0x80    ; 128
26:    e8 2e           mov    r14, r24
28:    fc 1a           sub    r15, r28
2a:    0c 1b           sub    r16, r28
2c:    1c 1b           sub    r17, r28
2e:    dc 1b           sub    r29, r28
30:    c1 50           subi    r28, 0x01    ; 1
32:    01 f4           brne    .+0          ; 0x34 <pwmcycle+0x34>
34:    00 d0           rcall    .+0          ; 0x36 <pwmcycle+0x36>
36:    8f 2d           mov    r24, r15
38:    8c 0f           add    r24, r28
3a:    01 f4           brne    .+0          ; 0x3c <pwmcycle+0x3c>
3c:    00 d0           rcall    .+0          ; 0x3e <pwmcycle+0x3e>
3e:    80 2f           mov    r24, r16
40:    8c 0f           add    r24, r28
42:    01 f4           brne    .+0          ; 0x44 <pwmcycle+0x44>
44:    00 d0           rcall    .+0          ; 0x46 <pwmcycle+0x46>
       :
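
Read back into C, the loop above seems to keep pwm2..pwm5 as offsets from
pwm1, decrement only pwm1, and rebuild each counter with an add before testing
it. The following is a host-side sketch of that reading, not compiler output;
the names off_iter_direct, off_iter_offset, forms_agree and self_check are
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Source form: the counter is decremented directly.
 * Returns the pass (1..128) on which the LED would be switched off,
 * or 0 if the counter never reaches zero within the 128 passes. */
uint8_t off_iter_direct(uint8_t pwm)
{
    uint8_t i;
    for (i = 1; i <= 128; i++) {
        if (--pwm == 0)
            return i;
    }
    return 0;
}

/* Form suggested by the listing: only the offset from pwm1 is kept,
 * pwm1 alone is decremented, and the counter is rebuilt on every pass. */
uint8_t off_iter_offset(uint8_t pwm1, uint8_t pwm)
{
    uint8_t diff = (uint8_t)(pwm - pwm1);    /* sub  r15, r28 (before the loop) */
    uint8_t i;
    for (i = 1; i <= 128; i++) {
        pwm1--;                              /* subi r28, 0x01 */
        if ((uint8_t)(diff + pwm1) == 0)     /* mov r24, r15 ; add r24, r28 */
            return i;
    }
    return 0;
}

/* Returns 1 if both forms switch off on the same pass for every
 * 8-bit starting value of pwm1 and of the other counter. */
int forms_agree(void)
{
    unsigned a, b;
    for (a = 0; a < 256; a++) {
        for (b = 0; b < 256; b++) {
            if (off_iter_direct((uint8_t)b) !=
                off_iter_offset((uint8_t)a, (uint8_t)b))
                return 0;
        }
    }
    return 1;
}

/* Hypothetical self-check for a host build. */
void self_check(void)
{
    assert(forms_agree());
}
```

The two forms agree because all arithmetic is modulo 256, which matches the
"strange but correct" observation: the rewrite changes nothing functionally
but costs a mov and an add per test where a single dec would do.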

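I guess this is some sort of loop optimization: the compiler appears to hold
pwm2 through pwm5 as offsets from pwm1 and to add pwm1 back in before each
test. That makes the code hard to relate to the original, and it is also not
very "good": each test needs a mov and an add where a single decrement would
do. I can get more obvious, and significantly smaller and faster, code by
turning off tree-loop-optimize: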

(note that -ftree-loop-optimize is turned ON by default starting at -O1)

/usr/local/CrossPack-AVR-20121207/bin/avr-gcc -c -mmcu=atmega8 -g -Os \
          gcc-optimize-bug.c -fno-tree-loop-optimize -save-temps=obj \
          -o gcc-optimize-bug-notree.o

   c:    00 d0           rcall    .+0          ; 0xe <pwmcycle+0xe>
   e:    e0 90 00 00     lds    r14, 0x0000
12:    f0 90 00 00     lds    r15, 0x0000
16:    00 91 00 00     lds    r16, 0x0000
1a:    10 91 00 00     lds    r17, 0x0000
1e:    d0 91 00 00     lds    r29, 0x0000
22:    00 d0           rcall    .+0          ; 0x24 <pwmcycle+0x24>
24:    c0 e8           ldi    r28, 0x80    ; 128
26:    ea 94           dec    r14
28:    01 f4           brne    .+0          ; 0x2a <pwmcycle+0x2a>
2a:    00 d0           rcall    .+0          ; 0x2c <pwmcycle+0x2c>
2c:    fa 94           dec    r15
2e:    01 f4           brne    .+0          ; 0x30 <pwmcycle+0x30>
30:    00 d0           rcall    .+0          ; 0x32 <pwmcycle+0x32>
32:    01 50           subi    r16, 0x01    ; 1
34:    01 f4           brne    .+0          ; 0x36 <pwmcycle+0x36>
36:    00 d0           rcall    .+0          ; 0x38 <pwmcycle+0x38>
     :
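
If switching the pass off for the whole file is too broad, GCC's optimize
function attribute might limit it to one function. This is a minimal sketch
under assumptions: it relies on the documented rule that attribute strings not
starting with "O" are treated as -f options, it has not been checked against
avr-gcc 4.6.2 to confirm it yields the same code as the command-line flag, and
the GCC manual recommends this attribute mainly for debugging. The LED
stand-ins, pwmcycle_sketch and demo are hypothetical host-side names, and the
loop is cut down to two channels.

```c
#include <assert.h>

/* Apply the attribute only on real GCC; other compilers get a no-op. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_TREE_LOOP_OPT __attribute__((optimize("no-tree-loop-optimize")))
#else
#define NO_TREE_LOOP_OPT
#endif

/* Hypothetical stand-ins for the brightness values and LED pins. */
unsigned char bright1, bright2;
unsigned char led1_on, led2_on;

static void led_all_on(void) { led1_on = 1; led2_on = 1; }
static void led1_off(void)   { led1_on = 0; }
static void led2_off(void)   { led2_on = 0; }

/* Two-channel version of the PWM loop with the pass disabled
 * for this function only (assumption: same effect as the flag). */
NO_TREE_LOOP_OPT
void pwmcycle_sketch(void)
{
    unsigned char pwm1 = bright1;
    unsigned char pwm2 = bright2;
    unsigned char pwm_delay;

    led_all_on();
    for (pwm_delay = 128; pwm_delay != 0; pwm_delay--) {
        if (--pwm1 == 0) {
            led1_off();
        }
        if (--pwm2 == 0) {
            led2_off();
        }
    }
}

/* Host-side usage: a small duty value turns its LED off during the
 * cycle, a value above 128 leaves it on for the whole cycle. */
void demo(void)
{
    bright1 = 3;
    bright2 = 200;
    pwmcycle_sketch();
    assert(led1_on == 0);
    assert(led2_on == 1);
}
```

Comparing the disassembly of such a function with the -fno-tree-loop-optimize
build would show whether the attribute really has the same effect.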

I found http://gcc.gnu.org/onlinedocs/gccint/Tree-SSA-passes.html, which
describes the optimizations done in tree-ssa-loop.c; I assume that is what
this option controls. Some of them sound useful. But this also looks like a
case where high-level optimizations aimed at processors with vectorization
capabilities (?) make it difficult for the code generators of smaller
processors with the usual instruction sets to produce good code. Is there
anything that can be done? Can vectorizing optimizations (if they turn out
to be guilty) be turned off for processors that don't have any vectorization
ability?

Full source, intermediate, object, and list files are on Google Docs:
https://docs.google.com/file/d/0B6dMB5dovDUZRlhzdlZWTk9mTWc/edit?usp=sharing

(FWIW, I get the same sort of non-optimal obfuscation using the msp430-gcc
compiler, which is also based on 4.6.x)

From:	Bill Westfield
Subject:	[avr-gcc-list] avr-gcc sub-optimal code with -ftree-loop-optimize - fixable?
Date:	Fri, 29 Mar 2013 21:28:31 -0700
