Re: [avr-gcc-list] Speed challenge...
From: Peter N Lewis
Subject: Re: [avr-gcc-list] Speed challenge...
Date: Fri, 26 Apr 2002 19:52:35 +0800
> Hi. I have a challenge. The (working) code below outputs 19 bits
> serially from a 4433 uC to another chip using data, clock, and
> strobe lines. The data comes from 3 lookup arrays of 256 bytes each.
> At 4MHz on a 4433 uC, it takes about 87us to send all 19 bits. I
> wonder if there is a faster way of doing the job?
Ok, well, firstly, the code is broken into three identical chunks
(clearly you block copied them given the comment "send first 8 bits"
is repeated three times ;-). So look at just one chunk and optimize
that.
    for (i=0; i<=7; i++) {    // send first 8 bits
        __cbi(PORTD,5);       // set clock line low
        if (mask & data)      // present data
            __sbi(PORTD,4);
        else
            __cbi(PORTD,4);
        __sbi(PORTD,5);       // clock the data bit
        mask >>= 1;
    }
Compiles (with -O2) to:
.L9:
/* #APP */
        cbi 18,5
/* #NOAPP */
        mov r24,r18
        and r24,r19
        brne _PC_+2
        rjmp .L7
/* #APP */
        sbi 18,4
/* #NOAPP */
.L8:
/* #APP */
        sbi 18,5
/* #NOAPP */
        lsr r18
        subi r25,lo8(-(1))
        cpi r25,lo8(8)
        brlo .L9
.L7:
/* #APP */
        cbi 18,4
/* #NOAPP */
        rjmp .L8
Several things spring to mind immediately. The if/else is causing a
lot of jumps, which is probably not helping. And cbi/sbi take 2
cycles each, compared to 1 cycle for a single out.
Cycle counts:
  1 cycle: out,mov,and,lsr,subi,cpi
  2 cycles: cbi,sbi,rjmp
  brne,brlo: 1 if branch not taken, 2 if branch taken
So currently your loop takes (assuming 4 on and 4 off bits) around
  2 + 1 + 1 + average(1 + 2 + 2 + 2, 2 + 2) + 2 + 1 + 1 + 1 + 2
  = 16.5 cycles per bit, or about 313 cycles per 19 bits, which at 4 MHz
is 78 us, about in line with your timing.
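If you want to redo that arithmetic for other variants, here is a small host-side sketch in plain C (nothing AVR specific). The function names are my own, and the per-instruction costs are just the ones listed above.

```c
/* Time in microseconds to shift out `bits` bits at `cycles_per_bit`
   on a core clocked at `f_mhz` MHz (1 MHz = 1 cycle per microsecond). */
double shift_time_us(double cycles_per_bit, unsigned bits, double f_mhz)
{
    return cycles_per_bit * bits / f_mhz;
}

/* Average cost of one pass through the original loop, assuming half
   the bits are set and half are clear. */
double original_cycles_per_bit(void)
{
    double bit_clear = 1 + 2 + 2 + 2;  /* brne not taken, rjmp, cbi, rjmp */
    double bit_set   = 2 + 2;          /* brne taken, sbi */
    double branchy   = (bit_clear + bit_set) / 2.0;

    /* cbi, mov, and, <if/else part>, sbi, lsr, subi, cpi, brlo taken */
    return 2 + 1 + 1 + branchy + 2 + 1 + 1 + 1 + 2;
}
```

Calling shift_time_us(original_cycles_per_bit(), 19, 4.0) gives the estimate for the whole 19-bit transfer. It ignores the table lookups and the strobe, so treat it as a lower bound.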
So, how to improve it? Each cycle saved in the loop will save roughly
1/16 of the time.
We can get rid of the index counter by using the mask as the loop
condition:
    do {                      // send first 8 bits
        /* loop body as before */
    } while ( mask >>= 1 );
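To convince yourself the mask really does count the bits, here is a tiny host-side sketch (the function name is made up, and I am assuming the mask starts at 0x80 for an MSB-first byte, which your posted loop implies but does not show):

```c
#include <stdint.h>

/* Count how many times the body runs when the mask itself ends the loop. */
unsigned count_mask_iterations(uint8_t mask)
{
    unsigned n = 0;
    do {
        n++;               /* stands in for sending one bit */
    } while (mask >>= 1);  /* becomes 0 once the lowest bit is shifted out */
    return n;
}
```

Starting from 0x80 the body runs 8 times. Note that the mask is 0 afterwards, so it has to be reloaded before the next chunk.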
Next, we need to get rid of the else and the cbi/sbis. We can do
this by remembering the value of PORTD in a variable "u08 d;". Just
before the first loop, initialize it with "d = inp(PORTD);", and then:
    do {                          // send first 8 bits
        d &= ~(BV(5)|BV(4));      // clear bit 4 & 5
        if (data & mask) {
            d |= BV(4);
        }
        outb(d, PORTD );          // set clock line low and present the data
        d |= BV(5);
        outb(d, PORTD );          // clock the data bit
    } while ( mask >>= 1 );
One consequence of this change is that it assumes the data is read on
the rising edge of the clock bit, because the data line now changes in
the same write that takes the clock low. If the data is read "while the
clock is high", this code would not work. The assembly now looks like:
.L3:
        andi r25,lo8(-49)
        mov r24,r19
        and r24,r18
        breq .L6
        ori r25,lo8(16)
.L6:
/* #APP */
        out 18,r25
/* #NOAPP */
        ori r25,lo8(32)
/* #APP */
        out 18,r25
/* #NOAPP */
        lsr r18
        brne .L3
This takes 11 cycles per bit (saving about 26 us over the 19 bits).
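If you want to check the shadow-variable loop without hardware, something like the following host-side sketch would do. Everything here is an assumption for illustration: receiver_t and port_write are a made-up stand-in for the other chip (sampling data on the rising clock edge, MSB first), and the mask starting at 0x80 is my guess at your setup.

```c
#include <stdint.h>

#define DATA_BIT  (1u << 4)   /* PD4: data line, as in the code above */
#define CLOCK_BIT (1u << 5)   /* PD5: clock line */

/* Hypothetical model of the receiving chip. */
typedef struct {
    uint8_t last_port;   /* previous port value, used to detect edges */
    uint8_t shifted;     /* bits received so far, MSB first */
} receiver_t;

/* Stand-in for the write to PORTD: sample data on a rising clock edge. */
static void port_write(receiver_t *rx, uint8_t value)
{
    int rising = !(rx->last_port & CLOCK_BIT) && (value & CLOCK_BIT);
    if (rising)
        rx->shifted = (uint8_t)((rx->shifted << 1) |
                                ((value & DATA_BIT) ? 1 : 0));
    rx->last_port = value;
}

/* The shadow-variable loop: two port writes per bit.
   Returns the final port value so the untouched pins can be checked. */
uint8_t send_byte_shadow(receiver_t *rx, uint8_t port_initial, uint8_t data)
{
    uint8_t d = port_initial;
    uint8_t mask = 0x80;
    do {
        d &= (uint8_t)~(DATA_BIT | CLOCK_BIT);
        if (data & mask)
            d |= DATA_BIT;
        port_write(rx, d);   /* clock low, data presented */
        d |= CLOCK_BIT;
        port_write(rx, d);   /* rising edge clocks the bit */
    } while (mask >>= 1);
    return d;
}
```

Start with a zeroed receiver_t, send a byte, and compare rx.shifted with what you sent. The other PORTD pins keep whatever was in port_initial, but only as long as nothing else (an interrupt handler, say) changes PORTD while d is in use.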
Unrolling the loop would save two cycles per bit.
Converting to assembler could probably save a couple more cycles.
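For what it is worth, an unrolled version could look roughly like the sketch below. This is host-side C so the write sequence can be inspected: port_trace_t, trace_out and SEND_BIT are names I made up, and on the AVR trace_out would be replaced by the write to PORTD. I have not compiled this for the 4433, so the actual saving depends on what the compiler does with the constant masks.

```c
#include <stddef.h>
#include <stdint.h>

#define UNROLL_DATA  (1u << 4)   /* PD4: data line */
#define UNROLL_CLOCK (1u << 5)   /* PD5: clock line */

/* Records every port write so the sequence can be inspected on a host. */
typedef struct {
    uint8_t writes[16];
    size_t  count;
} port_trace_t;

static void trace_out(port_trace_t *t, uint8_t value)
{
    if (t->count < sizeof t->writes)
        t->writes[t->count++] = value;
}

/* One bit with a compile-time constant mask: no shift, no loop branch. */
#define SEND_BIT(t, d, data, mask)                          \
    do {                                                    \
        (d) &= (uint8_t)~(UNROLL_DATA | UNROLL_CLOCK);      \
        if ((data) & (mask))                                \
            (d) |= UNROLL_DATA;                             \
        trace_out((t), (d));   /* clock low, data out */    \
        (d) |= UNROLL_CLOCK;                                \
        trace_out((t), (d));   /* rising edge */            \
    } while (0)

/* Sends 8 bits MSB first; d is the shadow copy of the port. */
uint8_t send_byte_unrolled(port_trace_t *t, uint8_t d, uint8_t data)
{
    SEND_BIT(t, d, data, 0x80);
    SEND_BIT(t, d, data, 0x40);
    SEND_BIT(t, d, data, 0x20);
    SEND_BIT(t, d, data, 0x10);
    SEND_BIT(t, d, data, 0x08);
    SEND_BIT(t, d, data, 0x04);
    SEND_BIT(t, d, data, 0x02);
    SEND_BIT(t, d, data, 0x01);
    return d;
}
```

The cost is code size: eight copies of the body per byte instead of one.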
HTH,
Peter.
--
<http://www.interarchy.com/> <ftp://ftp.interarchy.com/interarchy.hqx>
avr-gcc-list at http://avr1.org