[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [avr-gcc-list] 16 Bit Store Optimizations
From: |
Georg-Johann Lay |
Subject: |
Re: [avr-gcc-list] 16 Bit Store Optimizations |
Date: |
Wed, 21 Dec 2016 11:45:24 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 |
On 19.12.2016 22:24, Thomas Watson wrote:
Hello all,
A frequent need in my code is to combine two 8 bit variables into a 16 bit
variable. I am trying to determine the optimal way to do this. The naïve way
and a more clever way both generate extra instructions that could be optimized
away. I include a test case and comments which explain the setup and issue in
more detail. It seems this is a missed opportunity for optimization in the
compiler.
Thomas
#include <avr/io.h>
#include <inttypes.h>
#include <avr/interrupt.h>
/*
$ avr-gcc -v
Using built-in specs.
COLLECT_GCC=avr-gcc
COLLECT_LTO_WRAPPER=/usr/local/Cellar/avr-gcc/4.9.2/libexec/gcc/avr/4.9.2/lto-wrapper
Target: avr
Configured with: ../configure --enable-languages=c,c++ --target=avr
--disable-libssp --disable-nls --with-dwarf2
--prefix=/usr/local/Cellar/avr-gcc/4.9.2
--with-gmp=/usr/local/Cellar/gmp/6.0.0a
--with-mpfr=/usr/local/Cellar/mpfr/3.1.2-p10
--with-mpc=/usr/local/Cellar/libmpc/1.0.2
--datarootdir=/usr/local/Cellar/avr-gcc/4.9.2/share
--bindir=/usr/local/Cellar/avr-gcc/4.9.2/bin --with-as=/usr/local/bin/avr-as
--with-ld=/usr/local/bin/avr-ld
Thread model: single
gcc version 4.9.2 (GCC)
*/
// compile like:
// avr-gcc -mmcu=atmega328p -std=gnu99 -Os -Wall -DF_CPU=16000000
-Wa,-ahlmsd=sixteenbit_test.lst -o sixteenbit_test.elf sixteenbit_test.c
// data storage memory (might be used in ISR for example)
volatile uint16_t data;
// read and return a byte from the serial port
uint8_t read_byte() {
while (!(UCSR0A & _BV(RXC0)));
return (uint8_t)UDR0;
}
int main() {
cli();
// init serial port
// etc
sei();
uint8_t temp_hi, temp_lo;
// receive a word with |
temp_hi = read_byte();
temp_lo = read_byte();
cli(); // not okay to get interrupted while assigning
// in case an ISR comes and tries to read 'data'
/* compiles to
47 0012 282F mov r18,r24
48 0014 30E0 ldi r19,0
49 0016 C901 movw r24,r18
50 0018 9C2B or r25,r28
51 001a 9093 0000 sts data+1,r25
52 001e 8093 0000 sts data,r24
where r28 is temp_hi and r24 is temp_lo
8 bytes and 4 cycles worse than the good solution
*/
data = (temp_hi<<8) | temp_lo;
Short answer: This are around 5 operations:
- promote temp_hi to int
- promote temp_lo to int
- shift int left by 8
- or'ing two int values
- assign 16-bit result to data
Long answer:
This is known for some time now, cf.
http://gcc.gnu.org/PR41076
The problem is that 16-bit operations cannot just be split
into 8-bit operations because that usually leads to code bloat,
and splitting before register allocation will come up with
quite some 8-bit subregs of the 16-bit containers, which
would result in unpleasant register allocation and more
moves all over the place.
So one option is to add ugly post reg-alloc split patterns like
r242907 that split the combined instructions after register
allocation. This gives some improvement but still no perfect code.
In particular, in order to store to a 16-bit value two registers
which are /not/ a 16-bit value (but 2 free floating 8-bit values)
the stores would also have to be split. However, such a splitter
would never touch accesses with side effects (volatile) as in
your example.
The latest change is upstream since 2016-11-28 to trunk, which
is future v7 (released around spring 2017).
https://gcc.gnu.org/r242907
Feeding the following test case into respective compiler
#include <stdint.h>
#include <avr/interrupt.h>
extern volatile uint16_t data;
// read and return a byte from the serial port
extern uint8_t read_byte(void);
void test1 (void)
{
uint8_t temp_hi, temp_lo;
// receive a word with |
temp_hi = read_byte();
temp_lo = read_byte();
cli();
data = (temp_hi<<8) | temp_lo;
sei();
}
and then
$ avr-gcc-7 foo.c -mmcu=atmega328 -dp -save-temps -Os
spits out the following foo.s assembly output:
test1:
push r28 ; 22 pushqi1/1 [length = 1]
call read_byte ; 5 call_value_insn/2 [length = 2]
mov r28,r24 ; 6 movqi_insn/1 [length = 1]
call read_byte ; 7 call_value_insn/2 [length = 2]
/* #APP */
cli
/* #NOAPP */
mov r25,r28 ; 20 movqi_insn/1 [length = 1]
sts data+1,r25 ; 14 *movhi/4 [length = 4]
sts data,r24
/* #APP */
sei
/* #NOAPP */
pop r28 ; 25 popqi [length = 1]
ret ; 26 return_from_epilogue [length = 1]
Which is still not optimal due to insn 20 which is the last bit
of composing two 8-bit values into one 16-bit value for storage
(insn 14).
sei(); // can get interrupted again
// receive a word with pointer assignment
// ugly and still does not compile how I want
temp_hi = read_byte();
temp_lo = read_byte();
cli(); // not okay to get interrupted while assigning
// in case an ISR comes and tries to read 'data'
/* compiles to
66 0030 E0E0 ldi r30,lo8(data)
67 0032 F0E0 ldi r31,hi8(data)
68 0034 8083 st Z,r24
69 0036 C183 std Z+1,r28
2 cycles worse than the good solution
*/
*((uint8_t*)&data) = temp_lo;
*((uint8_t*)&data+1) = temp_hi;
This is throwing away the volatile qualification. And
type-punning through pointers is something that's really
discouraged.
If it's of utmost importance that a specific sequence is
generated, you can use inline assembler any wrap it into
a function for encapsulation and re-usage. With the same
headers, declarations and compilation as above:
static inline __attribute__((__always_inline__))
void atomic_store16 (uint16_t volatile *addr, uint8_t lo, uint8_t hi)
{
cli();
__asm volatile ("sts %3+1,%[hi]" "\n\t"
"sts %3,%[lo]" "\n\t"
: "=m" (*addr)
: [hi] "r" (hi), [lo] "r" (lo), "i" (addr));
sei();
}
void test4 (void)
{
uint8_t temp_hi = read_byte();
atomic_store16 (&data, read_byte(), temp_hi);
}
Which avr-gcc compiles to
test4:
push r28 ; 15 pushqi1/1 [length = 1]
call read_byte ; 5 call_value_insn/2 [length = 2]
mov r28,r24 ; 6 movqi_insn/1 [length = 1]
call read_byte ; 7 call_value_insn/2 [length = 2]
/* #APP */
cli
sts data+1,r28
sts data,r24
sei
/* #NOAPP */
pop r28 ; 18 popqi [length = 1]
ret ; 19 return_from_epilogue [length = 1]
The problem is the ugly, non-portable code and that it might
hiccup with -O0 and depends on data living in static storage.
Yet another ugly solution is type-punning through a union
(which is explicitly supported by GCC) as follows:
static inline __attribute__((__always_inline__))
void atomic_union16 (uint16_t volatile *addr, uint8_t lo, uint8_t hi)
{
union { uint16_t word; uint8_t byte[2]; }
val = { .byte[0] = lo, .byte[1] = hi };
*addr = val.word;
}
void test5 (void)
{
uint8_t temp_hi = read_byte();
atomic_union16 (&data, read_byte(), temp_hi);
}
The code is the same as with test4 from above.
Johann
sei(); // can get interrupted again
/*
Why won't the compiler compile it as
sts data, r24
sts data+1, r28
It will, in the second case, if it feels Z and Y are otherwise occupied.
But I couldn't get to think that for a reasonably short sample.
It also will work correctly in this case for -O2.
*/
}