Re: [avr-libc-dev] Request: gcrt1.S with empty section .init9


From: Marko Mäkelä
Subject: Re: [avr-libc-dev] Request: gcrt1.S with empty section .init9
Date: Sat, 7 Jan 2017 18:55:32 +0200
User-agent: NeoMutt/20161126 (1.7.1)

Hallo Johann,

Actually I write C and C++ code for bigger systems for a living.

Ya, but on such systems you won't dive into generated assembly and propose library changes when you come across a pair of instructions you don't need in your specific context :-P

True. Also, in big software systems consisting of multiple subsystems maintained by separate organizations, the amount of bloat is at a completely different level. I would typically not look at the generated machine code except when it shows up in a profiler.

In a hobby project, I am free to optimize every single bit of performance. I brought up the call to main, because I think that the 2 wasted stack bytes can be significant when targeting a small unit, such as the ATtiny2313 with 128 bytes of SRAM. My biggest AVR program so far was an interrupt-driven interface adapter that uses only 4 bytes of stack (it'd strictly only need 2, but I did not figure out a trick), using the remaining 124 bytes of RAM for buffers.

Many developers find llvm more attractive than gcc because it is not GPL and the newer code is "pure doctrine" C++, whereas gcc might deliberately use macros (for host performance), which many developers find disgusting.

I do not think that macros are necessarily faster than inline functions. It may be true for GCC, but with clang the opposite can hold. I recently rewrote a puzzle solver in C++14, and to my surprise, clang generated slightly faster code for the C++ than GCC did for the C where I had used macros. I am hoping to run the code on an AVR some day, just to see how fast it would do the 64-bit math:

http://www.iki.fi/~msmakela/software/pentomino/
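To illustrate the macro versus inline function comparison, here is a minimal sketch of two 64-bit bitboard helpers written both ways. The names (BOARD_FITS_M, board_fits, and so on) are hypothetical and are not taken from the pentomino solver; the point is only that both forms express the same operation, so an optimizing compiler is free to generate the same code for either.

```c
#include <assert.h>
#include <stdint.h>

/* Macro form: the arguments are substituted textually, so every use
   must be parenthesized, and there is no type checking. */
#define BOARD_FITS_M(board, piece)  ((((board) & (piece)) == 0))
#define BOARD_PLACE_M(board, piece) ((board) | (piece))

/* Inline function form: the arguments are evaluated exactly once and
   converted to uint64_t; the compiler may still expand the call inline. */
static inline int board_fits(uint64_t board, uint64_t piece)
{
    /* A piece fits if it overlaps no occupied square. */
    return (board & piece) == 0;
}

static inline uint64_t board_place(uint64_t board, uint64_t piece)
{
    /* Mark the squares covered by the piece as occupied. */
    return board | piece;
}
```

Whether one form is faster than the other depends on the compiler and the optimization level; comparing the generated assembly (for example with -S) is the only reliable way to tell.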

When you are coding a backend, it hardly matters whether you write XX_FOO (y) or xx.foo (y), what's paramount is that you are able to express what you want to express and get your job done w.r.t. target features, target code performance, etc.

Very true. As far as I understand, in avr-gcc there are some difficult-to-change design limitations with regard to what can be expressed. I have already encountered the inability to preserve a value in r0 until some code really needs the register, and the inability to precisely track which registers need to be saved and restored in a function, i.e. PR20296.

I am not claiming that LLVM is better, but given that it is different, maybe these particular problems can be solved.

So far I found the generated code surprisingly good. I feared that GCC would target a ‘virtual machine’ with 32-bit registers, but that does

GCC targets a target, and the description should match the real hardware as close as it can :-)

I think that GCC (or any compiler for that matter) targets something that resides above the bare metal. There are layers of constraints and assumptions in the form of ABI (mainly calling conventions) and run-time library. Each of these layers (including a possible operating system) could also be thought of as lightweight virtual machine residing above the previous layer. The bare metal would be the lowest layer.

My only complaint is that avr-gcc does not allow me to assign a pointer to the Z register even when my simple program does not need that register for anything else:

register const __flash char* usart_tx_next asm("r28"); // better: r30:31

This is not the Z pointer, R28 is the Y register, which in turn might be
the frame pointer.  Even if avr-gcc allowed to reserve it globally, you
would get non-functional code.  Same with reserving Z.

I did get properly working code for the above with -O3 and -Os, but admittedly, maybe it would not work in a bigger program where some function call is not inlined. If I used "r30" instead, the program would refuse to compile.

__flash will try to access via Z, and if you take that register away by fixing it then the compiler will no longer be able to use Z for its job.

My very reason for attempting to reserve Z for this pointer was that it is the only __flash pointer in the program.

My strong impression is that you are inventing hacks to push the compiler into generating the exact same sequence as you would write down as a smart assembler programmer. Don't do it. You will come up with code that is cluttered up with hard-to-maintain kludges or it might even be non-functional (as with globally reserving registers indispensable to the compiler).

I admit I am trying to find and push the limits with these experiments. I would not use these tricks in a program that is intended to be portable.

You could write that ISR and avoid push / pop SREG by using CPSE, but
that needs asm, of course. Cf. PR20296.  Everybody familiar with avr
is aware of this, but also aware that it will be quite some work to
come up with optimal code.  The general recommendation is to use
assembler if you need specific instruction sequences.

Yes, it seems that small interrupt handlers are indeed better written in assembler. However, given that avr-gcc does not let me reserve the Z register pair, I would still have to save and restore Z in the assembler code so that it can use the LPM instruction.

Even if the compiler generated code that's close to optimal, it would be very hard to force CPSE and block any other comparison instructions provided respective code exists.

Right. It would be an additional constraint for the compiler to try to avoid generating instructions that affect SREG. It is possible but tricky, and in the end some code might end up forcing the SREG to be saved and restored anyway.

In a small embedded system with at most some kilobytes to hundreds of kilobytes of instruction space, I think that it might be worthwhile to compile the whole program at once, instead of linking separately compiled compilation units together. This would allow additional whole-program optimizations and warnings, such as detecting possible stack overflow.

On the 6502, where the ALU instructions can work directly with memory operands and where the stack pointer is only 8 bits, it would be beneficial to statically allocate RAM locations to the local variables in non-recursive procedure calls, to save the precious stack address space. This is only possible with whole-program optimization.
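The static allocation idea can be approximated by hand in C by declaring the locals of a non-recursive function as static. The following is only a sketch with hypothetical function names; it shows what a whole-program optimizer would have to prove before doing the same transformation automatically, namely that the function is never active twice at the same time (no recursion, no call from an interrupt handler while it is running).

```c
#include <assert.h>
#include <stdint.h>

/* Conventional version: 'sum' and 'i' are automatic variables, so they
   occupy registers or stack space for the duration of the call. */
uint8_t checksum_auto(const uint8_t *buf, uint8_t len)
{
    uint8_t sum = 0;
    for (uint8_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + buf[i]);
    return sum;
}

/* Statically allocated version: 'sum' and 'i' live at fixed RAM
   addresses and take no stack space. This is correct only because the
   function is neither recursive nor reentrant. */
uint8_t checksum_static(const uint8_t *buf, uint8_t len)
{
    static uint8_t sum;
    static uint8_t i;

    sum = 0;
    for (i = 0; i < len; i++)
        sum = (uint8_t)(sum + buf[i]);
    return sum;
}
```

With the whole call graph known, a compiler could additionally overlay the static locals of functions that are never live at the same time, which a manual rewrite like this one cannot express in portable C.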

On the AVR, maybe a whole-program optimization could assign the most commonly used global or static variables to registers.
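Done by hand, such an assignment would use a GCC global register variable. The sketch below is an illustration under assumptions, not a recommendation: the variable and function names are hypothetical, the choice of r3 is an assumption (the avr-libc FAQ discusses which registers are typically safe to bind), and every compilation unit and library routine in the program would have to leave that register alone. On anything other than an AVR target the sketch falls back to an ordinary static variable, so that the logic can be checked on a host.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__AVR__)
/* GCC global register variable: the counter lives permanently in r3.
   Assumption: no other code in the program uses r3. A register
   variable cannot have an initializer, hence pending_reset(). */
register uint8_t pending asm("r3");
#else
/* Host fallback: an ordinary variable in RAM. */
static uint8_t pending;
#endif

/* Must be called once at startup, because the register is not
   initialized by the startup code. */
void pending_reset(void)
{
    pending = 0;
}

/* Record that n more items are waiting. */
void pending_add(uint8_t n)
{
    pending = (uint8_t)(pending + n);
}

/* Return the number of waiting items and clear the counter. */
uint8_t pending_take(void)
{
    uint8_t n = pending;
    pending = 0;
    return n;
}
```

A whole-program optimizer would be in a better position than the programmer here, because it could verify that the chosen register is really unused everywhere else instead of relying on convention.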

        Marko


