qemu-ppc
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coh


From: Nicholas Piggin
Subject: Re: [PATCH] util: optimise flush_idcache_range when the ppc host has coherent icache
Date: Fri, 20 May 2022 10:04:04 +1000

Excerpts from Richard Henderson's message of May 20, 2022 4:31 am:
> On 5/19/22 07:11, Nicholas Piggin wrote:
>> dcache writeback and icache invalidate is not required when icache is
>> coherent, a shorter fixed-length sequence can be used which just has to
>> flush and re-fetch instructions that were in-flight.
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> 
>> I haven't been able to measure a significant performance difference
>> with this, qemu isn't flushing large ranges frequently so the old sequence
>> is not that slow.
> 
> Yeah, we should be flushing smallish regions (< 1-4k), as we generate 
> TranslationBlocks. 
> And hopefully the translation cache is large enough that we spend more time 
> executing 
> blocks than re-compiling them.  ;-)
> 
> 
>> +++ b/include/qemu/cacheflush.h
>> @@ -28,6 +28,10 @@ static inline void flush_idcache_range(uintptr_t rx, 
>> uintptr_t rw, size_t len)
>>   
>>   #else
>>   
>> +#if defined(__powerpc__)
>> +extern bool have_coherent_icache;
>> +#endif
> 
> Ug.  I'm undecided where to put this.  I'm tempted to say...
> 
>> --- a/util/cacheflush.c
>> +++ b/util/cacheflush.c
>> @@ -108,7 +108,16 @@ void flush_idcache_range(uintptr_t rx, uintptr_t rw, 
>> size_t len)
> 
> ... here in cacheflush.c, with a comment that the variable is defined and 
> initialized in 
> cacheinfo.c.
> 
> I'm even more tempted to merge the two files to put all of the 
> machine-specific cache data 
> in the same place, then this variable can be static.  There's even an 
> existing TODO 
> comment in cacheflush.c for aarch64.

That might be nice. Do you want me to look at doing that first?

>>       b = rw & ~(dsize - 1);
>> +
>> +    if (have_coherent_icache) {
>> +        asm volatile ("sync" : : : "memory");
>> +        asm volatile ("icbi 0,%0" : : "r"(b) : "memory");
>> +        asm volatile ("isync" : : : "memory");
>> +        return;
>> +    }
> 
> Where can I find definitive rules on this?

In processor manuals (I don't know if there are any notes about this in 
the ISA, I would be tempted to say there should be since many processors
implement it).

POWER9 UM, 4.6.2.2 Instruction Cache Block Invalidate (icbi) 

https://ibm.ent.box.com/s/tmklq90ze7aj8f4n32er1mu3sy9u8k3k

> Note that rx may not equal rw, and that we've got two virtual mappings for 
> the same 
> memory, one for "data" that is read-write and one for "execute" that is 
> read-execute. 
> (This split is enabled only for --enable-debug-tcg builds on linux, to make 
> sure we don't 
> regress apple m1, which requires the split all of the time.)
> 
> In particular, you're flushing one icache line with the dcache address, and 
> that you're 
> not flushing any of the other lines.  Is the coherent icache thing really 
> that we may 
> simply skip the dcache flush step, but must still flush all of the icache 
> lines?

Yeah it's just a funny sequence the processor implements. It treats icbi 
almost as a no-op except that it sets a flag such that the next isync 
will flush and refetch the pipeline. It doesn't do any cache flushing.

> Without docs, "icache snoop" to me would imply that we only need the two 
> barriers and no 
> flushes at all, just to make sure all memory writes complete before any new 
> instructions 
> are executed.  This would be like the two AArch64 bits, IDC and DIC, which 
> indicate that 
> the two caches are coherent to Point of Unification, which leaves us with 
> just the 
> Instruction Sequence Barrier at the end of the function.
> 
> 
>> +bool have_coherent_icache = false;
> 
> scripts/checkpatch.pl should complain this is initialized to 0.
> 
> 
>>   static void arch_cache_info(int *isize, int *dsize)
>>   {
>> +#  ifdef PPC_FEATURE_ICACHE_SNOOP
>> +    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
>> +#  endif
>> +
>>       if (*isize == 0) {
>>           *isize = qemu_getauxval(AT_ICACHEBSIZE);
>>       }
>>       if (*dsize == 0) {
>>           *dsize = qemu_getauxval(AT_DCACHEBSIZE);
>>       }
>> +
>> +#  ifdef PPC_FEATURE_ICACHE_SNOOP
>> +    have_coherent_icache = (hwcap & PPC_FEATURE_ICACHE_SNOOP) != 0;
>> +#  endif
> 
> Better with only one ifdef, moving this second hunk up.

Will clean those bits up, thanks.

> It would be nice if there were some kernel documentation for this...

arm64 has kernel docs for hwcaps... powerpc probably should as well.
Good point, I might do a patch for that.

Thanks,
Nick



reply via email to

[Prev in Thread] Current Thread [Next in Thread]