From: David Kuehling
Subject: Re: [gforth] Performance anomality with dynamic superinstructions on MIPSel
Date: Mon, 24 Mar 2014 04:36:42 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

>>>>> "Bernd" == Bernd Paysan <address@hidden> writes:

> I've looked at what ARM and x86_64 GCC do, and they also move in some
> stuff, x86_64 less, ARM more.  It's not as bad as your case (with an
> emulated function), but it's still stuff.  asm __volatile__ ("": :
> :"memory") doesn't prevent it.  Neither does calling a dummy function.

> What did the trick?  Using FIRST_NEXT actually in after_last:, this is
> a dummy for getting the tail of the last address, we can put anything
> we like there.  Doing FIRST_NEXT there makes it a noop, and since
> there's nothing to move into the goto, it stays as small as it should.

> On the Core i7, I see no difference (the two leas and the one write
> are swallowed by the sheer power of the Core i7), but on my Galaxy
> Note II, this gives a very clear and significant speedup:

>  0.575 0.710 0.365 0.750 0.390 2014-03-24; Exynos 4 Quad 1.6GHz; gcc-4.8.x (Android 4.3)
>  0.735 0.920 0.900 1.110 0.690 2012-10-31; Exynos 4 Quad 1.6GHz; gcc-4.6.x (Android 4.1.1)

It's getting better now: onebench reports speedups for every test except
sieve when using --dynamic (git d930f495dbe357fe06c4c2):

  gforth-fast --dynamic ~/forth/gforth/onebench.fs 
   sieve bubble matrix   fib   fft
   1.808  1.532  0.876 2.016 1.412

vs.

  gforth-fast ~/forth/gforth/onebench.fs
   sieve bubble matrix   fib   fft
   1.380  2.004  1.672 2.180 1.760
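To put numbers on the two runs, here is a small Python sketch that divides the time without --dynamic by the time with it, so a ratio above 1 means --dynamic is faster. The times are copied from the two onebench outputs; the variable names are only for illustration.

```python
# onebench times in seconds, copied from the two runs in this mail
names = ["sieve", "bubble", "matrix", "fib", "fft"]
with_dynamic = [1.808, 1.532, 0.876, 2.016, 1.412]
without_dynamic = [1.380, 2.004, 1.672, 2.180, 1.760]

# ratio > 1: --dynamic is faster; ratio < 1: --dynamic is slower
speedup = {
    name: plain / dyn
    for name, plain, dyn in zip(names, without_dynamic, with_dynamic)
}

for name, ratio in speedup.items():
    print(f"{name:>6}: {ratio:.2f}x")
```

By this measure matrix gains the most (about 1.9x), while sieve is the only benchmark that drops below 1 (about 0.76x).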

The code for non-primitive calls looks correct now:

  $2BBA8C40 call
  $2BBA8C44 <test1> 
  ( $2BFDE768 ) 3 16 0 addu,
  ( $2BFDE76C ) 16 0 16 lw,
  ( $2BFDE770 ) 2 18 0 addu,
  ( $2BFDE774 ) 3 3 4 addiu,
  ( $2BFDE778 ) 18 18 -4 addiu,
  ( $2BFDE77C ) 16 16 4 addiu,
  ( $2BFDE780 ) 3 -4 2 sw,
  ( $2BFDE784 ) 2 -4 16 lw,
  ( $2BFDE788 ) 3 2 0 addu,
  ( $2BFDE78C ) 3 jr,
  ( $2BFDE790 ) 1 1 0 or,
  $2BBA8C48 ;s ok


Something strange is happening for sieve.fs.  According to see-code, it
holds a lot of inlined copies of NOOP.  That could be a problem with
see-code, though: the code looks legitimate, and most of the NOOPs look
like literals, R@ or DUP.

  gforth-fast --dynamic ~/forth/gforth/sieve.fs

  see-code primes 
  $2B09CA80 noop
  $2B09CA84 <FLAGS> 
  ( $2B4D2314 ) 21 0 17 sw,
  ( $2B4D2318 ) 17 17 -4 addiu,
  ( $2B4D231C ) 21 0 16 lw,
  ( $2B4D2320 ) 16 16 8 addiu,
  $2B09CA88 noop
  $2B09CA8C <8190> 
  ( $2B4D2324 ) 21 0 17 sw,
  ( $2B4D2328 ) 17 17 -4 addiu,
  ( $2B4D232C ) 21 0 16 lw,
  ( $2B4D2330 ) 16 16 8 addiu,
  $2B09CA90 noop
  $2B09CA94 <1> 
  $2B09CA98 fill
  ( $2B4D2334 ) 21 0 17 sw,
  ( $2B4D2338 ) 17 17 -4 addiu,
  ( $2B4D233C ) 21 0 16 lw,
  ( $2B4D2340 ) 16 16 8 addiu,
  ( $2B4D2344 ) 2 -4 16 lw,
  ( $2B4D2348 ) 3 2 0 addu,
  ( $2B4D234C ) 3 jr,
  ( $2B4D2350 ) 1 1 0 or,
  $2B09CA9C noop
  $2B09CAA0 <0> 
  ( $2B4D2354 ) 21 0 17 sw,
  ( $2B4D2358 ) 17 17 -4 addiu,
  ( $2B4D235C ) 21 0 16 lw,
  ( $2B4D2360 ) 16 16 8 addiu,
  $2B09CAA4 noop
  $2B09CAA8 <3> 
  ( $2B4D2364 ) 21 0 17 sw,
  ( $2B4D2368 ) 17 17 -4 addiu,
  ( $2B4D236C ) 21 0 16 lw,
  ( $2B4D2370 ) 16 16 8 addiu,
  $2B09CAAC noop
  $2B09CAB0 <722061894> 
  ( $2B4D2374 ) 21 0 17 sw,
  ( $2B4D2378 ) 17 17 -4 addiu,
  ( $2B4D237C ) 21 0 16 lw,
  ( $2B4D2380 ) 16 16 8 addiu,
  $2B09CAB4 noop
  $2B09CAB8 <FLAGS> 
  ( $2B4D2384 ) 21 0 17 sw,
  ( $2B4D2388 ) 17 17 -4 addiu,
  ( $2B4D238C ) 21 0 16 lw,
  ( $2B4D2390 ) 16 16 8 addiu,
  $2B09CABC (do)
  ( $2B4D2394 ) 2 4 17 lw,
  ( $2B4D2398 ) 21 -8 18 sw,
  ( $2B4D239C ) 2 -4 18 sw,
  ( $2B4D23A0 ) 21 8 17 lw,
  ( $2B4D23A4 ) 18 18 -8 addiu,
  ( $2B4D23A8 ) 17 17 8 addiu,
  ( $2B4D23AC ) 16 16 4 addiu,
  $2B09CAC0 noop
  ( $2B4D23B0 ) 21 0 17 sw,
  ( $2B4D23B4 ) 17 17 -4 addiu,
  ( $2B4D23B8 ) 21 0 18 lw,
  ( $2B4D23BC ) 16 16 4 addiu,
  $2B09CAC4 c@
  ( $2B4D23C0 ) 21 0 21 lbu,
  ( $2B4D23C4 ) 16 16 4 addiu,
  $2B09CAC8 ?branch
  $2B09CACC <722062140> 
  ( $2B4D23C8 ) 3 17 0 addu,
  ( $2B4D23CC ) 2 0 16 lw,
  ( $2B4D23D0 ) 21 0 28 bne,
  ( $2B4D23D4 ) 17 17 4 addiu,
  ( $2B4D23D8 ) 16 2 4 addiu,
  ( $2B4D23DC ) 2 -4 16 lw,
  ( $2B4D23E0 ) 21 4 3 lw,
  ( $2B4D23E4 ) 3 2 0 addu,
  ( $2B4D23E8 ) 3 jr,
  ( $2B4D23EC ) 1 1 0 or,
  ( $2B4D23F0 ) 21 4 3 lw,
  ( $2B4D23F4 ) 16 16 8 addiu,
  $2B09CAD0 noop
  ( $2B4D23F8 ) 21 0 17 sw,
  ( $2B4D23FC ) 17 17 -4 addiu,
  ( $2B4D2400 ) 21 4 17 lw,
  ( $2B4D2404 ) 16 16 4 addiu,
  $2B09CAD4 noop
  ( $2B4D2408 ) 21 0 17 sw,
  ( $2B4D240C ) 17 17 -4 addiu,
  ( $2B4D2410 ) 21 0 18 lw,
  ( $2B4D2414 ) 16 16 4 addiu,
  $2B09CAD8 +
  ( $2B4D2418 ) 2 4 17 lw,
  ( $2B4D241C ) 16 16 4 addiu,
  ( $2B4D2420 ) 17 17 4 addiu,
  ( $2B4D2424 ) 21 2 21 addu,
  $2B09CADC noop
  ( $2B4D2428 ) 21 0 17 sw,
  ( $2B4D242C ) 17 17 -4 addiu,
  ( $2B4D2430 ) 21 4 17 lw,
  ( $2B4D2434 ) 16 16 4 addiu,
  $2B09CAE0 noop
  $2B09CAE4 <722061894> 
  ( $2B4D2438 ) 21 0 17 sw,
  ( $2B4D243C ) 17 17 -4 addiu,
  ( $2B4D2440 ) 21 0 16 lw,
  ( $2B4D2444 ) 16 16 8 addiu,
  $2B09CAE8 <
  ( $2B4D2448 ) 2 4 17 lw,
  ( $2B4D244C ) 16 16 4 addiu,
  ( $2B4D2450 ) 21 2 21 slt,
  ( $2B4D2454 ) 17 17 4 addiu,
  ( $2B4D2458 ) 21 0 21 subu,
  $2B09CAEC ?branch
  $2B09CAF0 <722062124> 
  ( $2B4D245C ) 3 17 0 addu,
  ( $2B4D2460 ) 2 0 16 lw,
  ( $2B4D2464 ) 21 0 28 bne,
  ( $2B4D2468 ) 17 17 4 addiu,
  ( $2B4D246C ) 16 2 4 addiu,
  ( $2B4D2470 ) 2 -4 16 lw,
  ( $2B4D2474 ) 21 4 3 lw,
  ( $2B4D2478 ) 3 2 0 addu,
  ( $2B4D247C ) 3 jr,
  ( $2B4D2480 ) 1 1 0 or,
  ( $2B4D2484 ) 21 4 3 lw,
  ( $2B4D2488 ) 16 16 8 addiu,
  $2B09CAF4 noop
  $2B09CAF8 <722061894> 
  ( $2B4D248C ) 21 0 17 sw,
  ( $2B4D2490 ) 17 17 -4 addiu,
  ( $2B4D2494 ) 21 0 16 lw,
  ( $2B4D2498 ) 16 16 8 addiu,
  $2B09CAFC swap
  ( $2B4D249C ) 3 4 17 lw,
  ( $2B4D24A0 ) 16 16 4 addiu,
  ( $2B4D24A4 ) 21 4 17 sw,
  ( $2B4D24A8 ) 21 3 0 addu,
  $2B09CB00 (do)
  ( $2B4D24AC ) 2 4 17 lw,
  ( $2B4D24B0 ) 21 -8 18 sw,
  ( $2B4D24B4 ) 2 -4 18 sw,
  ( $2B4D24B8 ) 21 8 17 lw,
  ( $2B4D24BC ) 18 18 -8 addiu,
  ( $2B4D24C0 ) 17 17 8 addiu,
  ( $2B4D24C4 ) 16 16 4 addiu,
  $2B09CB04 noop
  $2B09CB08 <0> 
  ( $2B4D24C8 ) 21 0 17 sw,
  ( $2B4D24CC ) 17 17 -4 addiu,
  ( $2B4D24D0 ) 21 0 16 lw,
  ( $2B4D24D4 ) 16 16 8 addiu,
  $2B09CB0C noop
  ( $2B4D24D8 ) 21 0 17 sw,
  ( $2B4D24DC ) 17 17 -4 addiu,
  ( $2B4D24E0 ) 21 0 18 lw,
  ( $2B4D24E4 ) 16 16 4 addiu,
  $2B09CB10 c!
  ( $2B4D24E8 ) 3 4 17 lbu,
  ( $2B4D24EC ) 2 17 0 addu,
  ( $2B4D24F0 ) 3 0 21 sb,
  ( $2B4D24F4 ) 21 8 2 lw,
  ( $2B4D24F8 ) 17 17 8 addiu,
  ( $2B4D24FC ) 16 16 4 addiu,
  $2B09CB14 noop
  ( $2B4D2500 ) 21 0 17 sw,
  ( $2B4D2504 ) 17 17 -4 addiu,
  ( $2B4D2508 ) 21 4 17 lw,
  ( $2B4D250C ) 16 16 4 addiu,
  $2B09CB18 (+loop)
  $2B09CB1C <722062084> 
  ( $2B4D2510 ) 3 0 18 lw,
  ( $2B4D2514 ) 5 4 18 lw,
  ( $2B4D2518 ) 4 17 0 addu,
  ( $2B4D251C ) 5 3 5 subu,
  ( $2B4D2520 ) 6 5 21 addu,
  ( $2B4D2524 ) 6 6 5 xor,
  ( $2B4D2528 ) 5 5 21 xor,
  ( $2B4D252C ) 5 6 5 and,
  ( $2B4D2530 ) 7 0 16 lw,
  ( $2B4D2534 ) 17 17 4 addiu,
  ( $2B4D2538 ) 5 32 bltz,
  ( $2B4D253C ) 3 3 21 addu,
  ( $2B4D2540 ) 3 0 18 sw,
  ( $2B4D2544 ) 16 7 4 addiu,
  ( $2B4D2548 ) 2 -4 16 lw,
  ( $2B4D254C ) 21 4 4 lw,
  ( $2B4D2550 ) 3 2 0 addu,
  ( $2B4D2554 ) 3 jr,
  ( $2B4D2558 ) 1 1 0 or,
  ( $2B4D255C ) 3 0 18 sw,
  ( $2B4D2560 ) 21 4 4 lw,
  ( $2B4D2564 ) 16 16 8 addiu,
  $2B09CB20 unloop
  ( $2B4D2568 ) 18 18 8 addiu,
  ( $2B4D256C ) 16 16 4 addiu,
  $2B09CB24 branch
  $2B09CB28 <722062128> 
  ( $2B4D2570 ) 16 0 16 lw,
  ( $2B4D2574 ) 16 16 4 addiu,
  ( $2B4D2578 ) 2 -4 16 lw,
  ( $2B4D257C ) 3 2 0 addu,
  ( $2B4D2580 ) 3 jr,
  ( $2B4D2584 ) 1 1 0 or,
  $2B09CB2C drop
  ( $2B4D2588 ) 21 4 17 lw,
  ( $2B4D258C ) 16 16 4 addiu,
  ( $2B4D2590 ) 17 17 4 addiu,
  $2B09CB30 swap
  ( $2B4D2594 ) 3 4 17 lw,
  ( $2B4D2598 ) 16 16 4 addiu,
  ( $2B4D259C ) 21 4 17 sw,
  ( $2B4D25A0 ) 21 3 0 addu,
  $2B09CB34 1+
  ( $2B4D25A4 ) 16 16 4 addiu,
  ( $2B4D25A8 ) 21 21 1 addiu,
  $2B09CB38 swap
  ( $2B4D25AC ) 3 4 17 lw,
  ( $2B4D25B0 ) 16 16 4 addiu,
  ( $2B4D25B4 ) 21 4 17 sw,
  ( $2B4D25B8 ) 21 3 0 addu,
  $2B09CB3C lit
  $2B09CB40 <2> 
  $2B09CB44 <0> 
  ( $2B4D25BC ) 2 0 16 lw,
  ( $2B4D25C0 ) 16 16 12 addiu,
  ( $2B4D25C4 ) 21 2 21 addu,
  $2B09CB48 (loop)
  $2B09CB4C <722062016> 
  ( $2B4D25C8 ) 3 0 18 lw,
  ( $2B4D25CC ) 4 4 18 lw,
  ( $2B4D25D0 ) 3 3 1 addiu,
  ( $2B4D25D4 ) 3 4 28 beq,
  ( $2B4D25D8 ) 5 0 16 lw,
  ( $2B4D25DC ) 3 0 18 sw,
  ( $2B4D25E0 ) 16 5 4 addiu,
  ( $2B4D25E4 ) 2 -4 16 lw,
  ( $2B4D25E8 ) 3 2 0 addu,
  ( $2B4D25EC ) 3 jr,
  ( $2B4D25F0 ) 1 1 0 or,
  ( $2B4D25F4 ) 16 16 8 addiu,
  ( $2B4D25F8 ) 3 0 18 sw,
  $2B09CB50 unloop
  ( $2B4D25FC ) 18 18 8 addiu,
  ( $2B4D2600 ) 16 16 4 addiu,
  $2B09CB54 drop
  ( $2B4D2604 ) 21 4 17 lw,
  ( $2B4D2608 ) 16 16 4 addiu,
  ( $2B4D260C ) 17 17 4 addiu,
  $2B09CB58 ;s
  $2B09CB5C <538976288>  ok
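A note for reading this listing: the values see-code prints in angle brackets are decimal, while the addresses in the left column are hex. Converting a few of them suggests that the cells after ?branch, (+loop), branch and (loop) are branch targets inside this same listing. That reading is an assumption, not something see-code states. A small Python sketch (the helper name is made up for the example):

```python
def as_hex_address(cell: int) -> str:
    """Format a decimal cell value the way see-code prints addresses."""
    return f"${cell:08X}"

# (word, decimal cell printed after it), taken from the --dynamic listing
inline_cells = [
    ("?branch", 722062140),
    ("?branch", 722062124),
    ("(+loop)", 722062084),
    ("branch", 722062128),
    ("(loop)", 722062016),
]

for word, cell in inline_cells:
    print(f"{word:8} <{cell}> = {as_hex_address(cell)}")
```

The first ?branch cell, for example, comes out as $2B09CB3C, which is the address of the lit near the end of the listing, and the (loop) cell comes out as $2B09CAC0, the first noop after the outer (do).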

Without --dynamic the code looks better:

  see-code primes 
  $2B0A8A80 lit
  $2B0A8A84 <FLAGS> 
  $2B0A8A88 lit
  $2B0A8A8C <8190> 
  $2B0A8A90 lit
  $2B0A8A94 <1> 
  $2B0A8A98 fill
  $2B0A8A9C lit
  $2B0A8AA0 <0> 
  $2B0A8AA4 lit
  $2B0A8AA8 <3> 
  $2B0A8AAC lit
  $2B0A8AB0 <722111046> 
  $2B0A8AB4 lit
  $2B0A8AB8 <FLAGS> 
  $2B0A8ABC (do)
  $2B0A8AC0 i
  $2B0A8AC4 c@
  $2B0A8AC8 ?branch
  $2B0A8ACC <722111292> 
  $2B0A8AD0 dup
  $2B0A8AD4 i
  $2B0A8AD8 +
  $2B0A8ADC dup
  $2B0A8AE0 lit
  $2B0A8AE4 <722111046> 
  $2B0A8AE8 <
  $2B0A8AEC ?branch
  $2B0A8AF0 <722111276> 
  $2B0A8AF4 lit
  $2B0A8AF8 <722111046> 
  $2B0A8AFC swap
  $2B0A8B00 (do)
  $2B0A8B04 lit
  $2B0A8B08 <0> 
  $2B0A8B0C i
  $2B0A8B10 c!
  $2B0A8B14 dup
  $2B0A8B18 (+loop)
  $2B0A8B1C <722111236> 
  $2B0A8B20 unloop
  $2B0A8B24 branch
  $2B0A8B28 <722111280> 
  $2B0A8B2C drop
  $2B0A8B30 swap
  $2B0A8B34 1+
  $2B0A8B38 swap
  $2B0A8B3C <4229160> 
  $2B0A8B40 <2> 
  $2B0A8B44 <0> 
  $2B0A8B48 (loop)
  $2B0A8B4C <722111168> 
  $2B0A8B50 unloop
  $2B0A8B54 drop
  $2B0A8B58 ;s
  $2B0A8B5C <538976288>  ok
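Both listings cover the same threaded code at the same offsets (only the base address differs), so the entries labelled noop with --dynamic can be matched against the names printed without --dynamic. The Python sketch below does that for the 16 noop addresses. The address tables are transcribed by hand from the two listings, and the helper names are invented for the example.

```python
from collections import Counter

DYNAMIC_BASE = 0x2B09CA80   # first address of the --dynamic listing
PLAIN_BASE = 0x2B0A8A80     # first address of the listing without --dynamic

# addresses that see-code labels "noop" in the --dynamic listing
noop_addresses = [
    0x2B09CA80, 0x2B09CA88, 0x2B09CA90, 0x2B09CA9C, 0x2B09CAA4, 0x2B09CAAC,
    0x2B09CAB4, 0x2B09CAC0, 0x2B09CAD0, 0x2B09CAD4, 0x2B09CADC, 0x2B09CAE0,
    0x2B09CAF4, 0x2B09CB04, 0x2B09CB0C, 0x2B09CB14,
]

# names at the corresponding addresses in the listing without --dynamic
plain_names = {
    0x2B0A8A80: "lit", 0x2B0A8A88: "lit", 0x2B0A8A90: "lit", 0x2B0A8A9C: "lit",
    0x2B0A8AA4: "lit", 0x2B0A8AAC: "lit", 0x2B0A8AB4: "lit", 0x2B0A8AC0: "i",
    0x2B0A8AD0: "dup", 0x2B0A8AD4: "i", 0x2B0A8ADC: "dup", 0x2B0A8AE0: "lit",
    0x2B0A8AF4: "lit", 0x2B0A8B04: "lit", 0x2B0A8B0C: "i", 0x2B0A8B14: "dup",
}

def plain_name_for(noop_address: int) -> str:
    """Look up the name printed at the same offset without --dynamic."""
    offset = noop_address - DYNAMIC_BASE
    return plain_names[PLAIN_BASE + offset]

counts = Counter(plain_name_for(address) for address in noop_addresses)
for name, count in counts.items():
    print(name, count)
```

Every noop lines up with lit (10 times), i (3 times) or dup (3 times), which is consistent with see-code mislabelling these primitives rather than the code containing real NOOPs.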

It's difficult to explain the slowdown of sieve.fs with --dynamic: the
I-cache on the Loongson 2F is 64 KB, which should be sufficient to hold
a full copy of the benchmark code.  Do dynamic superinstructions disable
some of the peephole optimizations?
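As a rough check on the I-cache argument, the native code that --dynamic generated for primes can be sized from the addresses in the --dynamic listing. The sketch below assumes fixed 4-byte MIPS instructions and only covers primes itself, not the rest of the benchmark or the engine:

```python
FIRST_INSN = 0x2B4D2314      # first native instruction shown for primes
LAST_INSN = 0x2B4D260C       # last native instruction shown for primes
INSN_BYTES = 4               # assumption: fixed-width 32-bit MIPS instructions
ICACHE_BYTES = 64 * 1024     # I-cache size quoted in this mail

code_bytes = LAST_INSN + INSN_BYTES - FIRST_INSN
share = 100 * code_bytes / ICACHE_BYTES
print(f"{code_bytes} bytes, {share:.1f}% of the I-cache")
```

That comes to 764 bytes, so the copy of primes alone is nowhere near filling the cache.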

cheers,

David
-- 
GnuPG public key: http://dvdkhlng.users.sourceforge.net/dk2.gpg
Fingerprint: B63B 6AF2 4EEB F033 46F7  7F1D 935E 6F08 E457 205F
