Re: [PATCH v2 0/9] mips: Fill delay slots v3

lightning
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v2 0/9] mips: Fill delay slots v3

From:	Paulo César Pereira de Andrade
Subject:	Re: [PATCH v2 0/9] mips: Fill delay slots v3
Date:	Fri, 20 Jan 2023 16:38:09 -0300
Em sex., 20 de jan. de 2023 às 10:50, Paul Cercueil
<paul@crapouillou.net> escreveu:
>
> Hi Paulo,
>
> Le vendredi 20 janvier 2023 à 10:34 -0300, Paulo César Pereira de
> Andrade a écrit :
> > Em qua., 18 de jan. de 2023 às 10:27, Paulo César Pereira de Andrade
> > <paulo.cesar.pereira.de.andrade@gmail.com> escreveu:
> > >
> > > Em qua., 18 de jan. de 2023 às 09:42, Paul Cercueil
> > > <paul@crapouillou.net> escreveu:
> > > >
> > > > Hi Paulo,
> > >
> > >   Hi Paul,
> > >
> > > > Le dimanche 15 janvier 2023 à 09:54 -0300, Paulo César Pereira de
> > > > Andrade a écrit :
> > > > > Em dom., 15 de jan. de 2023 às 08:05, Paul Cercueil
> > > > > <paul@crapouillou.net> escreveu:
> > > > > >
> > > > > > Hi Paulo,
> > > > >
> > > > >   Hi Paul,
> > > > >
> > > > > > Le samedi 14 janvier 2023 à 22:42 -0300, Paulo César Pereira
> > > > > > de
> > > > > > Andrade
> > > > > > a écrit :
> > > > > > > Em sáb., 14 de jan. de 2023 às 12:11, Paul Cercueil
> > > > > > > <paul@crapouillou.net> escreveu:
> > > > > > > >
> > > > > > > > Hi Paulo,
> > > > > > >
> > > > > > >   Hi Paul,
> > > > > > >
> > > > > > >   Patches pushed as all tests pass.
> > > > > >
> > > > > > Great! ;)
> > > > > >
> > > > > > > > Here's the V3 of my patchset that attempts to fill branch
> > > > > > > > delay
> > > > > > > > slots on
> > > > > > > > MIPS.
> > > > > > > >
> > > > > > > > All tests do pass now, except the "float.nodata" check,
> > > > > > > > but
> > > > > > > > this
> > > > > > > > one
> > > > > > > > fails in master as well.
> > > > > > >
> > > > > > >   What environment are you using for testing? It works in
> > > > > > > the
> > > > > > > mips
> > > > > > > environments I am testing.
> > > > > > >
> > > > > > >   Please let me know the output of:
> > > > > > >
> > > > > > > $ cpp -dM </dev/null
> > > > > >
> > > > > > Just a mips64 toolchain built using buildroot.
> > > > > >
> > > > > > I believe you meant
> > > > > > cpp -dM -E - < /dev/null
> > > > >
> > > > >   It is a very recent gcc, and probably kernel. My guess is
> > > > > that the
> > > > > issue
> > > > > is caused by some kind of unaligned access. Is it real hardware
> > > > > or
> > > > > some
> > > > > emulator?
> > > > >
> > > > >   Can you please run it as:
> > > > >
> > > > > $ cd check
> > > > > $ make debug
> > > > > ...
> > > > > (gdb) r -d float.tst
> > > > >
> > > > > and see what if it is just somehow creating a bad immediate or
> > > > > if
> > > > > it is triggering some kind of fault, like misaligned or invalid
> > > > > instruction?
> > > >
> > > > This was under Qemu with my MIPS64 toolchain.
> > > >
> > > > However today as I try to reproduce it, all the checks do pass.
> > > > Maybe a
> > > > system update on my PC brought a fix (to qemu), or maybe it does
> > > > happen
> > > > only under some conditions.
> > > >
> > > > I'll debug it if I encounter it again.
> > > >
> > > > > > See attachment.
> > > > > >
> > > > > > > > I implemented your suggestion and it worked just fine.
> > > > > > > > There
> > > > > > > > was
> > > > > > > > one
> > > > > > > > last problem though, which caused the "clobber" check to
> > > > > > > > fail
> > > > > > > > once
> > > > > > > > again; code emitters that generated branches to the next
> > > > > > > > instruction (to conditionally execute some opcodes)
> > > > > > > > caused
> > > > > > > > problems, as
> > > > > > > > the next instruction was not detected as a jump target
> > > > > > > > (no
> > > > > > > > "jit_flag_patch" set). I solved this by just checking the
> > > > > > > > previous
> > > > > > > > node's code against a table of accepted values.
> > > > > > >
> > > > > > >   Not certain if I fully understand this. It also appears
> > > > > > > to be a
> > > > > > > somewhat
> > > > > > > fragile logic.
> > > > > > >
> > > > > > >   Everytime the delay slot is used, there should be a
> > > > > > > comment
> > > > > > > telling that instruction is in the delay slot, for example:
> > > > > > >
> > > > > > > static void
> > > > > > > _ltr_f(jit_state_t *_jit, jit_int32_t r0, jit_int32_t r1,
> > > > > > > jit_int32_t
> > > > > > > r2)
> > > > > > > {
> > > > > > >     jit_word_t        w;
> > > > > > >     C_OLT_S(r1, r2);
> > > > > > >     w = _jit->pc.w;
> > > > > > >     BC1T(0);
> > > > > > >     /* delay slot */
> > > > > > >     movi(r0, 1);
> > > > > > >     movi(r0, 0);
> > > > > > >     patch_at(w, _jit->pc.w);
> > > > > > > }
> > > > > >
> > > > > > There is a comment indeed, but the comment is gone as soon as
> > > > > > you
> > > > > > compile the code.
> > > > > >
> > > > > > If you have a 'ltr_f' followed by a 'beqr', the code will try
> > > > > > to
> > > > > > find a
> > > > > > delay slot for the 'beqr', and will detect that the 'movi(r0,
> > > > > > 0)'
> > > > > > (aka.
> > > > > > the last opcode generated by 'ltr_f') can safely be moved,
> > > > > > because
> > > > > > there is nothing that can tell us that the last movi is a
> > > > > > jump
> > > > > > target.
> > > > > >
> > > > > > So I just blacklist 'ltr_f' as well as others so that the
> > > > > > emitted
> > > > > > code
> > > > > > will always be correct.
> > > > >
> > > > >   I understand now.  The problem I see is that it is not very
> > > > > clear
> > > > > from
> > > > > the patch, and/or changes elsewhere could cause it to no longer
> > > > > be
> > > > > required, or have some missing check if new lightning
> > > > > instructions
> > > > > are added.
> > > > >
> > > > >   The problem is that can_swap_ds() does not check the pattern
> > > > > of a
> > > > > jump in 2 (for fpr comparison) or 3 (for compare-and-swap)
> > > > > instructions
> > > > > before the last machine instruction that it attempts to move.
> > > >
> > > > Yes, it only checks the last instruction. Going further than that
> > > > would
> > > > increase the complexity of the code, I think.
> > >
> > >   Could save the address of the last sequence of opcodes to
> > > implement a lightning instruction and do a simple disassembly
> > > and check if there are jumps inside that chunk, or to the next
> > > instruction after the chunk. Could be a bit more complex in
> > > code, but smaller in textual C code, as it would be probably
> > > be no longer than the switch for prev->code. Can look at _patch_at
> > > to see how it checks for jumps and the displacement.
> >
> >   I have a new regression.
> >   check/tramp.tst with the patch, and manually adding Lighting
> > instructions
> > generates at start:
> >
> >     jit_jmpi(main)
> > 0x3520968000:    j    0x3520968118
> > 0x3520968004:    sltiu    at,s0,2
> >    jit_tramp(64)
> > 0x3520968008:    bnez    at,0x3520968110
> > 0x352096800c:    nop
> >
> > and without the patch:
> >
> >     jit_jmpi(main)
> > 0x2d884cc000:    j    0x2d884cc12c
> > 0x2d884cc004:    nop
> >    jit_tramp(64)
> > 0x2d884cc008:    sltiu    at,s0,2
> > 0x2d884cc00c:    bnez    at,0x2d884cc120
> > 0x2d884cc010:    nop
> >
> > jit_tramp() calls jit_prolog(), but it does not generate code, just
> > assumes a prolog. The patch removes the delay slot of the jump
> > to the start of the code and moves the first instruction before
> > the implicit label.
> >
> >   Found it working on patchset to use a variable stack framesize
> > and other optimizations for leaf functions.
> >
> >   Temporary fix was to just make can_swap_ds always return 0.
> >
> >   Hopefully soon  will try to redesign the logic as described
> > earlier,
> > to make a quick disassembly.
> >
> >   First patch was:
> >
> > -    if (!prev)
> > +    /* FIXME this fixes tramp.tst and ctramp.c tests */
> > +    if (1||!prev)
> >
> > while writing this email, just validated that this also works:
> >
> >      switch (prev->code) {
> > +   case jit_code_prolog:
> >      case jit_code_ltr_f:
> >
> > or, could also special case like this:
> >
> >      switch (prev->code) {
> > +   case jit_code_prolog: if (jitc->function->assume_frame) return 0;
> >      case jit_code_ltr_f:
>
> Good catch.
>
> I think this solution would work best.

Now fixed following the idea of that suggestion:
https://git.savannah.gnu.org/cgit/lightning.git/commit/?id=b0f75ce6a6bcf9161c616e229c2f855d2d6d6b98

The idea was correct. The problem was that prev was not set
for jit_code_prolog.

> Cheers,
> -Paul
>
> > > > >   Currently there isn't a clear way to pass parameters to jit
> > > > > generation.
> > > > >   The closest is the jit_cpu_t for arm and x86. Maybe we could
> > > > > add a
> > > > > jit_cpu_t for mips, and have an option to make the swap with
> > > > > delay
> > > > > slot optional? This would make it easy to validate that if a
> > > > > bug is
> > > > > found
> > > > > in the future, it can be easily verified that it is not a side
> > > > > effect
> > > > > of the
> > > > > changes in these patches. That is, have a defined jit_cpu_t for
> > > > > mips
> > > > > in include/lightning/jit_mips.h and then in
> > > > > lib/jit_mips.c:jit_get_cpu()
> > > > > set it to a default value, and use the flag allowing or
> > > > > disallowing
> > > > > the
> > > > > patch during jit generation.
> > > >
> > > > That should be doable, yes. But what would set the value of the
> > > > flag?
> > > > Would it be a ./configure option?
> > >
> > >   Just declare a jit_cpu_t like done in include/lightning/jit_x86.h
> > > and include/lightning/jit_arm.h
> > >
> > >   It would have a single bit flag, that would be checked, for
> > > example:
> > >
> > >  static jit_bool_t _can_swap_ds(jit_state_t *_jit, jit_node_t
> > > *prev,
> > > ...
> > > -    if (!prev)
> > > +    if (!prev || !jit_cpu.can_swap_ds)
> > > ...
> > >
> > > can choose another name for the flag.
> > >
> > >   The usage of jit_cpu_t is not documented, and should really be
> > > used for cpu features, but is already used for testing, for
> > > example,
> > > disabling jit_cpu.sse2. It is only checked for i686, and
> > > dynamically
> > > switches to use x87 registers instead of xmm registers.
> > >
> > > > > > > so, you check the previous opcode for float comparison and
> > > > > > > cas{r,i}.
> > > > > > > Shouldn't it also check for b{o,x}{add,sub}{r,i} codes?
> > > > > > > These
> > > > > > > also
> > > > > > > end with a delay slot usage, and based on the patches and
> > > > > > > your
> > > > > > > description, not checking them might create very difficult
> > > > > > > to
> > > > > > > debug
> > > > > > > bugs in jit doing complex branch logic on carry/overflow.
> > > > > >
> > > > > > The b{o,x}{add,sub}{r,i} end up with a delay slot, and that's
> > > > > > perfectly
> > > > > > fine. The algorithm will detect that and won't try to move
> > > > > > the
> > > > > > opcode
> > > > > > (that's the has_delay_slot() check in can_swap_ds()).
> > > > > >
> > > > > > Cheers,
> > > > > > -Paul
> > > > >
> > > > > Thanks,
> > > > > Paulo
> > > >
> > > > Cheers,
> > > > -Paul
> > >
> > >   These are just suggestions. For the moment it should be good
> > > as is. Checking if any issue is related to this patch would be just
> > > rebuilding and always returning zero from can_swap_ds().
> > >
> > > Thanks,
> > > Paulo
> >
> > Thanks!
> > Paulo
>
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [PATCH v2 0/9] mips: Fill delay slots v3, (continued)
Prev by Date: Re: [PATCH v2 0/9] mips: Fill delay slots v3
Next by Date: [PATCH] mips: Optimize ldx* generators
Previous by thread: Re: [PATCH v2 0/9] mips: Fill delay slots v3
Next by thread: Rework of x86 function prolog and epilog
Index(es):
- Date
- Thread