Sorry for the slow reply - unfortunately I haven't had much chance to look into this in detail.
The things that ideally should happen:
- The DWARF unwinder should detect the "last" frame in the stack and stop there. IIRC this was supposed to be based on a null return address column in the DWARF info, and there should be an "if" in the generic DWARF parsing to detect this. I don't recall how (if at all) this was recorded through frame stashing and fast trace though, and didn't have time to look into it in detail.
- The frame stash should a) record the frame, and b) remember in some way that it's the last frame.
- The fast trace should stop when it reaches the end of the frame chain. I am not sure, but I don't think Arun's suggested check on RBP would be the right thing to do; that said, I didn't fully trace how its value would be tracked through the multi-condition "if". Maybe it's the right thing, I'm just not sure.
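To make the three points above a bit more concrete, here is a toy model in C. It is not libunwind code: `toy_frame_t`, `toy_lookup`, `toy_fast_trace` and the `TOY_*` codes are all names I made up for illustration, and the stash is just a flat array. It only shows the intended contract: the slow step marks the last frame when it stashes it, and the fast trace stops cleanly on that mark rather than falling off the fast path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a stashed frame record. */
typedef struct {
    uintptr_t ip;         /* instruction pointer identifying this frame */
    uintptr_t caller_ip;  /* return address recovered by the slow DWARF step */
    unsigned last_frame;  /* 1 if the slow step found no caller for this frame */
} toy_frame_t;

/* Hypothetical result codes for this sketch. */
enum { TOY_OK = 0, TOY_FALLBACK = -1 };

/* Find the stashed record for ip; a linear search keeps the sketch short. */
static const toy_frame_t *toy_lookup(const toy_frame_t *stash, size_t n,
                                     uintptr_t ip)
{
    for (size_t i = 0; i < n; i++)
        if (stash[i].ip == ip)
            return &stash[i];
    return NULL;
}

/*
 * Walk the chain using only stashed records. Returns TOY_OK when the walk
 * ends on a frame marked last_frame (or when the buffer is full), and
 * TOY_FALLBACK when the next ip has no stashed record, which is where a
 * real unwinder would have to drop to the slow path. *depth is the number
 * of frames written to out in either case.
 */
static int toy_fast_trace(const toy_frame_t *stash, size_t n,
                          uintptr_t start_ip, uintptr_t *out, size_t max,
                          size_t *depth)
{
    assert(out != NULL && depth != NULL);
    uintptr_t ip = start_ip;
    *depth = 0;
    while (*depth < max) {
        const toy_frame_t *f = toy_lookup(stash, n, ip);
        if (f == NULL)
            return TOY_FALLBACK;  /* unknown frame: fast path gives up */
        out[(*depth)++] = ip;
        if (f->last_frame)
            return TOY_OK;        /* end of chain detected, stop here */
        ip = f->caller_ip;
    }
    return TOY_OK;
}
```

With a three-frame stash whose outermost record has `last_frame` set, `toy_fast_trace` returns `TOY_OK` after three frames. If that same record is stashed without the mark, the walk records the same three frames and then returns `TOY_FALLBACK`, which is the "same result, only slower" symptom in miniature.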
The main thing I would look at, using full libunwind debug levels, is how the very first pass through the last frame is parsed and handled. First make sure it is correctly parsed and detected as the last frame in the chain, and if that's not the case, maybe look into why either the dwarf frame info is incorrect, or why the heuristics don't correctly detect the case. If and only if that detection is correct, figure out why the fast trace gets it wrong, and falls off the fast path.
We did run into several common enough cases where fast trace wouldn't detect the last frame correctly and fell back to the slow trace, which would just produce the same result, only slower. That was really annoying so you have my full sympathy :-) I tried to fix all the deficiencies we found, but certainly there can be more of them. I was hoping the Linux system libc would by now correctly annotate everything with DWARF; maybe it's just a matter of suitable configuration, compilation or linking flags somewhere?