From: Rocky Bernstein
Subject: source-code to xxx mappings (Was: Update 2 on Bytecode Offset tracking)
Date: Sat, 1 Aug 2020 01:47:55 -0400

Thanks for "Update2 on Bytecode Offset tracking), Zack!

With respect to the C patch (bug#42499), I updated the Savannah git branch feature/soc-bytecode-in-traceback-reduced to include the latest changes. This gives developers an alternative way to inspect and try out the code. As before, developers should feel free to add changes on top of it.

With respect to mapping the source information down to LAP, I think this is a close-enough demonstration to show that it could be done and roughly what's involved. There are still a number of details lying in the weeds.

The last missing piece of a complete solution, hooking this into the backtrace, is a small enough gap that developers can easily imagine and leap over it.

Of more concern now is the scalability and maintenance of the code.

So at this point, it is time to switch gears and look at the big and long-term picture far outside the 2020 Summer of Code.

Below I give my thoughts on some of the questions that have been raised.

1. Source-to-Elisp Maps

1a. Lisp Reader

The edebug-based Lisp reader is, I think, fine as it is in terms of speed for now. As is obvious, getting source information is optional: it is analogous to running gcc with or without -ggdb. Perhaps a better analogy, since we are talking about speed, is with higher levels of optimization like -O3. So even if this were 5 times slower (and it isn't), the process could be run on the GNU Emacs Elisp distribution just before the final release. The benefit would last several months or more, enough to justify a slowdown measured in minutes.

1b. Source-to-Elisp Map Data Structure

The information that is produced, although okay for a proof of concept, could be improved and made more complete.

As Stefan observed, storing strings in each struct is a bit wasteful and unnecessary. It really only needs to be stored at the top-level expression/form with some way in each child to get to the parent. Elsewhere I give motivation for storing the string along with buffer offsets.

Also nice to have in the parent would be the container that the function is found in, and likewise any other containers that appear in the subexpressions of the tree. (Recall that a "container" could be a buffer, a file, an archive member of some sort, or none if the function was built on the fly.)
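To make this concrete, here is a minimal sketch of what such a structure might look like, using cl-defstruct. The slot names are hypothetical, not the ones in Zach's code:

(require 'cl-lib)

(cl-defstruct source-map-node
  "One node of a source-to-elisp map (illustrative layout)."
  sexp         ; the Lisp form this node covers
  begin end    ; buffer offsets of the form within the top-level text
  parent       ; parent node, or nil at the top level
  ;; The remaining slots are populated only on the top-level node:
  source-text  ; the full source string of the top-level form
  container)   ; buffer, file, archive member, or nil if built on the fly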

Nice to have at the top level, though not strictly necessary, would be some sort of signature of the text. Since the full string is also at the top level, computing signatures (for comparison on both ends of the mapping) can be done on the fly. The cheapest kind of computed signature is simply the length of the string. At the other end of the computational effort is a SHA. Recent Python bytecode uses SipHash, which is somewhere in between in computation time.
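As a sketch of computing a signature on the fly, using only what is built into Emacs (Elisp has no SipHash primitive, so SHA-256 stands in for the expensive end; the function name is made up):

(defun source-map-signature (node &optional kind)
  "Compute a signature of NODE's source text on the fly.
KIND picks a point on the cost/strength spectrum."
  (let ((text (source-map-node-source-text node)))
    (pcase (or kind 'length)
      ('length (length text))            ; cheapest possible signature
      ('sha (secure-hash 'sha256 text))  ; most expensive, built into Emacs
      (_ (error "Unknown signature kind: %S" kind)))))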

If we are not at the top level and magically have a pointer to someplace inside a source-to-elisp tree structure, how do we look up the source text? There could be a pointer to the top in each non-top child. However, I think it better to have parent pointers. Often when looking at some source code where there is a problem, you want to see the bigger context, and the parent gives that. What about the speed of looking up the location? The number of nesting levels in a program is generally pretty small, say less than 15. Also, this information is needed only when there is a problem, which should be rare. In short, I don't think one needs to obsess over code-to-source lookup time.
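A sketch of that lookup, reusing the hypothetical slot names from the structure above:

(defun source-map-top-text (node)
  "Return the source text covering NODE by following parent pointers.
Nesting depth is small in practice, so this loop is cheap."
  (while (source-map-node-parent node)
    (setq node (source-map-node-parent node)))
  (source-map-node-source-text node))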

Another possibility is to assume that access must start at the top. 

Lastly, a small thing about a name used in the source-map structure: "code" is used to refer to the source text, a string. "code", like "object", is a bit ambiguous; code could refer to LAP code, bytecode, lambda expressions, or fully compiled code. My suggestion would be to call it "source-text".

1c. Source-to-Elisp Map Files

Just as there are bytecode files, Zach's project produces files with source-text-to-elisp mappings in them, and these can be independently produced and loaded into memory. As discussed above, there is string bloat in this file. It is also missing additional information that might be nice to have: some sort of optional signature for the container (file, buffer, ...) that it was built from.
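One plausible way to implement such files, though not necessarily what the project does, is to write the map as printed Lisp data and read it back; binding print-circle handles the parent pointers:

(defun source-map-write-file (map file)
  "Serialize MAP to FILE as printed Lisp data (hypothetical format)."
  (with-temp-file file
    (let ((print-circle t))  ; parent pointers make the structure circular
      (prin1 map (current-buffer)))))

(defun source-map-read-file (file)
  "Read a source-to-elisp map back from FILE."
  (with-temp-buffer
    (insert-file-contents file)
    (goto-char (point-min))
    (read (current-buffer))))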

2. Code-to-Source Maps

In "Update2", a source string was attached to the LAP instruction or LAP-instruction proxy. This was for expedience and proof of concept. In the long term what is wanted is an index into the Source-to-Elisp Map structure.

Therefore, in addition to the source-to-elisp map structure, there should be a way to map a bytecode offset into a position in the Source-to-Elisp Map file. If this is saved on disk, similar to the way Source-to-Elisp Map files are handled, the position would be a tree-node index number in the source-to-elisp tree. When that information is read in, the node index can be materialized into a true pointer into the source-to-elisp structure, if desired. Or, as stated above, access might always have to start from the top-level function.
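A sketch of that materialization step, assuming the nodes were numbered into a flat vector when the tree was written out (all names hypothetical):

;; On disk, the mapping might be an alist of (OFFSET . NODE-INDEX) pairs,
;; e.g. ((0 . 0) (5 . 3) (12 . 7)).
(defun source-map-materialize (offset-alist node-vector)
  "Replace node indexes in OFFSET-ALIST with pointers into NODE-VECTOR.
NODE-VECTOR holds the source-to-elisp tree's nodes in index order."
  (mapcar (lambda (pair)
            (cons (car pair) (aref node-vector (cdr pair))))
          offset-alist))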

Note that the overall structure described above is equally valid for other kinds of code, such as the great work that Andrea Corallo is doing.

For natively compiled code, you would carry the Source-to-Elisp map and the location indexes through to DWARF. A lot of the code for decoding back to a source position when there is an error can, I think, also be shared.

2a. Bytecode Compiler

Finally, we come to the thorny problem of dealing with the bytecode compiler. The code behind Update 2 is largely cut and paste, which is okay for prototyping but untenable in the longer term. Even as this work was being done, changes were being made to the bytecode compiler; see bug#41247.

Here is a possible solution based on Lisp's rich history of being able to introspect and modify code. Basically, a transformer function or macro could be written that takes a byte-compile function as input and changes it into something that works with the source-to-elisp structure instead.

Recall that the source-to-elisp structure can be thought of as a "fat cons" node, but it is implemented in Elisp as opposed to being a built-in datatype.

This transformer function might look for function calls that start with the prefix byte-code- and turn those calls into corresponding calls in the source-map-bytecode- space.
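A sketch of the renaming part of such a transformer, written as a plain tree walk over an (uncompiled) function body; this is purely illustrative and glosses over details like quoted forms and macro expansion:

(defun source-map--retarget (form)
  "Rewrite FORM so that symbols with the prefix `byte-code-' become
the corresponding symbols in the `source-map-bytecode-' space."
  (cond
   ((consp form)
    (cons (source-map--retarget (car form))
          (source-map--retarget (cdr form))))
   ((and (symbolp form)
         (string-prefix-p "byte-code-" (symbol-name form)))
    (intern (concat "source-map-bytecode-"
                    (substring (symbol-name form)
                               (length "byte-code-")))))
   (t form)))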

The cons-node functions could also be looked for and replaced with corresponding wrappers. For example, instead of car, source-map-car would be called. A source-map-car function just checks whether the object passed is a source-to-elisp structure. If it is, the source-map structure's sexp field is used; if not, the underlying car function is called.
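Such a wrapper might look like this, again using the hypothetical struct from section 1b:

(defun source-map-car (object)
  "Like `car', but transparent to source-to-elisp structures.
If OBJECT is a source-map node, take the car of its sexp field;
otherwise fall back to the underlying `car'."
  (if (source-map-node-p object)
      (car (source-map-node-sexp object))
    (car object)))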

A transformer function can also have specific transformations for specific functions. Or there may be specific functions that are just cut and pasted, as was done in Update 2. But this kind of activity would be done keeping in mind that the underlying functions may change, and with a story for how such changes can be tracked without too much trouble.


3. Alists, Plists, and Hashes, oh my

> Since arefs are pretty fast, I’m guessing the struct slot accesses wouldn’t slow down execution too much, but creating so many records certainly uses a lot of memory.

I don't think this needs to be worried about for a long time, if ever. Only when an error is hit does this information need to be read in, and that could be done via some autoload mechanism.

Presumably the number of errors is small enough that just about anything would do. And since progress basically comes to a halt anyway, it really doesn't matter if one method is a few seconds slower than another. That said, for very small bytecode, alists are probably the fastest. Since offsets are unique, plists would work too. And the larger the number of offsets in the bytecode, the more desirable hashes become.
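Whichever container wins, the lookup keyed by bytecode offset stays a one-liner; a toy comparison (names made up):

;; Alist: probably the fastest for very small bytecode.
(defvar my-offset-alist '((0 . node-a) (5 . node-b)))
(cdr (assq 5 my-offset-alist))   ; => node-b

;; Hash table: better as the number of offsets grows.
(defvar my-offset-table
  (let ((table (make-hash-table :test #'eq)))
    (puthash 0 'node-a table)
    (puthash 5 'node-b table)
    table))
(gethash 5 my-offset-table)      ; => node-b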

But as Donald Knuth said, "premature optimization is the root of all evil." I think it sufficient just to have a story or plan for how to speed such things up when the time comes.
