emacs-devel

Re: Eliminating etc/DOC - formatting multiline strings


From: Pip Cet
Subject: Re: Eliminating etc/DOC - formatting multiline strings
Date: Sat, 25 Jan 2025 09:05:08 +0000

"Stefan Kangas" <stefankangas@gmail.com> writes:

> I've been looking a little bit at how we could get rid of etc/DOC, as per
> this item in TODO:
>
>     ** Eliminate the etc/DOC file altogether
>     We could try and eliminate the DOC file altogether.  See
>     https://lists.gnu.org/r/emacs-devel/2021-05/msg00237.html

I'd interpret this TODO item as narrowly as possible for a first step:
let's get rid of etc/DOC and put its contents into the Emacs pdmp
instead.  It seems to be less than 1MB, and my understanding is that it's
only ever read.  (There is a quoting mechanism which modifies docstrings
after they are read, but it uses characters which make-docfile never
emits.)  That means most operating systems will share the memory between
running Emacs instances, and they can evict the pages entirely and
reread them from disk if required.

That's all I think we should do at first: read etc/DOC in temacs in
loadup.el, stringify it, place it in a variable, dump it.

For a second step, I'd like to suggest rewriting the docfile part of
make-docfile.c, incrementally, in Lisp.  We would still parse C files
with Emacs extensions into fragments which are then concatenated into a
string that looks like etc/DOC does now, but it would be more obvious
how to change things afterwards.
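
To make "a string that looks like etc/DOC does now" concrete, here is a
rough C sketch of finding a docstring in such a string by byte offset.
It assumes the layout I understand make-docfile to write today: each
entry starts with a ^_ (octal 037) byte, then F or V and the symbol
name, a newline, and then the docstring text.  The names doc_blob,
doc_offset and doc_length are made up for illustration and don't exist
in Emacs.

```c
#include <stddef.h>
#include <string.h>

/* A tiny DOC-like blob with two function entries (assumed layout:
   ^_ , kind letter, symbol name, newline, docstring text).  */
static const char doc_blob[] =
  "\037Fcar\n" "Return the car of LIST.\n"
  "\037Fcdr\n" "Return the cdr of LIST.\n";

/* Return the byte offset in BLOB of the docstring for symbol NAME of
   kind KIND ('F' or 'V'), or -1 if there is no such entry.  */
static ptrdiff_t
doc_offset (const char *blob, char kind, const char *name)
{
  size_t len = strlen (name);
  for (const char *p = blob; (p = strchr (p, '\037')) != NULL; p++)
    if (p[1] == kind
        && strncmp (p + 2, name, len) == 0
        && p[2 + len] == '\n')
      return (p + 2 + len + 1) - blob;
  return -1;
}

/* Length in bytes of the docstring starting at OFFSET: it runs up to
   the next ^_ byte, or to the end of the blob.  */
static size_t
doc_length (const char *blob, ptrdiff_t offset)
{
  const char *start = blob + offset;
  const char *end = strchr (start, '\037');
  return end ? (size_t) (end - start) : strlen (start);
}
```

Only the offset needs to be remembered per symbol; the text itself can
stay in one large read-only string, which is what makes the "dump it as
a single variable" approach cheap.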

I've reordered the rest of your post a little, to keep my thoughts in
order:

> I'm not sure what exactly are our goals here.  If it is just to get rid of
> the external file, and store the strings in a special memory segment,
> then we could change make-docfile.c to generate a .h file instead.  This
> looks doable to me.

I do not think make-docfile.c should generate a .h file containing
"binary" data.  #embed is just around the corner; we've gone for decades
without it, can't we sit out the last decade or so before it's
universally available :-) ?  I have a vague memory that embedding "raw"
data (unibyte? multibyte?) into C turned out to be more complicated than
expected more than once.

And I really don't think we need a *special* memory segment.  We already
have pdumper.  For builds without it, we might not want to pregenerate
the file at all.

> If, however, we _also_ want to get rid of the external file generation
> and make-docfile.c completely (keeping only the globals.h part),

You won't be surprised I've looked at (and modified) make-docfile.c, but
I haven't touched the "docfile" part at all.

The globals part is hard to work with.  To some extent that's because
the docfile part gets in the way, and the two are very different.

For a one-point-fifth step, we can split make-docfile into make-globals
and make-docfile.  The performance argument for reading each C file just
twice during a build (make-docfile + GCC) seems obsolete to me.

Once we have a stable and working Lisp version, we can run it in temacs
to generate the large DOC string, then dump it normally; for
!HAVE_PDUMPER, several solutions seem possible, some of which might also
help in special circumstances where we cannot afford to have a large
cold section in the .pdmp file.

I think by that point, which isn't that far away (make-docfile.c is
scary, but mostly because it's C written to run fast on impossibly small
machines, by today's standards), many things will have occurred to us
that could be improved:

Maybe we'd like to move to an Fread-able representation, or split the
string into several Lisp objects, or use string properties.  If it's
multibyte (it's perfectly okay to have byte offsets into a multibyte
string!), we can use the utf-8-emacs encoding to get some extra
characters, should we need them.  (There appears to be a quoting
mechanism in doc.c, but make-docfile.c doesn't generate it; I assume the
byte compiler does?)

The rest of your email, if I understand it correctly, is about multiline
strings in the context of DEFUN.  While I have opinions on that, I
believe that we might want to stop after separating the docfile part of
make-docfile from the make-globals part.  The latter needs to be
reasonably fast, because it's run when Emacs is rebuilt, before any
actual compilation happens.  The new make-docfile can run in parallel
with C compilation, if we want it to.

> just relying on C, I'm not sure about one aspect:

> How would we want to write docstrings given the lack of multiline
> strings in C?

I'm not saying moving to what C has to offer today is a bad idea, but it
seems to me it's not true that we have to choose between make-docfile.c
and plain C99.

If we conclude that C99 (or whatever version we decide to use) is almost
precisely what we need, and that using it costs us only a little bit of
comfort compared to a solution that might involve a similar amount of
changes but use our own syntax, we can do that.

Once we have a DEFUN/DEF* parser in Lisp, we can much more easily
experiment with different syntaxes, and find one which we like, rather
than settling for a solution we'll be unhappy with for decades.

Maybe by the time we're ready, C will have added proper multiline
strings (something like heredocs: #embed without switching to another
file) to the standard.

That we might have to emulate this new feature on systems with old C
preprocessors seems an acceptable compromise to me.  The important part,
to me, is that (for sad and well-known reasons), M-x c-mode is currently
unlikely to see the significant development required to edit a new
Emacs-specific docstring syntax comfortably.

> Thanks to using comments that make-docfile.c picks up later, we get away
> with emulating a multiline string in this way.  If we would like to use
> standard C strings, and eschew the external tool, AFAIU, we end up with

Again, I think we should wait.  C is a young programming language and
only just getting around to such advanced features as multiline strings
;-)

> 1.  DEFUN ("map-keymap-internal", Fmap_keymap_internal, \

I'm very confused by the final backslash here.  Was it intentional?

>         Smap_keymap_internal, 2, 2, 0,
>            "Call FUNCTION once for each event binding in KEYMAP.\n\
>     FUNCTION is called with two arguments: the event that is bound, and\n\
>     the definition it is bound to.  The event may be a character range.\n\
>     If KEYMAP has a parent, this function returns it without processing it.")

I find that very hard to read; I'm also not used to it.  I think it
would be even harder to edit.  Would M-q work at all?

> 2.  DEFUN ("map-keymap-internal", Fmap_keymap_internal, \
>         Smap_keymap_internal, 2, 2, 0,
>            "Call FUNCTION once for each event binding in KEYMAP.\n"
>     "FUNCTION is called with two arguments: the event that is bound, and\n"

I'm not sure which part of the whitespace is meant to be included in the
C program and which isn't.  Do we want to retain the convention that the
second and subsequent lines of docstrings start in the first column,
even if there is an extra quote to show us where they otherwise might?

I think we can safely assume 80 columns of text on most of today's
displays, and keeping interchangeable lines at the same indentation
level would make them easier to work with, using the rectangle commands,
for example.

>     "the definition it is bound to.  The event may be a character range.\n"
>     "If KEYMAP has a parent, this function returns it without processing it."

You stopped just where it was getting interesting!  Where do we put the
right parenthesis?  Many other languages think that lines of multiline
strings should be interchangeable, and that would mean using a "\n" on
the final line, too, and putting the right parenthesis on a line of its
own.  (We could share that line with the function arguments, if space
constraints seem important enough for that compaction.)
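
For reference, here is a small sketch of how the quoted styles behave in
standard C.  The variable names are made up for illustration.  With
backslash-newline continuation, the lines are spliced before the string
literal is tokenized, so any indentation on the following line ends up
inside the string.  With adjacent literals, the whitespace between them
is ignored and the compiler concatenates them.  The third definition
shows the "interchangeable lines" layout, with "\n" on every line and
the closing punctuation on a line of its own.

```c
#include <string.h>

/* Style 1: one literal, continued with backslash-newline.  The second
   line has to start in column 0, or its indentation becomes part of
   the string.  */
static const char doc_continued[] = "Call FUNCTION once for each event binding in KEYMAP.\n\
FUNCTION is called with two arguments.";

/* The same style with an indented continuation line: the four spaces
   are now part of the docstring.  */
static const char doc_indented[] = "Line one.\n\
    Line two.";

/* Style 2: adjacent literals, concatenated by the compiler.  The
   indentation between them is not part of the string.  */
static const char doc_adjacent[] =
  "Call FUNCTION once for each event binding in KEYMAP.\n"
  "FUNCTION is called with two arguments.";

/* Interchangeable lines: every line ends in "\n", and the terminator
   sits on its own line, so lines can be reordered or added freely.  */
static const char doc_interchangeable[] =
  "Call FUNCTION once for each event binding in KEYMAP.\n"
  "FUNCTION is called with two arguments.\n"
  ;

/* Return nonzero if styles 1 and 2 produce the same bytes.  */
static int
styles_agree (void)
{
  return strcmp (doc_continued, doc_adjacent) == 0;
}
```

Styles 1 and 2 produce identical strings here; the interchangeable
variant differs only in the trailing newline, which the docstring
reader would have to either expect or strip.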

> Maybe we just can't get rid of using an external tool, if we don't want
> docstrings to get quite cumbersome to write?  Or should we build special

I think making things even harder would be a very bad idea, and
restricting ourselves to what C99 has to offer seems like the last step
in evaluating this change to me.

> tooling (e.g. a minor mode) to hide this aspect and take care of this
> editing job transparently for us?  Won't that be brittle?  Or are we

Do we have the resources to do such significant work on both C modes, or
even just one of them?

Org mode has code to make some lines in a buffer appear to use another
mode.  It's not perfect, but its main disadvantage, IIRC, is that it
sometimes opens a new buffer for you to edit the docstring in.  I'm not
sure that would be a disadvantage in our case, as docstring-mode
(without any remnants of C in it) might be easier to write as a major
mode: we'd just teach C mode about the new "#if DOC" preprocessor
directive, steal some code from org mode to turn the next few lines into
a reference to a docstring-mode buffer, then wait for the #endif and go
back to ordinary C mode.

> perhaps happy to edit docstrings even if they look like they do above?
> Am I missing something obvious?  Is doing this a win even?  And so on.

I want to suggest very carefully that someone who knows more about
texinfo than I do (still haven't found a way to re-enable that warning)
might want to look at the differences between texinfo syntax and
docstring syntax, and at whether those differences can be reduced a
little.  If that seems worth doing, and the significant disadvantages of
this suggestion can be compensated for, we might want to eventually
decide to re-flow docstrings, making simple newlines equivalent to space
characters.
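
As a sketch of what such re-flowing could mean (this is an assumption
about the rule, not a worked-out proposal): a single newline between
two lines of text becomes a space, while a run of two or more newlines,
i.e. a paragraph break, is kept as it is, and so is the final newline.
The function name reflow_docstring is hypothetical.

```c
#include <stddef.h>

/* Copy SRC to DST, turning each single newline that is followed by
   more text into a space.  Runs of two or more newlines and a single
   newline at the very end are copied unchanged.  DST must have room
   for strlen (SRC) + 1 bytes.  */
static void
reflow_docstring (const char *src, char *dst)
{
  while (*src)
    {
      if (*src == '\n')
        {
          size_t run = 0;
          while (src[run] == '\n')
            run++;
          if (run == 1 && src[1] != '\0')
            *dst++ = ' ';
          else
            for (size_t i = 0; i < run; i++)
              *dst++ = '\n';
          src += run;
        }
      else
        *dst++ = *src++;
    }
  *dst = '\0';
}
```

With this rule, "a\nb\n\nc\n" comes out as "a b\n\nc\n".  Anything that
relies on line structure inside a docstring (tables, code examples)
would need some way to opt out, which is one of the disadvantages that
would have to be compensated for.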

However, all this needs to take into account Lisp docstrings, too.  Does
anyone know how much of the documentation (both interactive
documentation and info) overall is in C docstrings, Lisp docstrings,
texinfo, or possibly other formats?

> Thanks in advance for any comments.

I would like to mention that my understanding is that it's currently
hard to change the DEFUN syntax mostly because it involves editing
make-docfile.c.

Many improvements could be made to the DEFUN syntax, making it less
redundant (avoiding repetition) and more readable (using the characters
saved for structure initializer field names instead).

Seven arguments is too many, IMHO; if we start out from scratch, without
going too much into details, I would suggest:

DEFUN (Fconcat, MANY(), docstring)
  (ptrdiff_t nargs, Lisp_Object *args)

The MANY() has a pair of parentheses because this is a good place not
just to put optional arguments, but to actually put more things (like
static type checks) into the DEFUN rather than the C function.

The technical details of that are secondary: if it helps, think of the
part in the parentheses as generating a string which is then parsed when
the function is called from Lisp, at run time.  We can make it much
faster than that but that's not our primary concern here.

(Of course it would still be possible to define custom Lisp names,
because that's needed for Fplus.  If we really need it, we could keep
the externally-visible symbol, but I'd prefer always making it follow
the pattern SFconcat and making defsubr a macro which pastes together
such a name but is called as defsubr (Fconcat).  Please let's not get
too bogged down in such details; almost everything we might want is
possible with the C preprocessor.)
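
To show that the preprocessor really can do this, here is a minimal
sketch, not a proposal for the actual macros.  Everything in it is
hypothetical: struct subr_sketch, DEFUN_SKETCH, defsubr_sketch,
register_subr and the example function Fsum are invented for
illustration, and plain long values stand in for Lisp_Object.  The
argument spec is simply stringified, in the spirit of "think of it as a
string which is parsed at run time".

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for Emacs's subr
   structure.  */
struct subr_sketch
{
  const char *c_name;   /* C-level name, e.g. "Fsum" */
  const char *spec;     /* argument spec kept as a string, e.g. "MANY()" */
  const char *doc;      /* docstring */
  long (*fn) (ptrdiff_t nargs, const long *args);
};

/* Hypothetical three-argument DEFUN.  #fname and #spec stringify
   their arguments without expanding them; S##fname pastes the
   structure name together, so Fsum yields SFsum.  */
#define DEFUN_SKETCH(fname, spec, docstring)                    \
  static long fname (ptrdiff_t nargs, const long *args);        \
  static struct subr_sketch S##fname =                          \
    { #fname, #spec, docstring, fname };                        \
  static long fname

/* Tiny registry standing in for whatever defsubr really does.  */
static struct subr_sketch *registered[16];
static size_t registered_count;

static void
register_subr (struct subr_sketch *s)
{
  if (registered_count < sizeof registered / sizeof registered[0])
    registered[registered_count++] = s;
}

/* Called as defsubr_sketch (Fsum); the macro finds SFsum itself.  */
#define defsubr_sketch(fname) register_subr (&S##fname)

DEFUN_SKETCH (Fsum, MANY(), "Return the sum of ARGS.\n")
  (ptrdiff_t nargs, const long *args)
{
  long total = 0;
  for (ptrdiff_t i = 0; i < nargs; i++)
    total += args[i];
  return total;
}
```

After defsubr_sketch (Fsum), the registry holds a pointer to SFsum,
whose spec field is the string "MANY()".  A real version would of
course have to decide what goes into that spec and how cheaply it can
be checked.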

So, in summary, let's split make-docfile into a C part and a Lisp part,
then we can actually play around with different DEFUN syntaxes.

Pip



