Re: Warn on mid-input line sentence endings
From: Alejandro Colomar
Subject: Re: Warn on mid-input line sentence endings
Date: Sun, 30 Apr 2023 03:04:27 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0

Hi Branden,
On 4/30/23 02:05, G. Branden Robinson wrote:
> I should clarify a couple of points here since I was feeling grumpy when
> I wrote the following, and that made me forget things.
>
> At 2023-04-27T09:45:40-0500, G. Branden Robinson wrote:
>> We're re-covering some familiar ground here.
>>
>> I have a few points I'd like to make.
>>
>> 1. "Semantic newlines" is a terrible term.
>
> I should have said "_Warn on_ semantic newlines" is a terrible
> instruction/summary.
That's why I've been using the phrase "warn on S. N. violations"
(or at least I've tried to use it consistently in recent messages).
>
> They are what we _don't_ want to warn about upon encountering them.
>
> If man-pages(7) or other people continue to call the practice of
> breaking *roff input lines after sentence-ending punctuation "semantic
> newlines", I have no complaint. It could also be called "Kernighan
> breaking", in honor of an early popularizer of the practice.
You could use it for the warning name ;).
>
>> 2. Bjarni's comment '"groff" is not the right tool for such things,
>> but "grep" is.' is thoroughly wrong-headed and Ingo was right to
>> reject it with great force. Here are a few reasons why. I don't
>> think any of B through D are relevant to mandoc(1) since it
>> doesn't support the features in question (as far as I know).
>>
>> A. The formatter decides where sentence boundaries are based on
>> its input.
>>
>> B. Use of the `cflags` request can change the characters that
>> have sentence-ending semantics. grep(1) cannot know this.
>>
>> C. Sentence-ending characters are subject to character
>> translation (the `tr` request). grep(1) cannot know this.
>>
>> D. The user/document could define a special character that is a
>> sentence-ending character (with `char` and `cflags`). grep(1)
>> cannot know this.
>
> E. Because '.', '?', and '!' are valid characters in *roff
> identifiers, grep(1) can be fooled by special character, register,
> or string interpolations in the input if their identifiers use
> those characters.
>
> Example:
>
> I can't believe \*(I. ate the whole thing.
>
> It is only valid to detect the end of a sentence here if the (recursive)
> _expansion_ of the `I.` string ends with a sentence-ending punctuation
> character.
>
> Further, since string interpolations can result in further string
> interpolations, a finite-state automaton will not suffice to analyze
> this input. You need a stack machine. (IIRC, a stack machine
> recognizes "context-free" languages.)
>
> This is categorically not what regular expressions can cope with,
> formally.
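
To make the nesting in the quoted example concrete, here is a minimal
Python sketch of recursive string expansion. It is not how groff does
it: the string table, the names `expand` and `ends_sentence`, and the
treatment of undefined strings as empty are all assumptions made for
illustration, and only the two-character `\*(xx` form is handled.

```python
import re

# Hypothetical string definitions, standing in for what `.ds` requests
# in a real document might have set up.
STRINGS = {
    "I.": r"\*(Me",   # "I." expands to another interpolation
    "Me": "Branden",  # which finally expands to plain text
    "Q?": "why?",     # a string whose expansion does end a sentence
}

# Matches only the two-character interpolation form \*(xx.
INTERPOLATION = re.compile(r"\\\*\((..)")


def expand(text, strings, depth=0, max_depth=32):
    """Expand \\*(xx interpolations, following nested ones recursively."""
    if depth > max_depth:
        raise RecursionError("string interpolation nested too deeply")

    def substitute(match):
        # Assumption of this sketch: an undefined string expands to "".
        body = strings.get(match.group(1), "")
        return expand(body, strings, depth + 1, max_depth)

    return INTERPOLATION.sub(substitute, text)


def ends_sentence(name, strings):
    """True if the full expansion of string `name` ends in . ? or !"""
    expanded = expand(strings.get(name, ""), strings)
    return expanded.rstrip().endswith((".", "?", "!"))
```

With this table, `ends_sentence("I.", STRINGS)` is false even though
the identifier itself contains a dot, while `ends_sentence("Q?",
STRINGS)` is true. The answer depends on the expansion, not on the
characters in the name.
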
Well, formally yes. And a regex can't find C function definitions in a
source tree; at least if you try to fool it by writing the most horrible
code in the universe. But I wrote a relatively small script[1] that
finds a lot of C code with pcre2grep(1), and works most of the time. It
has limitations; some of which can be fixed by improving the regexes
(read: making them even more unreadable); some others are likely
impossible to fix with a regex. The biggest limitation I think I've met
is K&R-style functions: I don't think a regex can cope with them.

I believe a regex-based script can be good enough for some purposes,
even if it's not perfect.
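
As a rough sketch of what "good enough" might look like for the
sentence-ending case (this is not grepc, and not anything groff or
mandoc implements; the pattern and function names are made up for
illustration), a regex heuristic can flag many mid-line sentence
endings, and a small refinement removes the false positive from the
`\*(I.` example:

```python
import re

# Heuristic: sentence-ending punctuation, optional closing quotes or
# brackets, then whitespace and more text on the same input line.
MID_LINE_END = re.compile(r"[.?!][\"')\]]*\s+\S")

# Two-character string interpolations of the form \*(xx.
TWO_CHAR_STRING = re.compile(r"\\\*\(..")


def naive_hit(line):
    """Flag a line using the bare heuristic."""
    return bool(MID_LINE_END.search(line))


def less_naive_hit(line):
    """Blank out \\*(xx interpolations first, then apply the heuristic."""
    return bool(MID_LINE_END.search(TWO_CHAR_STRING.sub("XX", line)))
```

On `I can't believe \*(I. ate the whole thing.` the naive version
reports a hit and the refined one does not; both flag `This is one
sentence. This is another.` Neither knows about `cflags`, `tr`,
`char`, or what a string expands to, and abbreviations such as "e.g."
still trigger it, so it remains a heuristic.
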
Cheers,
Alex
[1]: <http://www.alejandro-colomar.es/src/alx/alx/grepc.git/tree/bin/grepc>
--
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5