Re: Questionable code in handling of wordend in the regexp engine in reg

emacs-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Questionable code in handling of wordend in the regexp engine in reg

From:	Alan Mackenzie
Subject:	Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date:	Fri, 1 Mar 2019 14:14:48 +0000
User-agent:	Mutt/1.10.1 (2018-07-13)

Hello, Eli.

On Fri, Mar 01, 2019 at 15:46:14 +0200, Eli Zaretskii wrote:
> > Date: Fri, 1 Mar 2019 11:10:18 +0000
> > From: Alan Mackenzie <address@hidden>
> > Cc: address@hidden

> > SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
> > marker.c).  This latter function doesn't handle well the case of `d'
> > being in the middle of a multibyte character; sometimes it "rounds it
> > down", other times it "rounds it up" to a character position.  I think
> > it should be defined as rounding it down.  It would be a relatively
> > simple correction (at least, technically ;-).

> buf_bytepos_to_charpos is not supposed to be called when the byte
> position is in the middle of a multibyte sequence.  We have the
> CHAR_HEAD_P, BYTES_BY_CHAR_HEAD, and related macros for that.

Thanks, I didn't know that.  Maybe we should put an assert into the code,
like Stefan suggested.

> > For that matter, how many charpos <-> bytepos functions are there in
> > Emacs?

> Only one pair of such function exists for buffer text, and another
> pair for strings.

That's good.

> > >           ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
> > >           ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> > >           UPDATE_SYNTAX_TABLE (charpos);

> > > which seems even more broken because `d` might point to the first byte
> > > after the gap, so `d - 1` will point in the middle of the gap, so it's
> > > simply an invalid argument to PTR_TO_OFFSET.

> > I don't think this is right.  Both `d' and `offset' are byte
> > measurements, not character measurements, so it shouldn't matter whether
> > the "- 1" is inside or outside the parens.  However, it would be less
> > confusing if they were both (?all) the same.

> That's orthogonal.  Stefan is right in that you cannot in general do
> pointer arithmetics on pointers into buffer text without considering
> the gap.  You need to convert 'd' into a byte position (which is
> actually an offset from the beginning of buffer text), then decrement
> it, then convert back into a 'char *' pointer to the previous byte.
> The macros used for these conversions take care of skipping the gap.

> However, since the caller already took care to split the text into two
> parts, one before the gap and the other after the gap, it sounds like
> we don't need to bother about the gap in this case, unless 'd - 1'
> happens to point before the beginning of string2 argument to
> re_match_2_internal.

I've got rid of all the questionable "d - 1"s.  All these code pieces now
first do PTR_TO_OFFSET (d), then do SYNTAX_TABLE_BYTE_TO_CHAR on the
result, and then any arithmetic on the result of that.  (See patch
below).

> > > According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> > > the result may be the same as if we did the decrement after the fact,
> > > but it still looks fishy.  WDYT?

> > I think it is suboptimal to have both PTR_TO_OFFSET and
> > POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

> Those macros hide the fact that the argument could be a Lisp string or
> a buffer, so I don't think I agree with you here.

I just meant that having the two names so similar might be confusing.

[ .... ]

> > There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
> > regex-emacs.c.  I think I will check them all, amending them as in your
> > patch.

> > What do you say?

> I'm not Stefan, but what I say is that we should only make sure 'd'
> never points to the very first byte of 'string2'.  If it does, then
> decrementing it will produce invalid results.  If we cannot decide
> whether that situation could happen, we should add an assertion there
> to that effect.

I'm fairly sure it's safe, through always first doing PTR_TO_OFFSET (d),
which takes care of the gap.

Here's the patch (already "tested") which gets rid of the unwanted "d -
1"s:



diff --git a/src/regex-emacs.c b/src/regex-emacs.c
index b667a43a37..45b4f8107c 100644
--- a/src/regex-emacs.c
+++ b/src/regex-emacs.c
@@ -4732,8 +4732,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, 
re_char *string1,
                int c1, c2;
                int s1, s2;
                int dummy;
-               ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
-               ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+                ptrdiff_t offset = PTR_TO_OFFSET (d);
+                ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
                UPDATE_SYNTAX_TABLE (charpos);
                GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
                s1 = SYNTAX (c1);
@@ -4811,8 +4811,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, 
re_char *string1,
              int c1, c2;
              int s1, s2;
              int dummy;
-             ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-             ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
              UPDATE_SYNTAX_TABLE (charpos);
              GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
              s1 = SYNTAX (c1);
@@ -4826,7 +4826,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, 
re_char *string1,
                {
                  PREFETCH_NOLIMIT ();
                  GET_CHAR_AFTER (c2, d, dummy);
-                 UPDATE_SYNTAX_TABLE_FORWARD (charpos);
+                  UPDATE_SYNTAX_TABLE_FORWARD (charpos + 1);
                  s2 = SYNTAX (c2);
 
                  /* ... and S2 is Sword, and WORD_BOUNDARY_P (C1, C2)
@@ -4890,8 +4890,8 @@ re_match_2_internal (struct re_pattern_buffer *bufp, 
re_char *string1,
                 is the character at D, and S2 is the syntax of C2.  */
              int c1, c2;
              int s1, s2;
-             ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
-             ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
+              ptrdiff_t offset = PTR_TO_OFFSET (d);
+              ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset) - 1;
              UPDATE_SYNTAX_TABLE (charpos);
              GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
              s1 = SYNTAX (c1);


-- 
Alan Mackenzie (Nuremberg, Germany).

[Prev in Thread]

Current Thread

[Next in Thread]

Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Alan Mackenzie, 2019/03/01
- Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Stefan Monnier, 2019/03/01
- Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Eli Zaretskii, 2019/03/01
  - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Alan Mackenzie <=
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Eli Zaretskii, 2019/03/01
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Alan Mackenzie, 2019/03/01
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Eli Zaretskii, 2019/03/01
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Alan Mackenzie, 2019/03/01
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Alan Mackenzie, 2019/03/01
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Eli Zaretskii, 2019/03/01
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Alan Mackenzie, 2019/03/02
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Eli Zaretskii, 2019/03/02
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Eli Zaretskii, 2019/03/02
    - Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c, Alan Mackenzie, 2019/03/02

Prev by Date: Re: tabulated-list sort icon is reversed
Next by Date: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Previous by thread: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Next by thread: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Index(es):
- Date
- Thread