emacs-devel

Re: Another issue with thingatpt


From: Andreas Roehler
Subject: Re: Another issue with thingatpt
Date: Tue, 02 Jan 2007 14:34:46 +0100
User-agent: Thunderbird 1.5.0.4 (X11/20060516)

Bob Rogers wrote:
   From: Andreas Roehler <address@hidden>
   Date: Sun, 31 Dec 2006 10:25:35 +0100

   > Both interfaces (ffap and thing-at-point) are already customizable,
   > though in different ways.
   There is no `defcustom' form in thingatpt.el;
   it's done mostly with `defvar'. I wouldn't regard that
   as customizable.

Not in the sense of defcustom, no.  But someone who can't "customize" it
themselves via setq is probably not going to be able to change these
hairy regexps and/or char-classes without shooting themselves in the
foot.  It's not just a matter of understanding Emacs regexps, but
understanding how thing-at-point uses them.
Probably you are right.

   In any case, it seems to me that users shouldn't need to change the
regexp proper, since that is defined by RFC3986, just the set of
punctuation characters to drop at the end.
Maybe I am missing something, but as far as I can see the regexp in question is not derived from the RFC in a strict sense. Here is the relevant description from RFC 3986:

;;;;;;;;;;;;;;

     reserved    = gen-delims / sub-delims

     gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

     sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

...


  Characters that are allowed in a URI but do not have a reserved
  purpose are called unreserved.  These include uppercase and lowercase
  letters, decimal digits, hyphen, period, underscore, and tilde.

     unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

;;;;;;;;;;;;;;;

That's basically all I can find on the matter there.

 The only thing that needs to
be customized is just the "lose the punctuation" heuristic, IMHO.  And
the definition of "punctuation" should be enlarged so that it addresses
Slawomir's issue with parens, which are not even allowed internally.

   The problem mentioned originally however shouldn't occur, as

   ,----
   | (defvar thing-at-point-url-path-regexp
   |   "[^]\t\n \"'()<>[^`{}]*[^]\t\n \"'()<>[^`{}.,;]+"
   |   "A regular expression probably matching the host and filename or e-mail part of a URL.")
   `----

   includes that char. The error must reside elsewhere.

   Regards,

   Andreas Roehler

It does include a ";" in the second character class, but both are
inverted.  The second set is the same as the first set with the addition
of ".,;", which is why it refuses to match any of these characters at
the end of the URL.  This would be easier to see if the regexp were
written this way:

        (defvar thing-at-point-url-path-regexp
                (concat "[^]\t\n \"'()<>[^`{}]*"
                        "[^]\t\n \"'()<>[^`{}.,;]+")
          "A regular expression probably matching the host and filename or e-mail part of a URL.")

                                        -- Bob
Now I see it, thanks a lot.
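
For the record, the behaviour can be checked with a small sketch like the following. The names `my-url-path-regexp' and `my-url-path-match' and the sample URLs are made up for illustration and are not part of thingatpt.el; the regexp itself is copied from the quoted definition.

```elisp
;; Illustrative copy of the regexp quoted above, split into its two
;; negated character classes.
(defvar my-url-path-regexp
  (concat "[^]\t\n \"'()<>[^`{}]*"      ; any run of "allowed" characters
          "[^]\t\n \"'()<>[^`{}.,;]+")  ; must end in one that is not . , ;
  "Illustrative copy of `thing-at-point-url-path-regexp'.")

(defun my-url-path-match (string)
  "Return the part of STRING matched by `my-url-path-regexp', or nil."
  (when (string-match my-url-path-regexp string)
    (match-string 0 string)))

;; A trailing ";" is left out of the match ...
(my-url-path-match "www.gnu.org/software/emacs;")
;; => "www.gnu.org/software/emacs"

;; ... while a ";" inside the URL is kept; only the final "." is dropped.
(my-url-path-match "www.gnu.org/a;b.html.")
;; => "www.gnu.org/a;b.html"
```

So the ";" is accepted by the first class, and only rejected as the last character of the match.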

BTW: what about dropping the `;' from the regexp?

Maybe together with the comma, since the RFC lists that character as a sub-delimiter too.
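
A rough sketch of what that would mean, assuming only the period stays in the second class. The variable `my-relaxed-url-path-regexp', the helper `my-relaxed-url-path-match' and the sample strings are hypothetical and exist only for this illustration:

```elisp
;; Hypothetical variant: the second class no longer excludes "," and ";".
(defvar my-relaxed-url-path-regexp
  (concat "[^]\t\n \"'()<>[^`{}]*"
          "[^]\t\n \"'()<>[^`{}.]+")
  "Illustrative variant of the URL path regexp without `,' and `;'.")

(defun my-relaxed-url-path-match (string)
  "Return the part of STRING matched by `my-relaxed-url-path-regexp', or nil."
  (when (string-match my-relaxed-url-path-regexp string)
    (match-string 0 string)))

(my-relaxed-url-path-match "www.gnu.org/a;")  ; => "www.gnu.org/a;"
(my-relaxed-url-path-match "www.gnu.org/a,")  ; => "www.gnu.org/a,"
(my-relaxed-url-path-match "www.gnu.org/a.")  ; => "www.gnu.org/a"
```

The price would be that a comma or semicolon directly following a URL in running text gets taken as part of the URL.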

Other problems:

- The character ' (39, #o47, #x27) now seems to be excluded, whereas the RFC lists it as a
sub-delimiter too.

- (defvar thing-at-point-short-url-regexp
 (concat "[-A-Za-z0-9.]+" thing-at-point-url-path-regexp)

lacks the underscore in its bracket expression, which the RFC lists as unreserved.
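
Both points can be checked with a minimal sketch. The helper names `my-rejected-sub-delims' and `my-host-chars-only-p' are invented for this example. The second check looks at the bracket expression in isolation only; the full concatenated regexp may behave differently, because the path part follows it.

```elisp
(defun my-rejected-sub-delims ()
  "Return the RFC 3986 sub-delims rejected by the first character class."
  (let ((first-class "[^]\t\n \"'()<>[^`{}]")
        (rejected '()))
    ;; Test every sub-delim character on its own against the class.
    (dolist (ch (string-to-list "!$&'()*+,;="))
      (unless (string-match-p first-class (string ch))
        (push ch rejected)))
    (concat (nreverse rejected))))

(defun my-host-chars-only-p (bracket string)
  "Return non-nil if STRING consists only of characters matched by BRACKET."
  (and (string-match-p (concat "\\`" bracket "+\\'") string) t))

(my-rejected-sub-delims)                           ; => "'()"
(my-host-chars-only-p "[-A-Za-z0-9.]" "my_host")   ; => nil
(my-host-chars-only-p "[-A-Za-z0-9._]" "my_host")  ; => t
```

So besides the apostrophe, the parentheses are the other sub-delimiters the first class rejects.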



Andreas





