Re: Rationale for split-string?
From: Stephen J. Turnbull
Subject: Re: Rationale for split-string?
Date: Tue, 20 May 2003 10:55:20 +0900
User-agent: Gnus/5.1001 (Gnus v5.10.1) XEmacs/21.5 (carrot, linux)
>>>>> "sjt" == Stephen J Turnbull <address@hidden> writes:
sjt> OK. That is satisfactory for XEmacs, and we'll implement
sjt> that.
sjt> Unless you say you prefer to do it yourself, I will also
sjt> submit a patch against GNU Emacs CVS head, and audit the Lisp
sjt> code in CVS head to make sure there are no surprises from
sjt> callers with non-default SEPARATORS.
Enclosed are patches for lisp/subr.el and lispref/strings.texi to
implement the API for split-string discussed earlier.
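For quick reference, here is a minimal sketch of the three call forms under the proposed API. The comma-separated input is made up for illustration; the results shown are what the specification in the patch calls for.

```elisp
;; Default separators: split on runs of whitespace, drop edge nulls.
(split-string "  two words ")
;; => ("two" "words")

;; Explicit separator: every field is kept, including empty ones.
(split-string "a,,b," ",")
;; => ("a" "" "b" "")

;; Explicit separator plus OMIT-NULLS: empty fields are dropped.
(split-string "a,,b," "," t)
;; => ("a" "b")
```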
Also enclosed is the result of an audit of uses of split-string in
Emacs CVS (as of about three weeks ago). I didn't notice any cases
where the changed specification made existing code out-and-out
incorrect, so I am not suggesting any further patches. However, I
think many of the uses with an explicit SEPARATORS are semantically
dubious unless they pass the OMIT-NULLS flag. Most of those were
already dubious before the change to split-string, because it has
always been at least theoretically possible for a null string to
arise in the interior of the list. Most other uses of split-string
are dubious because they either depend heavily on undocumented
implementation details of other utilities (e.g., that the fields in
/etc/mtab are separated by exactly one space) or are not very robust
to bogus input. People who understand the modules in question might
want to take a closer look.
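To make the /etc/mtab concern concrete, here is a small sketch. The sample line is invented (it is not copied from a real mtab) and deliberately contains one doubled space between fields.

```elisp
;; Hypothetical mtab-like line with two spaces after the mount point.
(let ((line "/dev/hda1 /  ext3 rw"))
  (list
   ;; Assumes exactly one space per separator: a null field sneaks in.
   (split-string line " ")
   ;; OMIT-NULLS hides the doubled space.
   (split-string line " " t)
   ;; Default whitespace splitting is also robust here.
   (split-string line)))
;; => (("/dev/hda1" "/" "" "ext3" "rw")
;;     ("/dev/hda1" "/" "ext3" "rw")
;;     ("/dev/hda1" "/" "ext3" "rw"))
```

A caller that indexes into the first result by position would pick up the wrong field, which is the kind of fragility the audit is flagging.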
A few I couldn't tell at all without doing a much deeper analysis of
the code than I have time for right now:
./lisp/calendar/todo-mode.el:869: needs checking
./lisp/eshell/em-pred.el:601: needs checking
./lisp/mh-e/mh-utils.el:1606: needs checking
./lisp/textmodes/reftex.el:934: needs checking
./lisp/textmodes/reftex.el:2161: needs checking
If you set default-directory to the root of the Emacs hierarchy, the
following functions are useful for jumping from a line of the audit
listing to the code it refers to. N.B. A few of the references have
moved since I started the audit.
(defun sjt/parse-grep-n2 ()
  "Parse `grep -n -#' output for filename and line number.
Return (FILE . LINE) for the next reference, or nil if there is none."
  (interactive)
  (beginning-of-line)
  ;; Pass NOERROR so a failed search returns nil instead of signalling.
  (when (re-search-forward "^\\(\\S-+\\):\\([0-9]+\\):" nil t)
    (cons (match-string 1) (string-to-number (match-string 2)))))
(defun sjt/parse-grep-n-and-go ()
  "Jump to place specified by `grep -n' output."
  (interactive)
  (let* ((pair (or (sjt/parse-grep-n2)
                   (error "No FILE:LINE: reference found")))
         (file (car pair))
         (line (cdr pair)))
    (find-file file)
    (goto-line line)))
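As a sanity check on the regexp used above, this sketch applies it to one line copied from the audit listing, using `string-match` on a string instead of searching a buffer. The helper name `sjt-demo/parse-ref` is hypothetical and exists only for this illustration.

```elisp
(defun sjt-demo/parse-ref (line)
  "Return (FILE . LINENO) parsed from LINE, or nil if LINE has no reference.
Hypothetical helper for illustration; it is not part of the patch."
  (when (string-match "^\\(\\S-+\\):\\([0-9]+\\):" line)
    (cons (match-string 1 line)
          (string-to-number (match-string 2 line)))))

(sjt-demo/parse-ref "./lisp/apropos.el:267: want OMIT-NULLS t")
;; => ("./lisp/apropos.el" . 267)
```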
lisp/ChangeLog
2003-05-16 Stephen J. Turnbull <address@hidden>
* subr.el (split-string): Implement specification that splitting
on explicit separators retains null fields. Add new argument
OMIT-NULLS. Special-case (split-string "a string").
lispref/ChangeLog
2003-05-16 Stephen J. Turnbull <address@hidden>
* strings.texi (Creating Strings): Update split-string
specification and examples.
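The comment in the subr.el patch below says that the literal rule ("if both SEPARATORS and OMIT-NULLS are defaulted, treat OMIT-NULLS as t") simplifies to "if SEPARATORS is defaulted, treat OMIT-NULLS as t". A quick way to convince yourself is to enumerate the four cases. The function names here are hypothetical, chosen only for this sketch.

```elisp
(defun sjt-demo/omit-literal (separators omit-nulls)
  "Effective OMIT-NULLS under the literal wording of the specification."
  (if (and (null separators) (null omit-nulls)) t omit-nulls))

(defun sjt-demo/omit-simplified (separators omit-nulls)
  "Effective OMIT-NULLS under the simplified rule used in the patch."
  (if separators omit-nulls t))

(defun sjt-demo/omit-table ()
  "Return rows (SEPARATORS OMIT-NULLS LITERAL SIMPLIFIED) for all four cases."
  (let (rows)
    (dolist (separators '(nil ","))
      (dolist (omit-nulls '(nil t))
        (push (list separators omit-nulls
                    (sjt-demo/omit-literal separators omit-nulls)
                    (sjt-demo/omit-simplified separators omit-nulls))
              rows)))
    (nreverse rows)))

(sjt-demo/omit-table)
;; => ((nil nil t t) (nil t t t) ("," nil nil nil) ("," t t t))
```

The last two columns agree in every row, so the two formulations are interchangeable.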
Index: lisp/subr.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/subr.el,v
retrieving revision 1.350
diff -u -r1.350 subr.el
--- lisp/subr.el 24 Apr 2003 23:14:12 -0000 1.350
+++ lisp/subr.el 16 May 2003 10:03:58 -0000
@@ -1792,19 +1792,45 @@
(buffer-substring-no-properties (match-beginning num)
(match-end num)))))
-(defun split-string (string &optional separators)
- "Splits STRING into substrings where there are matches for SEPARATORS.
-Each match for SEPARATORS is a splitting point.
-The substrings between the splitting points are made into a list
+(defconst split-string-default-separators "[ \f\t\n\r\v]+"
+ "The default value of separators for `split-string'.
+
+A regexp matching strings of whitespace. May be locale-dependent
+\(as yet unimplemented). Should not match non-breaking spaces.
+
+Warning: binding this to a different value and using it as default is
+likely to have undesired semantics.")
+
+;; The specification says that if both SEPARATORS and OMIT-NULLS are
+;; defaulted, OMIT-NULLS should be treated as t. Simplifying the logical
+;; expression leads to the equivalent implementation that if SEPARATORS
+;; is defaulted, OMIT-NULLS is treated as t.
+(defun split-string (string &optional separators omit-nulls)
+ "Splits STRING into substrings bounded by matches for SEPARATORS.
+
+The beginning and end of STRING, and each match for SEPARATORS, are
+splitting points. The substrings matching SEPARATORS are removed, and
+the substrings between the splitting points are collected as a list,
which is returned.
-If SEPARATORS is absent, it defaults to \"[ \\f\\t\\n\\r\\v]+\".
-If there is match for SEPARATORS at the beginning of STRING, we do not
-include a null substring for that. Likewise, if there is a match
-at the end of STRING, we don't include a null substring for that.
+If SEPARATORS is non-nil, it should be a regular expression matching text
+which separates, but is not part of, the substrings. If nil it defaults to
+`split-string-default-separators', normally \"[ \\f\\t\\n\\r\\v]+\", and
+OMIT-NULLS is forced to t.
+
+If OMIT-NULLS is t, zero-length substrings are omitted from the list \(so
+that for the default value of SEPARATORS leading and trailing whitespace
+are effectively trimmed). If nil, all zero-length substrings are retained,
+which correctly parses CSV format, for example.
+
+Note that the effect of `(split-string STRING)' is the same as
+`(split-string STRING split-string-default-separators t)'. In the rare
+case that you wish to retain zero-length substrings when splitting on
+whitespace, use `(split-string STRING split-string-default-separators)'.
Modifies the match data; use `save-match-data' if necessary."
- (let ((rexp (or separators "[ \f\t\n\r\v]+"))
+ (let ((keep-nulls (not (if separators omit-nulls t)))
+ (rexp (or separators split-string-default-separators))
(start 0)
notfirst
(list nil))
@@ -1813,16 +1839,14 @@
(= start (match-beginning 0))
(< start (length string)))
(1+ start) start))
- (< (match-beginning 0) (length string)))
+ (< start (length string)))
(setq notfirst t)
- (or (eq (match-beginning 0) 0)
- (and (eq (match-beginning 0) (match-end 0))
- (eq (match-beginning 0) start))
+ (if (or keep-nulls (< start (match-beginning 0)))
(setq list
(cons (substring string start (match-beginning 0))
list)))
(setq start (match-end 0)))
- (or (eq start (length string))
+ (if (or keep-nulls (< start (length string)))
(setq list
(cons (substring string start)
list)))
Index: lispref/strings.texi
===================================================================
RCS file: /cvsroot/emacs/emacs/lispref/strings.texi,v
retrieving revision 1.23
diff -u -r1.23 strings.texi
--- lispref/strings.texi 4 Feb 2003 14:47:54 -0000 1.23
+++ lispref/strings.texi 16 May 2003 10:03:59 -0000
@@ -259,30 +259,46 @@
Lists}.
@end defun
-@defun split-string string separators
+@defun split-string string separators omit-nulls
This function splits @var{string} into substrings at matches for the regular
expression @var{separators}. Each match for @var{separators} defines a
splitting point; the substrings between the splitting points are made
-into a list, which is the value returned by @code{split-string}.
+into a list, which is the value returned by @code{split-string}. If
+@var{omit-nulls} is @code{t}, null strings will be removed from the
+result list. Otherwise, null strings are left in the result.
If @var{separators} is @code{nil} (or omitted),
-the default is @code{"[ \f\t\n\r\v]+"}.
+the default is the value of @code{split-string-default-separators}.
-For example,
+@defvar split-string-default-separators
+The default value of @var{separators} for @code{split-string}, initially
+@code{"[ \f\t\n\r\v]+"}.
+
+As a special case, when @var{separators} is @code{nil} (or omitted),
+null strings are always omitted from the result. Thus:
@example
-(split-string "Soup is good food" "o")
-@result{} ("S" "up is g" "" "d f" "" "d")
-(split-string "Soup is good food" "o+")
-@result{} ("S" "up is g" "d f" "d")
+(split-string " two words ")
+@result{} ("two" "words")
+@end example
+
+The result is not @samp{("" "two" "words" "")}, which would rarely be
+useful. If you need such a result, use an explicit value for
+@var{separators}:
+
+@example
+(split-string " two words " split-string-default-separators)
+@result{} ("" "two" "words" "")
@end example
-When there is a match adjacent to the beginning or end of the string,
-this does not cause a null string to appear at the beginning or end
-of the list:
+More examples:
@example
-(split-string "out to moo" "o+")
-@result{} ("ut t" " m")
+(split-string "Soup is good food" "o")
+@result{} ("S" "up is g" "" "d f" "" "d")
+(split-string "Soup is good food" "o" t)
+@result{} ("S" "up is g" "d f" "d")
+(split-string "Soup is good food" "o+")
+@result{} ("S" "up is g" "d f" "d")
@end example
Empty matches do count, when not adjacent to another match:
bash-2.05b$ find . -name '*.el' | xargs fgrep -2 -n split-string /dev/null
./lisp/apropos.el:267: want OMIT-NULLS t
./lisp/calendar/todo-mode.el:869: needs checking
./lisp/cvs-status.el:286: new semantics preferred; no error checking
./lisp/diff-mode.el:1047: OK, double default
./lisp/ediff-diff.el:1143: OK
./lisp/emacs-lisp/authors.el:460: double default, OK
./lisp/emacs-lisp/crm.el:419: new semantics preferred; no error checking
./lisp/emacs-lisp/crm.el:605: new semantics preferred; no error checking
./lisp/emacs-lisp/lisp-mnt.el:412: want OMIT-NULLS t
./lisp/emacs-lisp/unsafep.el:111: mentioned in comment, not used
./lisp/eshell/em-cmpl.el:403: new semantics preferred; no error checking
./lisp/eshell/em-ls.el:257: OK, double default
./lisp/eshell/em-pred.el:601: needs checking
./lisp/eshell/esh-util.el:228: want OMIT-NULLS t
./lisp/eshell/esh-util.el:449: new semantics preferred; no error checking
./lisp/eshell/esh-var.el:568: new semantics preferred; no error checking
./lisp/files.el:4254: double default, OK
./lisp/filesets.el:1202: new semantics preferred; no error checking
./lisp/gdb-ui.el:1001: new semantics preferred; no error checking
./lisp/gnus/gnus-art.el:4645: new semantics preferred; no error checking
./lisp/gnus/gnus-group.el:3798: OK
./lisp/gnus/gnus.el:2679: OK
./lisp/gnus/gnus.el:2681: OK
./lisp/gnus/mailcap.el:367: OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:502: want OMIT-NULLS t
./lisp/gnus/mailcap.el:648: new semantics preferred; no error checking
(splitting MIME content type)
./lisp/gnus/mailcap.el:702: new semantics preferred; no error checking
(splitting MIME content type)
./lisp/gnus/mailcap.el:870: OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:940: new semantics preferred; no error checking
(splitting MIME content type)
./lisp/gnus/message.el:4701: want OMIT-NULLS t
./lisp/gnus/mm-decode.el:55: new semantics preferred; no error checking
(splitting MIME content type)
./lisp/gnus/mm-decode.el:57: new semantics preferred; no error checking
(splitting MIME content type)
./lisp/gnus/mm-decode.el:264: new semantics preferred; no error checking
(splitting MIME content type)
./lisp/gnus/mm-decode.el:363: OK, double default
./lisp/gnus/mml.el:307: new semantics preferred; no error checking
(splitting MIME content type)
./lisp/gnus/mml.el:337: ditto
./lisp/gnus/nnslashdot.el:364: OK, double default
./lisp/gnus/nnslashdot.el:488: OK, could use OMIT-NULLS t instead
./lisp/gnus/nnultimate.el:176: OK, could use OMIT-NULLS t instead
./lisp/gnus/pop3.el:249: want OMIT-NULLS t
./lisp/gnus/pop3.el:346: want OMIT-NULLS t
./lisp/gnus/pop3.el:347: want OMIT-NULLS t
./lisp/gnus/pop3.el:409: want OMIT-NULLS t
./lisp/gnus/rfc2231.el:131: new semantics preferred; no error checking
(splitting encoded word into locale info)
./lisp/gud.el:1817: OK
./lisp/gud.el:1847: OK
./lisp/gud.el:2288: OK, double default
./lisp/gud.el:2813: OK
./lisp/hexl.el:635: double default, OK
./lisp/hexl.el:652: double default, OK
./lisp/ido.el:2502: want OMIT-NULLS t
./lisp/ido.el:2868: want OMIT-NULLS t
./lisp/info.el:387: want OMIT-NULLS t
./lisp/info.el:390: want OMIT-NULLS t
./lisp/mail/rfc2368.el:137: OK
./lisp/mail/rfc2368.el:144: new semantics preferred; no error checking
./lisp/mail/smtpmail.el:602: want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:156: want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:289: OK
./lisp/mh-e/mh-alias.el:469: OK
./lisp/mh-e/mh-comp.el:374: OK, double default
./lisp/mh-e/mh-e.el:2164: OK, double default
./lisp/mh-e/mh-index.el:475: OK, double default
./lisp/mh-e/mh-seq.el:966: OK, double default
./lisp/mh-e/mh-utils.el:1606: needs checking
./lisp/net/eudc-export.el:126: OK
./lisp/net/eudc.el:161: Emacs 21 compatible
./lisp/net/eudc.el:419: want OMIT-NULLS t
./lisp/net/eudc.el:442: check this
./lisp/net/eudc.el:833: want OMIT-NULLS t
./lisp/net/eudcb-ldap.el:90: OK
./lisp/net/ldap.el:415: new semantics preferred; no error checking
./lisp/net/ldap.el:420: OK
./lisp/net/tramp.el:5658: check this
./lisp/net/tramp.el:6257: tramp-split-string is not quite emacs compatible
./lisp/pcmpl-cvs.el:175: new semantics preferred; no error checking
./lisp/pcmpl-gnu.el:127: OK, double default
./lisp/pcmpl-linux.el:46: double default, OK
./lisp/pcmpl-linux.el:88: want OMIT-NULLS t
./lisp/pcmpl-linux.el:101: want OMIT-NULLS t
./lisp/pcmpl-rpm.el:39: OK, double default
./lisp/pcmpl-rpm.el:46: OK, double default
./lisp/pcmpl-unix.el:89: new semantics preferred; no error checking
./lisp/pcvs-util.el:227: want OMIT-NULLS t
./lisp/pcvs-util.el:228: want OMIT-NULLS t
./lisp/progmodes/ada-prj.el:590: want OMIT-NULLS t
./lisp/progmodes/ada-xref.el:207: new semantics preferred; no error checking
./lisp/progmodes/fortran.el:267: want OMIT-NULLS t
./lisp/progmodes/idlw-shell.el:1734: could use new split-string with OMIT-NULLS t
./lisp/progmodes/idlwave.el:3702: prior XEmacs-compatible, could use new split-string
./lisp/progmodes/inf-lisp.el:285: double default, OK
./lisp/progmodes/vhdl-mode.el:13030: new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13171: new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13698: new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13701: new semantics preferred; no error checking
./lisp/textmodes/bibtex.el:2665: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:192: Gone?
./lisp/textmodes/reftex-cite.el:373: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:383: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:445: OK
./lisp/textmodes/reftex-cite.el:863: new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:961: new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1552: new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1685: want OMIT-NULLS t
./lisp/textmodes/reftex-index.el:1734: OK, double default
./lisp/textmodes/reftex-index.el:1748: OK, double default
./lisp/textmodes/reftex-index.el:1755: OK, double default
./lisp/textmodes/reftex-index.el:1762: new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1818: new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:343: new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:482: OK, mapconcat used
./lisp/textmodes/reftex-parse.el:990: new semantics preferred; no error checking
./lisp/textmodes/reftex.el:934: needs checking
./lisp/textmodes/reftex.el:1455: OK, double default
./lisp/textmodes/reftex.el:1488: OK, double default
./lisp/textmodes/reftex.el:1556: OK, could use OMIT-NULLS t instead
./lisp/textmodes/reftex.el:2161: needs checking
(uses explicit re or explicit ws)
./lisp/vc-cvs.el:789: new semantics preferred; requires rewrite to use
./lisp/xml.el:432: OK
./lisp/xml.el:436: OK
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.