[Emacs-diffs] emacs/doc/lispref nonascii.texi
From: Eli Zaretskii
Subject: [Emacs-diffs] emacs/doc/lispref nonascii.texi
Date: Fri, 28 Nov 2008 13:26:18 +0000
CVSROOT: /cvsroot/emacs
Module name: emacs
Changes by: Eli Zaretskii <eliz> 08/11/28 13:26:18
Modified files:
doc/lispref : nonascii.texi
Log message:
(Text Representations, Converting Representations, Character Sets,
Scanning Charsets, Translation of Characters): Make text more accurate.
CVSWeb URLs:
http://cvs.savannah.gnu.org/viewcvs/emacs/doc/lispref/nonascii.texi?cvsroot=emacs&r1=1.10&r2=1.11
Patches:
Index: nonascii.texi
===================================================================
RCS file: /cvsroot/emacs/emacs/doc/lispref/nonascii.texi,v
retrieving revision 1.10
retrieving revision 1.11
diff -u -b -r1.10 -r1.11
--- nonascii.texi 22 Nov 2008 18:22:36 -0000 1.10
+++ nonascii.texi 28 Nov 2008 13:26:17 -0000 1.11
@@ -44,7 +44,7 @@
follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
unique number, called a @dfn{codepoint}, to each and every character.
The range of codepoints defined by Unicode, or the Unicode
-@dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs
+@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs
extends this range with codepoints in the range @code{110000..3FFFFF},
which it uses for representing characters that are not unified with
Unicode and raw 8-bit bytes that cannot be interpreted as characters
@@ -62,7 +62,8 @@
This internal representation is based on one of the encodings defined
by the Unicode Standard, called @dfn{UTF-8}, for representing any
Unicode codepoint, but Emacs extends UTF-8 to represent the additional
-codepoints it uses for raw 8-bit bytes.}.
+codepoints it uses for raw 8-bit bytes and characters not unified with
+Unicode.}.
For example, any @acronym{ASCII} character takes up only 1 byte, a
Latin-1 character takes up 2 bytes, etc. We call this representation
of text @dfn{multibyte}, because it uses several bytes for each
@@ -157,7 +158,7 @@
Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, provided that the multibyte text contains
-only @acronym{ASCII} and 8-bit characters. In general, these
+only @acronym{ASCII} and 8-bit raw bytes. In general, these
conversions happen when inserting text into a buffer, or when putting
text from several strings together in one string. You can also
explicitly convert a string's contents to either representation.
@@ -194,25 +195,32 @@
@defun string-to-multibyte string
This function returns a multibyte string containing the same sequence
of characters as @var{string}. If @var{string} is a multibyte string,
-it is returned unchanged.
+it is returned unchanged. The function assumes that @var{string}
+includes only @acronym{ASCII} characters and raw 8-bit bytes; the
+latter are converted to their multibyte representation corresponding
+to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
+Representations, codepoints}).
@end defun
@defun string-to-unibyte string
This function returns a unibyte string containing the same sequence of
characters as @var{string}. It signals an error if @var{string}
contains a non-@acronym{ASCII} character. If @var{string} is a
-unibyte string, it is returned unchanged.
+unibyte string, it is returned unchanged. Use this function for
+@var{string} arguments that contain only @acronym{ASCII} and eight-bit
+characters.
@end defun
@defun multibyte-char-to-unibyte char
This convert the multibyte character @var{char} to a unibyte
-character. If @var{char} is a non-@acronym{ASCII} character, the
-value is -1.
+character. If @var{char} is a character that is neither
+@acronym{ASCII} nor eight-bit, the value is -1.
@end defun
@defun unibyte-char-to-multibyte char
This convert the unibyte character @var{char} to a multibyte
-character.
+character, assuming @var{char} is either @acronym{ASCII} or raw 8-bit
+byte.
@end defun
@node Selecting a Representation
@@ -320,7 +328,7 @@
@cindex coded character set
An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
in which each character is assigned a numeric code point. (The
-Unicode standard calls this a @dfn{coded character set}.) Each
+Unicode standard calls this a @dfn{coded character set}.) Each Emacs
charset has a name which is a symbol. A single character can belong
to any number of different character sets, but it will generally have
a different code point in each charset. Examples of character sets
@@ -387,30 +395,42 @@
@var{charset}.
@end deffn
+ Emacs can convert between its internal representation of a character
+and the character's codepoint in a specific charset. The following
+two functions support these conversions.
+
+@c FIXME: decode-char and encode-char accept and ignore an additional
+@c argument @var{restriction}. When that argument actually makes a
+@c difference, it should be documented here.
@defun decode-char charset code-point
This function decodes a character that is assigned a @var{code-point}
in @var{charset}, to the corresponding Emacs character, and returns
-that character. If @var{charset} doesn't contain a character of that
-code point, the value is @code{nil}. If @var{code-point} doesnt't fit
-in a Lisp integer (@pxref{Integer Basics, most-positive-fixnum}), it
-can be specified as a cons cell @code{(@var{high} . @var{low})}, where
+it. If @var{charset} doesn't contain a character of that code point,
+the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp
+integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
+specified as a cons cell @code{(@var{high} . @var{low})}, where
@var{low} are the lower 16 bits of the value and @var{high} are the
high 16 bits.
@end defun
@defun encode-char char charset
This function returns the code point assigned to the character
-@var{char} in @var{charset}. If @var{charset} doesn't contain
-@var{char}, the value is @code{nil}.
+@var{char} in @var{charset}. If the result does not fit in a Lisp
+integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
+that fits the second argument of @code{decode-char} above. If
+@var{charset} doesn't have a codepoint for @var{char}, the value is
+@code{nil}.
@end defun
@node Scanning Charsets
@section Scanning for Character Sets
- Sometimes it is useful to find out which character sets appear in a
-part of a buffer or a string. One use for this is in determining which
-coding systems (@pxref{Coding Systems}) are capable of representing all
-of the text in question.
+ Sometimes it is useful to find out, for characters that appear in a
+certain part of a buffer or a string, to which character sets they
+belong. One use for this is in determining which coding systems
+(@pxref{Coding Systems}) are capable of representing all of the text
+in question; another is to determine the font(s) for displaying that
+text.
@defun charset-after &optional pos
This function returns the charset of highest priority containing the
@@ -421,7 +441,7 @@
@defun find-charset-region beg end &optional translation
This function returns a list of the character sets of highest priority
-that contain charcters in the current buffer between positions
+that contain characters in the current buffer between positions
@var{beg} and @var{end}.
The optional argument @var{translation} specifies a translation table to
@@ -453,7 +473,8 @@
A translation table has two extra slots. The first is either
@code{nil} or a translation table that performs the reverse
translation; the second is the maximum number of characters to look up
-for translation.
+for translating sequences of characters (see the description of
+@code{make-translation-table-from-alist} below).
@defun make-translation-table &rest translations
This function returns a translation table based on the argument
@@ -504,7 +525,7 @@
an array of 256 elements to map byte values 0 through 255 to
characters. Elements may be @code{nil} for untranslated bytes. The
returned table has a translation table for reverse mapping in the
-first extra slot.
+first extra slot, and the value @code{1} in the second extra slot.
This function provides an easy way to make a private coding system
that maps each byte to a specific character. You can specify the
@@ -524,7 +545,8 @@
character or a character sequence). If @var{from} is a vector of
characters, that sequence is translated to @var{to}. The returned
table has a translation table for reverse mapping in the first extra
-slot.
+slot, and the maximum length of all the @var{from} character sequences
+in the second extra slot.
@end defun
@node Coding Systems
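
For readers who want to try the behavior described by the revised text, here is a minimal Emacs Lisp sketch. It is not part of the patch. The helper names my-raw-byte-codepoint and my-charset-round-trip are hypothetical, and the results noted in the comments assume an Emacs where unibyte-string is available and where the conversion functions behave as the updated documentation says.

```elisp
;; Illustrative only: not part of the patch above.
;; Both helper names are made up for this sketch.

(defun my-raw-byte-codepoint (byte)
  "Return the codepoint that `string-to-multibyte' assigns to raw BYTE.
BYTE is an integer in the range 0..255."
  (aref (string-to-multibyte (unibyte-string byte)) 0))

(defun my-charset-round-trip (char charset)
  "Encode CHAR in CHARSET, then decode the resulting code point.
Return nil if CHARSET has no code point for CHAR."
  (let ((code (encode-char char charset)))
    (and code (decode-char charset code))))

;; ASCII bytes keep their value; bytes 80..FF (hex) are mapped
;; into the 3FFF80..3FFFFF area.
(my-raw-byte-codepoint ?A)             ; 65
(my-raw-byte-codepoint #xE9)           ; #x3FFFE9

;; The reverse conversion recovers the raw byte.
(multibyte-char-to-unibyte #x3FFFE9)   ; #xE9

;; A character that is neither ASCII nor eight-bit gives -1.
;; #x3B1 is GREEK SMALL LETTER ALPHA.
(multibyte-char-to-unibyte #x3B1)      ; -1

;; encode-char and decode-char are inverses where the charset
;; contains the character, and encode-char gives nil where it does not.
(my-charset-round-trip ?A 'ascii)      ; 65
(my-charset-round-trip #x3B1 'ascii)   ; nil
```

The round-trip helper checks the result of encode-char before calling decode-char, because a nil code point is not a valid second argument to decode-char.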