Re: [vile] spellflt.l: Include UTF-8 code points


From: Thomas Dickey
Subject: Re: [vile] spellflt.l: Include UTF-8 code points
Date: Sun, 23 Jun 2019 16:35:01 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

On Sun, Jun 23, 2019 at 09:47:18PM +0200, Michael von der Heide wrote:
> It works (hunspell) for me with words like "prüfen" or "Straße". Flex
> generates an 8-bit scanner. UTF-8 should work. Would you mind testing it?

sorry - when you said "code points", I had in mind Unicode.

Applying the term to UTF-8 sequences doesn't seem entirely correct,
though I'm aware people use the two interchangeably.  (not to argue,
but a string isn't a point)
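
To make the distinction concrete, here is a small sketch in Python (used
only for illustration, it is not part of vile or the patch): one code
point, but two bytes once it is encoded as UTF-8.

```python
# One Unicode code point versus its UTF-8 byte sequence.
ch = "\u00fc"                 # U+00FC (u with diaeresis), a single code point
encoded = ch.encode("utf-8")  # the two-byte sequence a flex scanner would see

print(len(ch), hex(ord(ch)))        # number of code points, and its value
print(len(encoded), encoded.hex())  # number of bytes, and their hex values
```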

lex/flex will allow ranges, and hexadecimal escapes are standard (so plain "lex" accepts them too):

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/lex.html

> --
> Michael von der Heide
> 
> 
> Thomas Dickey <address@hidden> schrieb am So., 23. Juni 2019, 21:24:
> 
> > On Sun, Jun 23, 2019 at 07:42:26PM +0200, Michael von der Heide wrote:
> > > Would it be possible to include UTF-8 code points to check words containing
> > > umlauts?
> > >
> > > WORD          ([a-zA-Z]|\xc3[\x80-\xbf])+

for reference, that's the UTF-8 encoding for the Unicode codepoints 192-255:

192: 192 0300 0xc0 text "\300" utf8 \303\200
255: 255 0377 0xff text "\377" utf8 \303\277

and

0303: 195 0303 0xc3 text "\303" utf8 \303\203
0200: 128 0200 0x80 text "\200" utf8 \302\200
0277: 191 0277 0xbf text "\277" utf8 \302\277
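
The same check can be done for the whole range rather than just the two
endpoints. A minimal sketch, again in Python purely for illustration,
confirming that every code point from 192 to 255 encodes as \xc3 followed
by a byte in \x80-\xbf:

```python
# Encode each code point U+00C0..U+00FF and collect lead and trail bytes.
encodings = [chr(cp).encode("utf-8") for cp in range(0xC0, 0x100)]

lead_bytes = {e[0] for e in encodings}          # expected: only 0xC3
trail_bytes = sorted(e[1] for e in encodings)   # expected: 0x80..0xBF

print(lead_bytes == {0xC3})
print(trail_bytes == list(range(0x80, 0xC0)))
```

Note that this range is wider than the umlauts: it also covers the
non-letters U+00D7 (multiplication sign) and U+00F7 (division sign).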

Possibly clearer (ispell on my Debian8 works with this):

diff -u -r1.59 filters/spellflt.l
--- filters/spellflt.l  2013/12/02 01:32:53     1.59
+++ filters/spellflt.l  2019/06/23 20:28:42
@@ -157,7 +157,10 @@
 
 %}
 
-WORD           [[:alpha:]]([[:alnum:]])*
+ALPHA          [[:alpha:]]
+UMLAUT         \xc3[\x80-\xbf]
+LETTER         ({ALPHA}|{UMLAUT})+
+WORD           {LETTER}({LETTER}|[[:digit:]])*
 
 %%
 

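For anyone who wants to try the pattern without rebuilding the filter, the
patched definitions can be approximated with a byte-oriented regular
expression. This is only a sketch: it assumes [[:alpha:]] means the ASCII
letters (as in the "C" locale), and the helper name words() is made up for
the example.

```python
import re

# Byte-level approximations of the definitions in the patch above.
ALPHA = rb"[A-Za-z]"           # assumption: [[:alpha:]] in the "C" locale
UMLAUT = rb"\xc3[\x80-\xbf]"   # UTF-8 for U+00C0..U+00FF
LETTER = rb"(?:" + ALPHA + rb"|" + UMLAUT + rb")+"
WORD = re.compile(LETTER + rb"(?:" + LETTER + rb"|[0-9])*")


def words(text):
    """Return the WORD matches found in the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return [m.group().decode("utf-8") for m in WORD.finditer(data)]


print(words("pr\u00fcfen Stra\u00dfe x86"))
```

Matching on bytes rather than on decoded text mirrors what the 8-bit
scanner generated by flex does.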
> > > WORD          ([a-zA-Z]|\xc3[\x80-\xbf])+

> >
> > lex/flex doesn't do that :-(
> >
> > They use small (256-entry) tables for the character types.
> >
> > I've seen a (long ago) patch to use big tables (which I've read
> > doesn't work well).
> >
> > on my (too-long) to-do list, I have an idea which could be developed,
> > to provide the feature using character-classes.  That is, flex could
> > be modified (perhaps a month's work...)

-- 
Thomas E. Dickey <address@hidden>
https://invisible-island.net
ftp://ftp.invisible-island.net
