Re: [Groff] Re: Unicode, EBCDIC, Latin-2, JIS for groff


From: Werner LEMBERG
Subject: Re: [Groff] Re: Unicode, EBCDIC, Latin-2, JIS for groff
Date: Sat, 11 Mar 2000 13:42:07 GMT

>   *  The file iterator recognizes valid UTF-8 patterns in the input,
>      and when they are encountered they get transmuted into
>      \U'number'.  Latin-1 characters (which is to say, eight-bit
>      characters that are not part of a legal UTF-8 sequence) are
>      also temporarily translated into \U sequences; ASCII characters
>      are passed through unchanged.

Below you can find Markus Kuhn's UTF-8 test file, which contains a lot
of small devils...  I strongly suggest that your UTF-8 reading routine
handle all of them gracefully.
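
To make the quoted scheme concrete, here is a minimal sketch in Python
(not groff's actual code).  The function name and the hexadecimal
spelling \U'0xNNNN' are assumptions; the proposal only says
\U'number'.  The sketch also assumes the modern four-byte limit of
UTF-8 and relies on Python's strict decoder to reject overlong forms,
surrogates, and truncated sequences, which are the kind of devils the
test file contains.

```python
def utf8_to_escapes(data: bytes) -> str:
    """Turn raw input bytes into ASCII text plus \\U'0x...' escapes.

    Hypothetical helper: valid UTF-8 sequences become one escape each,
    any other eight-bit byte is treated as Latin-1, ASCII is unchanged.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # ASCII passes through unchanged
            out.append(chr(b))
            i += 1
            continue
        # Sequence length implied by the lead byte (0 = not a lead byte).
        if 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        cp = None
        if n and i + n <= len(data):
            try:
                # Strict decoding rejects overlong forms and surrogates.
                cp = ord(data[i:i + n].decode("utf-8"))
            except UnicodeDecodeError:
                cp = None
        if cp is None:                    # fall back to Latin-1 for this byte
            cp, n = b, 1
        out.append("\\U'0x%X'" % cp)
        i += n
    return "".join(out)
```

Note that a Latin-1 byte pair which happens to form a valid UTF-8
sequence (for example 0xC3 0xA9) is read as UTF-8 under this scheme;
the fallback only catches bytes that cannot be part of a legal
sequence.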

>      Accepting Latin-2 or whatever based on a command line option
>      would be easy to add; accepting EBCDIC would also be easy if
>      everyone could agree on what EBCDIC characters should map to
>      what Unicode characters.

AFAIK, there are 1:1 mapping tables (from ftp.unicode.org?).  Note
that there is no single EBCDIC; IBM has defined many EBCDIC code
pages.
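
As a small illustration of why the code page has to be named
explicitly, the following Python sketch decodes the same bytes with
two of the EBCDIC codecs that ship with Python, cp037 and cp500.  The
helper name decode_ebcdic and the choice of these two code pages are
only assumptions for the example; they are not part of any groff
interface.

```python
def decode_ebcdic(data: bytes, codepage: str = "cp037") -> str:
    """Map EBCDIC bytes to Unicode using a named 1:1 code page table.

    Hypothetical helper; `codepage` must be a codec name known to
    Python, such as "cp037" or "cp500".
    """
    return data.decode(codepage)


if __name__ == "__main__":
    # Letters agree between the two code pages, but byte 0x4A does not.
    sample = bytes([0xC8, 0x89, 0x4A])
    for cp in ("cp037", "cp500"):
        print(cp, repr(decode_ebcdic(sample, cp)))
```

A command line option selecting the input encoding would then have to
carry the code page name rather than just "EBCDIC".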

>   *  The tokenization routine recognizes \U and converts anything
>      outside the range 0x00 to 0xFF into \[char0xNNNN] or
>      \[char0xNNNNNNNN] as appropriate.
> 
>      This makes non-Latin1 characters second-class citizens (they
>      can't be used in the names of macros, etc.), but I was
>      intimidated by the task of finding every place in the program
>      that depends on characters being at most eight bits wide.

Your solution seems to be a first step, but I think we have to convert
groff to 32-bit characters internally.
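
For reference, the quoted tokenization step could be sketched like
this in Python.  This is only an illustration under stated
assumptions: the function name is hypothetical, the escape is assumed
to be spelled \U'0xNNNN' in hexadecimal, and the choice of four versus
eight hex digits at the 0xFFFF boundary is one reading of "as
appropriate".

```python
import re

# Assumed spelling of the intermediate escape: \U'0x<hex digits>'
_U_ESCAPE = re.compile(r"\\U'(0x[0-9A-Fa-f]+)'")


def expand_u_escapes(text: str) -> str:
    """Replace \\U escapes by eight-bit characters or \\[char0x...] names."""

    def repl(match):
        cp = int(match.group(1), 16)
        if cp <= 0xFF:
            # Fits the existing eight-bit character model: keep as a char.
            return chr(cp)
        if cp <= 0xFFFF:
            return "\\[char0x%04X]" % cp
        return "\\[char0x%08X]" % cp

    return _U_ESCAPE.sub(repl, text)
```

Everything above 0xFF ends up as a named special character, which is
exactly why such characters cannot appear in macro names under this
scheme.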

>   *  I haven't done anything with right-to-left or reordered
>      characters.  As I understand it, Plan 9 troff doesn't support
>      these (or combining accents) either.

Maybe a preprocessor for that?  Similar to Indic scripts.

> If you want the *output* to be UTF-8 as well as the input, this is
> also going to require changes to all the postprocessors.  It is what
> Plan 9 troff does, though.

I think only the TTY device needs UTF-8, and this has already been
implemented.  Regarding grohtml, I'm not sure...

For typesetting devices, I think we should think in terms of glyphs
rather than character sets.


    Werner


begin 644 UTF-8-test.txt.gz
M'XL(",#Y[3<"`U541BTX+71E<W0N='AT`-5;^7,C1Q7FY_P5C_DE=B(+6Y*O
M+98B;-:L(4<5NRZ*'UNC'JOCT8PR/6-95*HP))!P)MPW"?=-N`(D)#@)][7A
MOF$WW/>&,X%<O-<]1\](7CQKR_*Z4IOU:O3-ZW=\[W7/-RLGEJ86H,5MO\4#
ML%F7-84KPCXPKP4R#+B4$'(9PHY^KKE@:H<_.\6#/?VYYH++6;`627AFU/;@
MR9TU_-]3F=VI^L'J4V`*9A87%Z>F%Z=J<^.R[T1;Q`X/^48(?(-UA,=E!=I^
M#U;,8$EH8XQ<#NLL$#ZN*>QWN03?R>'9?A!$W9"WP`_`#]L\Z`G)address@hidden&
MPEN-426_.N*>S65U7]=[B8T6MLB,T(?EXU?"S/1<8VYJIH(&V:'P/0G/JLZK
M;*Q5ZW8%&*Y_7=@<`FYSL4[?-/'BQ;29Z^I%=@,>XI>L#G,=/^B@(Y*5XN>`
M#@').AQZK(^_L!!$F,-+,22"V&T6,!O_(;X4O1V%4K2XPF$M7SE:1DW)0ZL*
M*I8=SG`-*5XD(S0MOA=]*[,K'P="#WC7939^U,1Z3'[K<"\TXIN:-+%R\=+2
MTJ63%>BUA=T&address@hidden@5:VB?ARM>YP'9B/>0Y%V\?;!6,=:+6<)`BHYP
M60"RWVGZ;A660^B(U78(342!5=]O`2Z:4<C61;address@hidden&V7YP1^
address@hidden,X3]!?#V68\$L=?%6&)>#Z&=IVAK4TW#:<=(2*ZR/'Q
MCV84&EGO1&1I#L_EJPQ7N(3K;N*%R("address@hidden<SD+\&JTSE_UA%2WT`[!]/(C
MMT7>B"1O54V\9Y!=8M7S`\K,P:235(>1EUF=+E-"3V#2=AB&*L-K\6:TNJJP
M_("CEQU'V)$;JHJP,:1H9HL"@address@hidden>$TF,:W7?^(JH*EW<H5WC[;5]1]K<
M7J.2X)0%&->SU=O$S"15&1(C)A!>@:F&N5TQ\29J^4NP,I&C50+['L_5J%&:
M&'.\C5AGKE&\"D]BKC`7BY;B.5&?5"GL^*[K]RCH5T=^R-)"!>80'!6RJQ(W
M7H.)9RPGJRTL3K2KSUL5$%5>address@hidden(5]'W[';@>^*Y^BXA9I\T\=1R8BIS
M!)4`ZTNP3AP["D>ON-1*BLYEDEC&X\C;/@@'^GX$+=^[D)*#%^JW0I\&V=`1
M,-FF$O$[G+I<DV.`>$)C>%_F(B;5.C(#KXZX'V$)TCJD7EFR:H:UE\1Y?M$L
M6M?W5A-;J>EPC*.!UQ-(*M8UV`V68Z?H4%"?49\QO,$&;TWU1`M_(]*J8.#<
MJ./1C28"Q;LFGP;(!I,)!4ER&.9#Y*E&%;=Z(GEA,address@hidden,/F:G'>8)1N
M4IF'3**F#\Q6>\WM9Y6C0I8;0-!PKR55^//S!N8<>%&address@hidden
M=E7'address@hidden'<.Q!_W9TRZ:E[H:T]MZ^&8#C9%WB1^UQ-0L>"/N>@SF2Y![73GQZ
MP/D:]'",@PO7?-GA%R8>M4[><?]=M]]W[<D[3]YF[:>5-8#+A2<Z44?Q:H=M
MJ+\C_XH6_8G3#Y9:$&address@hidden>ZC%X>-05GWX`RI)128%!,I)1(;&G@>
MSL)2XC[`I8T;,I4>QC*0R>I8LZ96Q;Q>address@hidden>74JA9*O$W<@0Z`"F+"6_0
M^?ME'address@hidden:;6)F:CG\F4W:PGF"5`\0DK"E`F0$NF(#W;EJE`.L`]0+@
M0L["T[>41&P`-'*(,X4U/W#39DG(68!9$[)6=.-#-VR6Q<3=^)R!V1@(S2/7
M;98"W?O\P6!?QO8FO4=CW]#\GG<,)S[/*@<X)+_G'1/PU)95"G`POQTG!WAF
MJR1B,;address@hidden(0OY77>*D`]OE<;,Y_>\,X#YZ%8YT+W.'XS,address@hidden
M:JI3=J]X?*.+392:)X[1PHOTGD8[=JSV':46'IW%/N-(0?(N-NV08RO7FT.7
M=J(R.U3)G:J8-"-""7[/JQZ(>&0=?W"QTQL+TX?`VBEA:[R$88?B-1W$V[+&
MN=XZT>address@hidden)EZ#V'`;O$VK/-XL<>%V>*4LU'AS1(3;XI6Q4./-$PEN
MCU?"0HVW`#!_-KQ-:address@hidden/7-&<%V8:V3SPQ`*FZ"ZF:)DGSPT6ONHYVP^
M_P777O?"%UU_PXM?\M*7O?P5-^YVEWS3*U_UZM>\]G6O?\,;W_3FM[SU;3?O
M%N^6M[_CG>]Z]WO>^[[W?^"#'_KP1V[=+=Y'/_;Q3WSRMD]]^C.WW_'9.^_Z
M7,F:'76^$/OYGFH*(0M"\U3E8-A'?$^'9_4:.(KV=<YB9M>F=(=+QP;,8YOR
MN.5,5LYJG]KYZL/0Y"&([#+;.#0_-)[U4GW<#??`O?!Y^`)\$;X$7X:OP%?A
M:_!U^`9\$T["?:7C\2WX-GP'address@hidden@^_##^"'\"/X,address@hidden@Y_#+^`46&..
M;TW'=V:N&-_ZD/ARBB\_K^-[&NZ'7\*OX-?P&address@hidden/\*?X,address@hidden
MG!E[/.HZ'@O%<#2&A,.A<#CSP\-QGL3C`?@;_!W^`?^$?\&_X4$X8/Q,\QK%
MHU&,Q^RP>"Q0/)KG=3P>@O_`?^'A7<1AM/&8U?$8:$=SP^)A4SQ:YW4\'H%'
M]R`6HXI'/9MVI7[PI1X9#N[E.@('X)RH8Q_LHT1)$T3I)&R_TW6YD2?F#CV_
M+2=S79[#&]R?ZP>OZLFG>?+O>address@hidden(9H*#U\\>C)T,RT?I)5
MA8DCV95,71S+:*`V.=KXTOQ7F/2,..=".[%RL7%^:]UM#<7#>:-^3GBG-X<#
address@hidden@GP@<UM$)'R9\\)\:'-[2"1M>;."?*1S6TQYTI$QCS'M4X-QYLO$1GS
M&-<ZLS4<<*%$9,Q37#J^'8ZX6"(RYB&N.KT=#HE%MN/(F&>X^O#6V@>.Q70\
MXGLV"[FG"97.!`>93(ZG!Q#'$F.A'PTQ%&X?JNJ8*S:;)"H%CIPU-#L&7@;2
MY.3UE)4/'8R>IZCM]":2!U6[*L]39[8P8RG%'BUSGK^??1KY9[F3'D*5/CH?
ML7TG<N*HL.?'%MK,4UK`;I>address@hidden>address@hidden@,#A4C\1_U08?#8;`>L_8$#_N@
MXQ#>XWN#5U?VT7].#/S88X^?$_;>^P\9[,IU'I"F:O>/F4:>?YF%)`^C!,R(
MBIFZ9J60XV&H!5$&'GV0Z%UKU6DZV_-:+&A5X1AN,=`3%?IN7^&34Y3RF'N0
MZ#VT3#_#*Z:]%I,$B6*$Q3L7-*D;^*W(address@hidden>3A4N`4LR3(F\G)#4I^K[5D+'
M`;^*P.E+6H&KA,7#5*ZDKJ'ZS/LOX$SZGCRDE);+(;2YVY5DCI*Y4OVZPF9:
M]BT<address@hidden&address@hidden/#,&address@hidden;!9)KL5]VH`L'CYNZ!`3
MO8/TUY%:=!MKX7"ZQB%Z2'[BU>LD`A_,%^;B/:D9K_/_9Y82-MO*R^A/89.D
M'_<&36[BD=R8W-7L=YE4TD0UZ>OODM)5;RC(^9<</[*\G!/]D<:Y8)_P*.-H
M=\*2$3'1'RGIH,/Q;Q.7+4U"3UE&_D:+T(%*#=XW'DHJ//4E7(27:N35+U*;
M-;TQS31A5Y0PFR20NLN'I'QN\@'_H6<IT9.=5F93LZ]5CT;6)JI$E]'-8UUJ
M$4]TN=8>+FGY.>Z.HH!>=^GZ`C==&)-UP7N5Q'LX.F&TXC=B\,-B?60YH&7^
MI%2-0UGTO\K<B\@/%YEUF(]O05FL,72+BQ<4,&0!\B2=Z,T[53CA8\AD%,0)
M7<@_20^<R?(A"ZJ09EQ'address@hidden<O"JR,J\6(I+%%&H.1KE%!$
MA$J5)JLC[1_4?8]N,)IXD[U\:FPA(N/H'\_6BM]!;U?4\[]X3Y]U&(>((UW`
M(-X03B&`XDHE[E':,&$]R9HTCC%T,'DK5[_TGLBV&OF*"G!"&I2A]*Y+*,)(
MJ0T32;/9CX8*XE,QM!8_:O.:'!==B87CBFG:;)V;#(/U-N"X)QZH^;FA=`^T
M,:_19&5/`W-RE^"T=?>M5CG`6@;(IV$ACWF8SCO*(M8S1$<AFJ"'U8%'6<B&
M`;D00Z:HA_6)1UG,60/3SC`U[.'XR.-6:YSA)J%Q+"address@hidden
MF*$M5MLD[T^&3J6%3M]FTWT8&25R=4%[1*`9WL`2:address@hidden>Y8J?[K/*A*79[P$
MH'I8_!J),7ZH]]-P0FC1**O>.R""&=*,JNHL7]&AT5)C+DOQ3%(K3*8E7G(<
M:;XHH5JJOB2.F('F`$?<LZ.S`0U82P'address@hidden<)G'$S240ZPDBG6#%'*$0
MFR9'W+A5`K*A(?7)7<P1\S%D,^6(Z[?*8,XJS/CL+N:(>H;9U!QQ[=:.0?<^
MW'5C/YOOU4FKOF+ELAV.)*/>S:address@hidden/9J
M"B2;5,@)0^PSYH4#^52KH9]ZZ'-XW>ZQ-Q7;?0E9>D,address@hidden>=+O\NV^)&+=
M0'2,)FJV^Y*0#1-R(=>:LW9?$G/6Q,RW^X7IM-V/4>0?O_.X9WC_`S7IQGM[
#00``
`
end

