Re: [groff] Accented Cyrillic characters


From: Robin Haberkorn
Subject: Re: [groff] Accented Cyrillic characters
Date: Thu, 2 Aug 2018 19:47:35 +0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

Hello Ralph!

I see! Groff seems to combine composites into single code points where possible,
probably to better support terminals and other software that cannot combine
them themselves. Makes sense.
But for the remaining glyphs it should, IMHO, a) make sure that accent glyphs
have zero width and b) not drop them from composite Unicode escapes. Why is
there composite support at all, where you can even specify Unicode code points,
if composites are always reduced to a single code point in the end?
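
For comparison, Unicode's own NFC normalization behaves in a similar way: it
composes a base letter and a combining mark only when a precomposed code point
exists. The following is a minimal Python sketch using the standard
unicodedata module. The helper name compose is made up for illustration, and
this shows Unicode normalization, not necessarily what groff does internally.

```python
import unicodedata


def compose(base: str, *marks: str) -> str:
    """Return the NFC form of a base character followed by combining marks."""
    return unicodedata.normalize("NFC", base + "".join(marks))


def code_points(text: str) -> list[str]:
    """Format every character of text as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Latin e + combining acute: a precomposed character exists.
latin_acute = compose("e", "\u0301")
# Cyrillic ie (U+0435) + combining grave: a precomposed character exists.
cyrillic_grave = compose("\u0435", "\u0300")
# Cyrillic ie (U+0435) + combining acute: nothing to compose to.
cyrillic_acute = compose("\u0435", "\u0301")

for label, text in [
    ("e + U+0301", latin_acute),
    ("U+0435 + U+0300", cyrillic_grave),
    ("U+0435 + U+0301", cyrillic_acute),
]:
    print(label, "->", code_points(text))
```

The last case stays as two code points, so whatever renders it has to keep the
combining mark and treat it as zero-width.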

I tried adding a line like
u0301 0 0 0xCC81
to the R font for devutf8.
But it doesn't work. How does grotty interpret the code field? It is obviously
not simply the UTF-8 bytes.
(Sorry, I'm not motivated enough to seriously debug this in the Groff sources.
I just hoped that somebody would already know what's going on here.)
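
As a sanity check on the value used above: 0xCC81 is the UTF-8 encoding of
U+0301, whereas the R entries in the grep output quoted below carry values such
as 0x00C9 for u0045_0301, which is the code point of the precomposed letter and
not its UTF-8 bytes. A small Python sketch of that comparison follows; the
helper name utf8_as_int is made up for illustration, and whether an entry with
0x0301 would actually make grotty emit the accent is untested here.

```python
def utf8_as_int(code_point: int) -> int:
    """Return the UTF-8 encoding of code_point packed into one integer."""
    return int.from_bytes(chr(code_point).encode("utf-8"), "big")


# The combining acute accent tried in the font file.
acute_code_point = 0x0301
acute_utf8 = utf8_as_int(acute_code_point)

# The existing entry "u0045_0301 24 0 0x00C9" from the quoted grep output.
e_acute_code_point = 0x00C9
e_acute_utf8 = utf8_as_int(e_acute_code_point)

print(f"U+{acute_code_point:04X} as UTF-8: {acute_utf8:#06X}")
print(f"U+{e_acute_code_point:04X} as UTF-8: {e_acute_utf8:#06X}")
```

Since the existing entry stores 0x00C9 and not the UTF-8 form, the code column
looks like a plain code point.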

Best regards,
Robin

On 02.08.2018 17:26, Ralph Corderoy wrote:
> Hello Robin!
> 
>> Currently, I'm just adding a standalone Unicode combining accent
>> character (U+0301) after every vowel I want to show stress on, since
>> Unicode does not seem to define separate code points for all of the
>> Cyrillic accented vowels.
> 
> That's the recommendation in
> https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
> 
>> the terminal emulator (at least URXVT) will combine the accent and the
>> vowel into a single glyph.
> 
> xterm(1) does too.  libvte-based terminals seem to place it on the line
> above!?
> 
>> This approach of adding accents causes problems with tbl, though. The
>> combination of the two characters into a single glyph screws up tbl's
>> (and/or Groff's) assumptions. For instance, in a table like:
>>     | саморазруше́ние |
>>     | foo bar         |
>> the bars won't properly line up.
> 
> It boils down to persuading `\w', used by tbl(1), that the U+0301 takes
> no space.
> 
>     $ groff -Tutf8 >/dev/null
>     .nr w \w'A'       
>     .tm \nw 
>     24
>     .nr w \w'\[u0435]'
>     .tm \nw 
>     24 
>     .nr w \w'\[u0435]\[u0301]'
>     .tm \nw          
>     48 
>     $
> 
> Tricks like overstrike with `\o' and moving left with \h affect the \w
> but don't give the desired output because grotty(1) also processes them.
> 
>> For instance, \[u0435_0301] should theoretically also format as an
>> accented Cyrillic e.  But what happens instead is that the accent is
>> dropped during formatting.  Curiously, this works when using latin
>> characters. For instance, \[e u0301], \[e aa], \[e '] will result in a
>> properly accented latin e.
> 
> I think those are mapped onto their Unicode rune, and as you start by
> saying, there isn't one for U+0435 combined with U+0301.
> 
>     $ cd /usr/share/groff/1.22.3/font/devutf8
>     $ grep 0435 R
>     u0435_0300        24      0       0x0450
>     u0435_0308        24      0       0x0451
>     u0435_0306        24      0       0x04D7
>     $ grep '0045.*0301' R 
>     u0045_0301      24      0       0x00C9
>     u0045_0304_0301 24      0       0x1E16
>     u0045_0302_0301 24      0       0x1EBE
>     $
> 
> I look forward to solutions and workarounds from the others here.  :-)
> 

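To illustrate what `\w' would have to report for the table columns to line up,
here is a minimal Python sketch that counts combining marks as zero-width. The
function name cell_width is made up for illustration; the 24 units per
character cell are taken from the transcript above. It ignores double-width
East Asian characters and is not how groff computes widths.

```python
import unicodedata

CELL_UNITS = 24  # width of one character cell, as reported by \w above


def cell_width(text: str, cell_units: int = CELL_UNITS) -> int:
    """Width of text in units, treating combining marks as zero-width."""
    spacing = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return spacing * cell_units


plain = "A"
accented = "\u0435\u0301"  # Cyrillic ie followed by combining acute
word = "саморазруше\u0301ние"  # 14 letters plus one combining mark

print(cell_width(plain))
print(cell_width(accented))
print(cell_width(word))
```

With this rule the accented vowel occupies one cell, not the two cells (48
units) that groff reported in the transcript.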

