Re: [groff] devutf8 on Windows

From:	Jeff Conrad
Subject:	Re: [groff] devutf8 on Windows
Date:	Mon, 25 Feb 2019 01:36:14 +0000

I’ve combined responses to several messages.

On Sunday, February 24, 2019 4:37 AM, Ralph Corderoy wrote:
> 
> > With the code page set to 65001 (UTF-8),
> 
> Using `chcp' or changing the registry?

Using chcp, with cmd and the MKS Korn shell (the MKS stty doesn’t
handle code page 65001).

> https://ss64.com/nt/chcp.html suggests CMD may have trouble compared to
> Powershell?

I couldn’t quite follow what they were getting at here; I don’t seem to
have a problem with pipes (as mentioned, that’s one of the few ways
things work).  I am using the Lucida Console font.  I’ve heard many
rumors of problems with code page 65001, but have never been able to
pin down anything specific that I could reproduce.

> What if you have a C program that writes to a `binary' stdout a UTF-8
> byte sequence and run that at CMD?  That removes groff.

This is essentially what I did with the put_char() method from tty.cpp.
It works just fine—this is what puzzled me.

In response to Eli’s request, I’ve included the listing below.

I reset stdout to binary with freopen(), and the results were unchanged.
I also tried unbuffered output using setvbuf(); the UTF-8 characters
disappear.  Sending the output through MKS cat fixes things, even when
given the ‘u’ option (though the pipe may well cause input buffering).
more.com doesn’t like this, giving complete gibberish.

The MKS environment variables for input and output are set to UTF-8, so
there’s no translation.  Briefly: the MKS Toolkit does input-to-output
translation among OEM (e.g., 437), “ANSI” (e.g., 1252) and UTF-8
encodings; this can be specified in the environment as I have done, or
via options to a few programs such as cat and more.  The Toolkit doesn’t
currently seem to honor code page 65001, and when it is set, using the
program options has a nasty habit of resetting everything to the OEM
code page ... this is being looked into.  Essentially, setting
everything to UTF-8 disables translation.

----------------------------------------------------------------------------
On Sunday, February 24, 2019 7:53 AM, Eli Zaretskii wrote:
> What version of Windows is that?

Windows 10 Pro, Version 1803.

> > >groff -Tutf8 quot.txt
> > â€˜quoteâ€™ minus: âˆ’ /f/: /â„
> >
> > The output looks as if it’s being interpreted as code page 1252. If the
> > output is redirected, things look fine, e.g.,
> >
> > >groff -Tutf8 quot.txt | more.com
> > ‘quote’ minus: − /f/: /⁄
> 
> Which version of 'more' is that?  The one which comes with Windows 7
> fails with "not enough memory".  I needed to use a ported Less to have
> this displayed.

This is the one that comes with Windows 10; as mentioned above, it has
problems.  I used it simply to take the MKS Toolkit out of the equation.
I normally use MKS more.

> > I get the same results with the MKS Korn shell.  If the code page is set
> > to 1252 rather than 65001, results are as expected—rendered as code
> > page 1252.
> 
> That's a known issue, see below.

Not sure what you mean here.  Are you referring to the MKS Korn shell or
the Windows console?

> > If I put tty.cpp’s put_char() in a simple program (compiled with
> > Visual Studio 2015) and feed it Unicode values, the output is what I
> > expect—properly rendered UTF-8.
> 
> How did you "feed it Unicode values", exactly?  And what was the
> simple program you used?

A listing is worth a thousand words ...

/*
*****************************************************************************
  unitest3: test UTF-8 display
  Usage: unitest3
  Author: Jeff Conrad
  Date: 22 February 2019
  Revised:
*****************************************************************************
*/

#include <stdlib.h>
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <string.h>

// from tty.cpp, groff 1.22.3
#define putstring(s) fputs(s, stdout)

typedef unsigned int output_character;

static int is_unicode;

// from tty.cpp, groff 1.22.3
void
put_char(output_character wc)
{
  if (is_unicode && wc >= 0x80) {
    char buf[6 + 1];
    int count;
    char *p = buf;
    if (wc < 0x800)
      count = 1, *p = (unsigned char)((wc >> 6) | 0xc0);
    else if (wc < 0x10000)
      count = 2, *p = (unsigned char)((wc >> 12) | 0xe0);
    else if (wc < 0x200000)
      count = 3, *p = (unsigned char)((wc >> 18) | 0xf0);
    else if (wc < 0x4000000)
      count = 4, *p = (unsigned char)((wc >> 24) | 0xf8);
    else if (wc <= 0x7fffffff)
      count = 5, *p = (unsigned char)((wc >> 30) | 0xfC);
    else
      return;
    do *++p = (unsigned char)(((wc >> (6 * --count)) & 0x3f) | 0x80);
      while (count > 0);
    *++p = '\0';
    putstring(buf);
  }
  else
    putchar(wc);
}

int
main(int argc, char *argv[])
{
    unsigned int i, nchars;

    unsigned int chars[] = {
        0x2018, // left single quote
        0x71,
        0x75,
        0x6F,
        0x74,
        0x65,
        0x2019, // right single quote
        0x20,
        0x6D,
        0x69,
        0x6E,
        0x75,
        0x73,
        0x3A,
        0x20,
        0x2212, // minus
        0x20,
        0x2F,
        0x66,
        0x2F,
        0x3A,
        0x20,
        0x2F,
        0x2044  // fraction slash
    };

    is_unicode = 1;

#if 0
    if (setvbuf(stdout, NULL, _IONBF, 80) != 0) {
        perror("Can't setvbuf");
        exit(1);
    }

    putchar(0xE2);
    putchar(0x80);
    putchar(0x98);
    putchar('\n');

    puts("\xE2\x80\x98");
    printf("\xE2\x80\x98\n");
#endif
    nchars = sizeof chars / sizeof(unsigned int);
    for (i = 0; i < nchars; i++)
        put_char(chars[i]);

    putchar('\n');

    exit(0);
}

It was built from the command line using Visual Studio Community 2015.
Offhand, I can’t see how this differs much from what grotty is doing.
The biggest variable I can think of is the compiler.

> It is a known problem with the Windows console: you cannot reliably
> write UTF-8 encoded text to it using the ANSI/Posix emulation
> functions like 'write', 'printf', and their C++ equivalents, and
> expect the correct display, even if you set the codepage to 65001.
> You need to use Unicode (a.k.a. "wide") APIs instead, like WriteFileW
> etc., and you need to feed them text converted into UTF-16.
> 
> This is not a Groff problem, the problem is with the Windows console
> device.

This is what I initially thought, which is why I was surprised that the
output from the program above is OK.

> I heard rumors that Windows 10 is better in this regard, but I don't
> have a Windows 10 box around with Groff on it, so I cannot test that.

I never had 7, going from XP (yeah, I know) to 10, so I can’t comment.

> > I’m running the ezwinports Win32 binary 1.22.3 (the 1.22.4 grotty
> > binary does the same thing).
>
> ezwinports has Groff 1.22.4 since almost 2 weeks ago.

That’s where I got the grotty 1.22.4 binary.

I’m once again trying to build from the source (both normal and Win32),
but it’s proving a tough process.  I had no trouble a dozen years ago,
but since then the normal environment has diverged considerably from
mine.  I don’t think a lot of MKS users build groff ...

----------------------------------------------------------------------------
On Sunday, February 24, 2019 7:54 AM, Eli Zaretskii wrote:

> > What if you have a C program that writes to a `binary' stdout a UTF-8
> > byte sequence and run that at CMD?  That removes groff.
> 
> In general, you will see the same.  As I said, you need to use Unicode
> APIs to reliably write UTF-8 to the Windows console device.

As above, this seems to work fine when I put the grotty character output
into a simple C program.

Thanks for all the ideas.

Regards,
Jeff

> -----Original Message-----
> From: groff [mailto:address@hidden] On Behalf Of Ralph Corderoy
> Sent: Sunday, February 24, 2019 4:37 AM
> To: address@hidden
> Subject: Re: [groff] devutf8 on Windows
> 
> Hi Jeff,
> 
> I don't know Windows at all...
> 
> > I’m getting strange behavior with devutf8 on Windows.
> >
> > >type quot.txt
> > .pl 1
> > \(oqquote\(cq minus: \(mi /f/: /\(f/
> >
> > With the code page set to 65001 (UTF-8),
> 
> Using `chcp' or changing the registry?
> 
> > from cmd,
> >
> > >groff -Tutf8 quot.txt
> > â€˜quoteâ€™ minus: âˆ’ /f/: /â„
> >
> > The output looks as if it’s being interpreted as code page 1252. If
> > the output is redirected, things look fine, e.g.,
> >
> > >groff -Tutf8 quot.txt | more.com
> > ‘quote’ minus: − /f/: /⁄
> 
> https://ss64.com/nt/chcp.html suggests CMD may have trouble compared to
> Powershell?
> 
> > Any ideas as to what could be happening?
> 
> What if you have a C program that writes to a `binary' stdout a UTF-8
> byte sequence and run that at CMD?  That removes groff.
> 
> --
> Cheers, Ralph.
