
Re: [lmi] UTF-16 output from automated GUI test?


From: Vadim Zeitlin
Subject: Re: [lmi] UTF-16 output from automated GUI test?
Date: Thu, 20 Oct 2016 21:55:19 +0200

On Thu, 20 Oct 2016 19:06:00 +0000 Greg Chicares <address@hidden> wrote:

GC> >  Nevertheless, O_BINARY does result in UTF-16 output as I've just
GC> > tested in my simple example.
GC> 
GC> Your example is basically:
GC>     _setmode(_fileno(stdout), _O_WTEXT);
GC>     return fputws(L"Hello, world!\n", stdout);
GC> which deliberately uses "wide" strings....

 Just to be clear, what I did was to replace _O_WTEXT by _O_BINARY. I.e.
just setting _O_BINARY results in UTF-16 output from fputws(), there is no
need for any other flags.
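
 For reference, here is a minimal sketch of that variant. The helper name
emit_wide_greeting() is made up for this illustration, and the _setmode()
call is guarded because it only exists in the MSW CRT. Elsewhere the
function just calls fputws(), so it shows nothing about UTF-16 there.

```cpp
#include <cstdio>
#include <cwchar>
#ifdef _WIN32
#   include <fcntl.h> // _O_BINARY
#   include <io.h>    // _setmode()
#endif

// Hypothetical helper: write a wide string to the given stream.
// Under MSW the stream is first switched to binary mode; this is the
// only difference from the _O_WTEXT version quoted above.
int emit_wide_greeting(std::FILE* out)
{
#ifdef _WIN32
    if (_setmode(_fileno(out), _O_BINARY) == -1)
        return -1;
#endif
    // fputws() returns a non-negative value on success.
    return std::fputws(L"Hello, world!\n", out);
}
```

 Calling emit_wide_greeting(stdout) under MSW and redirecting the output to
a file should be enough to inspect the bytes that were written.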

GC> which that change had successfully fixed. This differs from your example
GC> above in that 'test_coding_rules.exe' apparently emits ASCII only.

 Yes, it uses only std::cout, not std::wcout.

GC> AFAICS, lmi never deliberately emits any wide string to stdout or stderr,
GC> but I'm guessing that 'wx_test.exe' implicitly emits wide strings simply
GC> because it uses wx.

 You're right that the Unicode output comes from using wx functions such as
wxFputs() and wxPrintf(). The former could be replaced with fputs() to
avoid it, but the latter is trickier: it's perfectly valid to call
wxPrintf("%s", s) with "s" being a wxString, std::string or std::wstring,
but doing the same with printf() is undefined behaviour. An extra .c_str()
is required when using the standard string classes, and dealing with
wxString is even worse. And then there are also a couple of wxLogError()s
in the test code.
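
 To illustrate the .c_str() point with standard classes only, here is a
small sketch. format_label() is a hypothetical helper, not anything from
lmi or wx, and it uses snprintf() rather than printf() only so that the
result can be checked.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format "name: value" using the standard snprintf().
std::string format_label(std::string const& name, int value)
{
    char buf[128];
    // Passing "name" itself for %s would be undefined behaviour: the
    // standard printf() family needs the explicit .c_str() here.
    int const n = std::snprintf(buf, sizeof buf, "%s: %d", name.c_str(), value);
    if (n < 0)
        return std::string();
    // snprintf() returns the untruncated length, so clamp it to the buffer.
    std::size_t len = static_cast<std::size_t>(n);
    if (len >= sizeof buf)
        len = sizeof buf - 1;
    return std::string(buf, len);
}
```

 Every such call site would need the same explicit conversion, which is
what makes replacing wxPrintf() more tedious than replacing wxFputs().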

 We could get rid of all of them and use std::cout for output. This would
require relatively many changes, but is quite straightforward, so, as much
as I dislike it, this is probably the best solution.
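
 A rough sketch of what such output code could look like, assuming a
hypothetical report_result() helper (the name and the message format are
invented for this example). Any wxString argument would have to be
converted to a narrow string before being passed to it.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical reporting helper using only narrow iostreams. The stream is
// a parameter so that callers can pass std::cout and tests can pass a
// std::ostringstream.
void report_result(std::ostream& os, std::string const& test_name, bool ok)
{
    os << test_name << ": " << (ok ? "ok" : "FAILED") << '\n';
}
```

 In the test code this would be called as report_result(std::cout, name, ok),
so only narrow characters ever reach stdout.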

GC> If that's right, then I guess we have these options:
GC> 
GC> 1) Revert that change, and deal with the consequences some other way:
GC>    i.e., decide to live with CRNL terminators.

 This would be my personal choice.

GC> 3) Don't revert; build wx with 'enable-utf8'. Perhaps I'll try that,
GC>    because the pitfalls described here:
GC>      http://docs.wxwidgets.org/3.1/overview_unicode.html
GC>    don't sound very important for lmi.

 The main problem with the UTF-8 build is that the CRT locale never uses
UTF-8 under MSW. To be able to use narrow character CRT functions (i.e.
fputs() instead of fputws()) we need to trick wxWidgets into believing that
the locale is UTF-8 and while this works as long as you're only using
ASCII, it's clearly going to be a big problem as soon as an 8 bit character
sneaks in.
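
 To make the 8 bit problem concrete, here is a sketch with two hypothetical
helpers (neither is a wx function). It only checks the structure of the
byte sequence and does not reject overlong encodings or surrogates, so it
is an illustration and not a full validator.

```cpp
#include <cstddef>
#include <string>

// Return true if every byte is 7 bit ASCII.
bool is_ascii(std::string const& s)
{
    for (unsigned char c : s)
        if (c > 0x7F)
            return false;
    return true;
}

// Structural check only: each lead byte must be followed by the right
// number of 10xxxxxx continuation bytes.
bool looks_like_utf8(std::string const& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        unsigned char const c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false; // stray continuation or invalid lead byte
        if (s.size() - i <= extra)
            return false;  // sequence truncated at end of string
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;
        i += extra + 1;
    }
    return true;
}
```

 For example "caf\xE9" (with the CP1252 byte for e-acute) fails the check,
while the same text encoded as "caf\xC3\xA9" passes: bytes coming from an
8 bit CRT locale are generally not valid UTF-8 unless they are pure ASCII.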

GC> [BTW, that 'docs.wxwidgets.org' page has en-dash in three places where
GC> two dashes "--" really are intended, with 'configure' options. That's
GC> probably a doxygen artifact; AIUI, you can escape it as "\--" or "`--`"
GC> to avoid that markdown.]

 Indeed, thanks, should be corrected when the docs are rebuilt after
https://github.com/wxWidgets/wxWidgets/commit/0afb95d2f4c54f1044e550f4457dfffc560a850e

GC> > but this is a big change with a lot of ramifications and I just
GC> > don't see you agreeing to or even considering it in the near future.
GC> 
GC> What am I missing?

 Access to wxString contents using operator[] becomes O(N), meaning that
any loops over it using indices become O(N^2) in string length. This can be
pretty catastrophic even for relatively short strings. We made an effort to
update many such loops inside wx itself to use iterators (with which the
loops are still O(N)), but I'm not completely sure that there are none
remaining. Similarly, I don't think lmi code iterates over strings using
indices, but I'm not 100% sure about it (I'd need to review at least
wxPdfDocument code if we decide to test this).
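
 The cost can be seen in a simplified model of UTF-8 storage. The functions
below are hypothetical and are not how wxString is implemented (a real
implementation may cache positions, for example); they just count how many
bytes get examined when every indexed access has to scan from the start.

```cpp
#include <cstddef>
#include <string>

// True if the byte starts a code point, i.e. is not a 10xxxxxx
// continuation byte.
bool starts_code_point(unsigned char c) { return (c & 0xC0) != 0x80; }

// Single forward pass over the bytes, as an iterator-based loop would do.
std::size_t count_code_points(std::string const& utf8)
{
    std::size_t count = 0;
    for (unsigned char c : utf8)
        if (starts_code_point(c))
            ++count;
    return count;
}

// Stand-in for indexed access: find the byte offset of the n-th code point
// by scanning from the beginning. "steps" counts the bytes examined.
std::size_t offset_of(std::string const& utf8, std::size_t n, std::size_t& steps)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        ++steps;
        if (starts_code_point(static_cast<unsigned char>(utf8[i])))
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return utf8.size();
}

// Index-based loop: every access rescans the string, so the total number
// of bytes examined grows quadratically with the length.
std::size_t steps_indexed(std::string const& utf8)
{
    std::size_t steps = 0;
    std::size_t const length = count_code_points(utf8);
    for (std::size_t n = 0; n < length; ++n)
        offset_of(utf8, n, steps);
    return steps;
}
```

 For a 100 character ASCII string the indexed loop examines
1 + 2 + ... + 100 = 5050 bytes in this model, against 100 for a single
forward pass.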

 So the two big problems for me are conversion failures, with the ensuing
data loss, when converting any strings that are not pure 7 bit ASCII
to/from the CRT, and a potential for significant performance regressions.
I think this is more than enough to avoid using the UTF-8 build under MSW,
and the solution with switching to std::cout in the testing code looks much
safer from all points of view.

 Regards,
VZ

