help-libidn
Re: libidn2 0.13


From: Tim Rühsen
Subject: Re: libidn2 0.13
Date: Sun, 08 Jan 2017 17:20:57 +0100
User-agent: KMail/5.2.3 (Linux/4.8.0-2-amd64; KDE/5.28.0; x86_64; ; )

On Sunday, 8 January 2017 14:28:33 CET Simon Josefsson wrote:
> On Sat, 07 Jan 2017 19:48:42 +0100, Re: libidn2 0.13 wrote:
> > On Tuesday, 3 January 2017 10:00:53 CET Nikos Mavrogiannopoulos wrote:
> > > On Mon, Jan 2, 2017 at 10:17 PM, Tim Rühsen <address@hidden> wrote:
> > > >> * APIs more like libidn's that take a full domain name and do proper
> > > >>   operations on them.  In several forms, UTF-8, UCS-4, locale
> > > >>   encoded, etc.
> > > >>
> > > >> * APIs to decode an IDNA2008 domain from ACE to Unicode format.  That is
> > > >>   not described by the IDNA2008 RFCs, interestingly enough, but I
> > > >>   suspect people will want it, hah!
> > > > 
> > > > Wget used to use ACE decoding from libidn, but only for
> > > > logging/display purposes. Since we switched to libidn2, the
> > > > UTF-8/locale names are not displayed any more :-). With such
> > > > a function I would reactivate the logging code.
> > > 
> > > For gnutls unfortunately the reverse is really necessary and that's
> > > the reason we are stuck with libidn. We need to be able to print the
> > > actual name of the certificate and not only the punycode which is
> > > non-human readable for most languages.
> > 
> > Then let's define a function.
> > 
> > Let me start with a suggestion to get the ball rolling
> > 
> >     int idn2_fromASCII (const uint8_t *src, uint8_t **dst)
> > 
> > 'src' is a UTF-8 encoded string (domain name)
> > 'dst' is the punycode-decoded output, also UTF-8.
> 
> How about copying the libidn APIs?  Here are the low-level per-label
> primitives:
> 
>   /* Core functions */
>   extern IDNAPI int idna_to_ascii_4i (const uint32_t * in, size_t inlen,
>                                     char *out, int flags);
>   extern IDNAPI int idna_to_unicode_44i (const uint32_t * in, size_t inlen,
>                                        uint32_t * out, size_t * outlen,
>                                        int flags);
> 
> The idna_to_ascii_4i call is roughly equivalent to idn2_lookup.
> idna_to_unicode doesn't exist in libidn2.
> 
> Then the interesting APIs for applications:
> 
>   extern IDNAPI int idna_to_ascii_4z (const uint32_t * input,
>                                     char **output, int flags);
> 
>   extern IDNAPI int idna_to_ascii_8z (const char *input, char **output,
>                                     int flags);
> 
>   extern IDNAPI int idna_to_ascii_lz (const char *input, char **output,
>                                     int flags);
> 
>   extern IDNAPI int idna_to_unicode_4z4z (const uint32_t * input,
>                                         uint32_t ** output, int flags);
> 
>   extern IDNAPI int idna_to_unicode_8z4z (const char *input,
>                                         uint32_t ** output, int flags);
> 
>   extern IDNAPI int idna_to_unicode_8z8z (const char *input,
>                                         char **output, int flags);
> 
>   extern IDNAPI int idna_to_unicode_8zlz (const char *input,
>                                         char **output, int flags);
> 
>   extern IDNAPI int idna_to_unicode_lzlz (const char *input,
>                                         char **output, int flags);
> 
> 
> Mimicking these APIs is probably what's interesting.
> 
> I have mixed feelings about exposing LOOKUP vs REGISTER as separate
> APIs. How about using FLAGS to select REGISTER functionality?  Most
> applications will want to use LOOKUP; REGISTER is uncommon.  I think it
> is wasteful to burn a separate API point for the REGISTER functionality.

I agree regarding REGISTER functionality.
But I am not sure we need all of the above functions. Did you analyze whether
and how these are used in existing software (e.g. Debian packages depending on
libidn)?
The big advantage of this approach is that at some point we could merge
libidn and libidn2. But see further below...

> > Examples:
> > foo.bar -> foo.bar
> > übel.de -> übel.de
> > xn--bel-goa.de -> übel.de
> > xn--bel-goa.größer.de -> übel.größer.de
> 
> Depending on IDNA2003 vs IDNA2008, TR46, transitional and
> non-transitional, phase of the moon, and so on, of course.

How much does this influence fromASCII functionality? The punycode decoding
is identical. Isn't IDNA2003/2008 just about encoding? What is the
difference when it comes to decoding?
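
For illustration, here is a minimal sketch of the per-label decoding step
under discussion: Punycode (RFC 3492) decoding of one label, written out as
UTF-8. The Punycode algorithm itself takes no IDNA version parameter. This is
not the code from libidn or libidn2; the names punycode_decode, utf8_encode
and label_to_utf8 are hypothetical, the ACE prefix is assumed to be matched
exactly as lowercase "xn--", and no IDNA validity checks are applied to the
decoded label.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Punycode parameters from RFC 3492, section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128 };

static int digit_value(char c)
{
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= '0' && c <= '9') return c - '0' + 26;
    return -1;
}

/* Bias adaptation, RFC 3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t numpoints, int first)
{
    uint32_t k = 0;
    delta = first ? delta / DAMP : delta / 2;
    delta += delta / numpoints;
    for (; delta > ((BASE - TMIN) * TMAX) / 2; k += BASE)
        delta /= BASE - TMIN;
    return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Decode the text after "xn--" into code points; returns count or -1. */
static int punycode_decode(const char *in, uint32_t *out, size_t outmax)
{
    size_t inlen = strlen(in), pos = 0;
    uint32_t n = INITIAL_N, i = 0, bias = INITIAL_BIAS, outlen = 0;
    const char *dash = strrchr(in, '-');

    if (dash) {                      /* copy the basic (ASCII) code points */
        for (const char *p = in; p < dash; p++) {
            if ((unsigned char) *p >= 0x80 || outlen >= outmax) return -1;
            out[outlen++] = (unsigned char) *p;
        }
        pos = (size_t) (dash - in) + 1;
    }
    while (pos < inlen) {            /* one inserted code point per round */
        uint32_t oldi = i, w = 1;
        for (uint32_t k = BASE;; k += BASE) {
            int d = pos < inlen ? digit_value(in[pos++]) : -1;
            if (d < 0 || (uint32_t) d > (UINT32_MAX - i) / w) return -1;
            i += (uint32_t) d * w;
            uint32_t t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
            if ((uint32_t) d < t) break;
            if (w > UINT32_MAX / (BASE - t)) return -1;
            w *= BASE - t;
        }
        bias = adapt(i - oldi, outlen + 1, oldi == 0);
        if (i / (outlen + 1) > 0x10FFFF - n) return -1;
        n += i / (outlen + 1);
        i %= outlen + 1;
        if (outlen >= outmax || (n >= 0xD800 && n <= 0xDFFF)) return -1;
        memmove(out + i + 1, out + i, (outlen - i) * sizeof *out);
        out[i++] = n;
        outlen++;
    }
    return (int) outlen;
}

/* Write one code point as UTF-8; returns the number of bytes written. */
static size_t utf8_encode(uint32_t cp, char *out)
{
    size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    static const unsigned char lead[] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };
    for (size_t j = len - 1; j > 0; j--, cp >>= 6)
        out[j] = (char) (0x80 | (cp & 0x3F));   /* continuation bytes */
    out[0] = (char) (lead[len] | cp);
    return len;
}

/* Hypothetical helper: decode one label to UTF-8; 0 on success, -1 on error. */
int label_to_utf8(const char *label, char *out, size_t outsize)
{
    uint32_t cps[63];                /* a DNS label is at most 63 octets */
    size_t used = 0;

    if (outsize == 0) return -1;
    if (strncmp(label, "xn--", 4) != 0) {       /* not ACE: copy unchanged */
        if (strlen(label) >= outsize) return -1;
        strcpy(out, label);
        return 0;
    }
    int count = punycode_decode(label + 4, cps, 63);
    if (count < 0) return -1;
    for (int j = 0; j < count; j++) {
        if (used + 5 > outsize) return -1;      /* up to 4 bytes plus NUL */
        used += utf8_encode(cps[j], out + used);
    }
    out[used] = '\0';
    return 0;
}
```

With this sketch, label_to_utf8("xn--bel-goa", buf, sizeof buf) fills buf
with "übel", and a label without the prefix, such as "foo", is copied
unchanged. A domain-level function would split the input on dots and apply
this to each label.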

> > Casing: we leave input as it is - only domain labels that start with
> > xn-- will be converted without any casing check.
> > 
> > Why UTF-8 for both input and output?
> > - Most applications already work with UTF-8 internally.
> > - It is easy to convert to UTF-16/UTF-32 (UCS-2/UCS-4).
> > - It leaves charset transcoding out of the library.
> > - ...
> 
> I'd say most applications actually don't care about encoding -- they use
> strings in Unix locale encoded format.

Such applications would fail in many situations / environments. Everything
goes UTF-8 in the long term. How else will you exchange (text) documents
between different international environments?

And we are talking about IDN-aware applications - very likely they already
deal with UTF-8.

> > Do we need an additional 'flags' parameter for future use? Why not.
> 
> Indeed.
> 
> > If we want charset transcoding, we also need input and output
> > charset, maybe also language (e.g. think of Turkish i/I casing). Do
> > we want that?
> 
> I don't recall anyone requesting that from libidn -- and it is possible
> to do that in the application (like the "idn" command line tool does).

Generally, app developers are not aware of side effects when using strcasecmp
or any other locale-dependent function. That is the reason why they won't ever
ask :-(
However, that doesn't have to be our first concern.
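
As a small illustration of the locale concern: a sketch of a prefix check
that folds only ASCII A-Z and therefore does not depend on the current
locale, unlike strcasecmp or tolower. The helper names ascii_tolower and
has_ace_prefix are hypothetical, and accepting "XN--" in any casing is an
assumption made for this example, not something settled in this thread.

```c
#include <stddef.h>

/* Lowercase only A-Z, independent of the current locale. */
static char ascii_tolower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char) (c - 'A' + 'a') : c;
}

/* Hypothetical helper: returns 1 if the label starts with the ACE prefix
   "xn--" in any ASCII casing, 0 otherwise. */
int has_ace_prefix(const char *label)
{
    static const char prefix[] = "xn--";

    for (size_t i = 0; i < 4; i++)
        if (ascii_tolower(label[i]) != prefix[i])
            return 0;    /* also stops at the NUL of a short label */
    return 1;
}
```

Because the comparison never calls a locale-aware function, it gives the same
answer whatever LC_CTYPE is set to.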

> Very few applications deal with multiple charsets natively.  The ones
> that do often want to do the conversion internally.

My code uses iconv and after that idn functions - it could make some lives
(and code) easier if both were combined. To make such a decision, we
should look at existing code. If I find time, I will start adding a wiki page
for that on GitLab.
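
For readers unfamiliar with the pattern, the application-side transcoding
step might look roughly like the sketch below, using POSIX iconv to produce
UTF-8 before the string is handed to an IDN function. The helper name to_utf8
is hypothetical and not part of libidn or libidn2; the sketch assumes a
stateless source charset whose name the platform's iconv recognizes.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated string from from_charset
   to UTF-8. Returns a malloc'ed string, or NULL on failure. */
char *to_utf8(const char *src, const char *from_charset)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t) -1)
        return NULL;                 /* unknown charset name */

    size_t inleft = strlen(src);
    size_t outleft = inleft * 4;     /* generous bound for UTF-8 output */
    char *out = malloc(outleft + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *) src;        /* iconv does not modify the input */
    char *outp = out;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
        free(out);                   /* invalid or incomplete input */
        iconv_close(cd);
        return NULL;
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```

The caller frees the result after passing it on. Combining both steps in the
library would hide this boilerplate, at the cost of adding charset parameters
to the API.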

> I think it makes sense to only focus on UTF-8, UCS-4, and
> locale-encoded strings.  For the majority of Unix applications, they
> would want to use the locale-encoded API.  A few will want UTF-8 if they
> are UTF-8-pure applications.  A couple will prefer UCS-4.

Let's see.

Meanwhile, I added the core functionality in branch 'decode' for testing / 
playing around / discussion. 
Output is currently only UTF-8 (also with idn2 -d).

Regards, Tim
