help-libidn

Re: Decoding ACE created by libidn2...


From: Thomas Jacob
Subject: Re: Decoding ACE created by libidn2...
Date: Thu, 06 Jun 2013 15:21:56 +0200

On Wed, 2013-06-05 at 22:57 +0200, Simon Josefsson wrote:

> It is not trivial, and there may be multiple reasonable implementations.
> I have been meaning to write up one way to do it, and to implement that,
> in the hope that it could be established as a standard, but haven't
> found time.  I recall sending a short summary of the steps required to
> the IDNA list (I think) a long time ago when I noticed this issue with
> IDNA2008.

I see...

> > Libidn2 doesn't seem to supply such a function yet, the
> > older Libidn (at least the cmd line tool) doesn't either
> > really, but I can manually split the punycode part from
> > the xn-- in each label and then use Libidn's punycode decoder
> > to reach my goal. Seems a bit of a hassle though.
> 
> Yup, something like this is what a library could implement.  There are
> aspects that are unclear (for example, how to split the domain?  On
> ASCII dot '.' only, or the IDNA2003 domain separators?  Should you split
> on escaped dots?).
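
For reference, the manual approach described above can be sketched in
Python roughly as follows. This is only a sketch under assumptions: it
uses Python's built-in punycode codec rather than Libidn's decoder, it
splits on the ASCII dot unless told otherwise, and it performs no
IDNA2003/2008 validity checks on the decoded labels. The helper names
(label_to_unicode, domain_to_unicode) are made up for illustration.

```python
import re

ACE_PREFIX = "xn--"
ASCII_DOT = "."
# The four characters IDNA2003 (RFC 3490, section 3.1) recognizes as dots.
IDNA2003_DOTS = "\u002e\u3002\uff0e\uff61"


def label_to_unicode(label):
    """Decode a single label if it has the ACE prefix, else return it as-is."""
    if label[:len(ACE_PREFIX)].lower() == ACE_PREFIX:
        # Strip "xn--" and hand the remainder to the punycode decoder.
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def domain_to_unicode(domain, separators=ASCII_DOT):
    """Split on the given separator characters and decode each label.

    Assumption: the result is always joined with ASCII dots, so any
    alternative separators in the input are normalized away.
    """
    labels = re.split("[" + re.escape(separators) + "]", domain)
    return ".".join(label_to_unicode(label) for label in labels)
```

With these definitions, domain_to_unicode("www.xn--bue-6ka.de") should
give www.buße.de. Passing separators=IDNA2003_DOTS additionally splits
on the ideographic and fullwidth full stops, which is one possible
answer to the "how to split" question, not the only reasonable one.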

Hmm, I just noticed that the idnkit2.2 developers have now implemented
their own interpretation of reverse conversion. Here is some of
what they do:

python t.py  |  /usr/local/bin/idnconv2 -reverse
www.buße.de
www․buße․de
www‥buße‥de
www…buße…de
www⒈buße⒈de
www⒉buße⒉de
www⒊buße⒊de
www⒋buße⒋de
www⒌buße⒌de
www⒍buße⒍de
www⒎buße⒎de
www⒏buße⒏de
www⒐buße⒐de
www⒑buße⒑de
www⒒buße⒒de
www⒓buße⒓de
www⒔buße⒔de
www⒕buße⒕de
www⒖buße⒖de
www⒗buße⒗de
www⒘buße⒘de
www⒙buße⒙de
www⒚buße⒚de
www⒛buße⒛de
www㏂buße㏂de
www㏇buße㏇de
www㏘buße㏘de
www︙buße︙de
www︰buße︰de
www﹒buße﹒de
www.buße.de
www🄀buße🄀de

t.py:
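# For each code point listed in 'lst', print the test domain with that
# character used as the label separator (output encoding follows the locale).
with open('lst', encoding='utf-8') as f:
    for l in f:
        if not l.startswith('U+'):
            continue  # skip lines that are not code point entries
        ustr = l.split()[0].split('+')[1]  # hex digits after 'U+'
        u = chr(int(ustr, 16))
        print('www%sxn--bue-6ka%sde' % (u, u))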

'lst' contains text copied and pasted from
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:toNFKC=/\./:]
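
As an aside, a similar list can probably be generated locally instead
of being pasted from the web page. The following is a rough sketch,
assuming that "NFKC form contains an ASCII dot" is a fair reading of
the [:toNFKC=/\./:] query. The function name is made up, and the exact
set of characters depends on the Unicode version shipped with the
Python interpreter, so it may differ from the CLDR output.

```python
import sys
import unicodedata


def chars_with_dot_in_nfkc():
    """Return all characters whose NFKC normalization contains '.'."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if "." in unicodedata.normalize("NFKC", ch):
            found.append(ch)
    return found


if __name__ == "__main__":
    # One line per character, starting with a 'U+XXXX' token.
    for ch in chars_with_dot_in_nfkc():
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")))
```

Each output line starts with a U+XXXX token, which is the only part of
a line that t.py looks at.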

They don't interpret %2E, however:

echo "www%2Exn--bue-6ka%2Ede" | /usr/local/bin/idnconv2 -reverse
www%2exn--bue-6ka%2ede
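
To make the difference concrete, here is a small Python sketch of the
two possible behaviours. It assumes that "escaped dots" means
URL-style percent-encoding as in the test above; the split_labels
helper and its unescape flag are hypothetical and only serve to show
the choice a library would have to make.

```python
from urllib.parse import unquote


def split_labels(domain, unescape=False):
    """Split a domain into labels on ASCII dots.

    With unescape=True, percent-escapes such as %2E are decoded first,
    so an escaped dot also acts as a label separator.
    """
    if unescape:
        domain = unquote(domain)
    return domain.split(".")
```

Without unescaping, "www%2Exn--bue-6ka%2Ede" stays a single label and
nothing in it is recognized as an ACE label, which is consistent with
the idnconv2 output above. With unescape=True the same input splits
into www, xn--bue-6ka and de.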


But to be honest, I don't really understand the intricacies
of IDNA2003/2008 and the whole Unicode character transformation
and classification rules. That's why I am happy to use
your libraries whenever possible ;=)

   Regards,
       Thomas



