Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

On Mon, Jan 14, 2013 at 5:08 PM, Ivan Raikov <address@hidden> wrote:

Hi Sungjin,

Thanks for trying to use the uri-generic library. As Peter already pointed out, uri-generic and uri-common are intended to implement RFC 3986 (URIs), and so far no effort has been made to support RFC 3987 (IRIs). However, the IRI RFC does define a mapping from IRI to URI, in which Unicode characters in IRIs are converted to percent-encoded UTF-8 sequences. The caveat is that decoding these percent-encoded sequences will likely produce characters that are not valid in a URI. I have prototyped a procedure, iri->uri, which attempts to percent-encode all UTF-8 sequences in the input string and create a URI. You can see it here:

http://bugs.call-cc.org/browser/release/4/uri-generic/branches/utf8/uri-generic.scm

You can try iri->uri as follows:

(use uri-generic)
(print (iri->uri "http://example.com/삼계탕"))
(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "�%82%BC�%B3%84�%83%95") query=#f fragment=#f)

Note that the URI constructor still tries to percent-decode all characters in the path, and in this example this results in unprintable characters being displayed. So I will probably need to add a field to the URI structure that indicates whether UTF-8 sequences are included, and skip percent-decoding altogether when they are. Would this be sufficient for your needs?

Your proposed solution to extend the definition of the 'unreserved' character set is in line with RFC 3987, but I need to look some more at the code and see whether it would be possible to have an API where the user can choose whether to use IRIs or URIs. I prefer not to use UTF-8 sequences by default, since this might result in a uri-generic-based client sending invalid URIs to a server. Let me know what the exact requirements of your application are, and perhaps we can come up with a simple solution.

Ivan

On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun <address@hidden> wrote:

As far as I know, the revised RFC permits UTF-8 characters in the URL without encoding. Am I wrong here?

Even Solr (the search engine) permits them.

On Mon, Jan 14, 2013 at 1:26 PM, Alex Shinn <address@hidden> wrote:

Hi,

On Mon, Jan 14, 2013 at 12:52 PM, Sungjin Chun <address@hidden> wrote:

First, I might be looking in the wrong place, but...

It seems that the main source of my problem is this part of uri-generic.scm:

(define char-set:uri-unreserved
(char-set-union char-set:letter+digit (string->char-set "-_.~")))

If I change this part to:

(define char-set:uri-unreserved
(char-set-union char-set:letter+digit (string->char-set "-_.~") char-set:hangul))

then URIs/URLs with Korean characters work. How can I make this part more generic?

I believe the ASCII definition is correct even for Unicode URLs.

You need to represent the URL in utf8 and then use percent
escapes on the utf8 bytes, which is what would happen naturally
here.
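
A minimal sketch of that idea follows. It is not code from uri-generic: the names hex-byte and percent-encode-non-ascii are made up for illustration, and it assumes strings are plain byte sequences (as in CHICKEN 4 without the utf8 egg), so each char of a UTF-8 string is one byte.

```scheme
;; Illustrative sketch only; not part of uri-generic.
;; Assumption: strings are byte sequences (CHICKEN 4 without the utf8 egg),
;; so every char of a UTF-8 encoded string is one byte in the range 0-255.

;; Hypothetical helper: format one byte as "%XX" with uppercase hex digits.
(define (hex-byte n)
  (let ((digits "0123456789ABCDEF"))
    (string #\%
            (string-ref digits (quotient n 16))
            (string-ref digits (remainder n 16)))))

;; Hypothetical helper: percent-escape every byte >= 128 and leave
;; ASCII bytes (including any existing "%") untouched.
(define (percent-encode-non-ascii str)
  (apply string-append
         (map (lambda (c)
                (let ((n (char->integer c)))
                  (if (< n 128)
                      (string c)
                      (hex-byte n))))
              (string->list str))))

(print (percent-encode-non-ascii "http://example.com/삼계탕"))
```

Under the byte-string assumption this should print http://example.com/%EC%82%BC%EA%B3%84%ED%83%95, since the three Hangul syllables are the UTF-8 byte sequences EC 82 BC, EA B3 84 and ED 83 95. The sketch does not escape reserved ASCII characters, so it is only meant for turning an already well-formed IRI string into ASCII.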

--
Alex

_______________________________________________
Chicken-users mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/chicken-users

From:	Sungjin Chun
Subject:	Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date:	Tue, 15 Jan 2013 07:35:39 +0900