From: Sungjin Chun
Subject: Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date: Tue, 15 Jan 2013 07:35:39 +0900
Hi Sungjin,

Thanks for trying to use the uri-generic library. As Peter already pointed out, uri-generic and uri-common are intended to implement RFC 3986 (URIs), and so far no effort has been made to support RFC 3987 (IRIs). However, the IRI RFC does define a mapping from IRI to URI, where Unicode characters in IRIs are converted to percent-encoded UTF-8 sequences. The caveat here is that if you try to decode these percent-encoded sequences, they will likely result in invalid URI characters. I have prototyped a procedure iri->uri which attempts to percent-encode all UTF-8 sequences in the input string and create a URI. You can see it here:

http://bugs.call-cc.org/browser/release/4/uri-generic/branches/utf8/uri-generic.scm

You can try iri->uri as follows:

(use uri-generic)
(print (iri->uri "http://example.com/삼계탕"))

(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "�%82%BC�%B3%84�%83%95") query=#f fragment=#f)

Note that the URI constructor still tries to percent-decode all characters in the path, and in this example this results in unprintable characters being displayed. So I will probably need to add a field to the URI structure that indicates whether UTF-8 sequences are included, and avoid percent-decoding altogether. Would this be sufficient for your needs?

Your proposed solution to extend the definition of the 'unstructured' character set is in line with RFC 3987, but I need to look some more at the code and see whether it would be possible to have an API where the user can choose whether to use IRIs or URIs. I prefer not to use UTF-8 sequences by default, since this might result in a uri-generic based client sending invalid URIs to a server. Let me know what the exact requirements of your application are, and perhaps we can come up with a simple solution.

Ivan
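
The IRI-to-URI mapping discussed in this thread (represent the string in UTF-8, then percent-escape each non-ASCII byte) can be sketched roughly as follows. This is not the code of the prototype iri->uri; the names hex-digits, percent-escape-byte and escape-non-ascii are hypothetical, and the sketch assumes CHICKEN 4 without the utf8 egg, where a string is a sequence of bytes and each char has a code between 0 and 255. It also leaves every ASCII character untouched, which a real implementation would not do for reserved characters.

```scheme
;; Sketch only. Assumes byte-oriented strings (CHICKEN 4 without the utf8
;; egg), so a non-ASCII character appears as several chars with codes >= 128.

(define hex-digits "0123456789ABCDEF")

;; Turn one byte value (0-255) into a "%XX" escape.
(define (percent-escape-byte n)
  (string #\%
          (string-ref hex-digits (quotient n 16))
          (string-ref hex-digits (remainder n 16))))

;; Percent-escape every byte outside the ASCII range; ASCII bytes are
;; copied unchanged (no handling of reserved characters here).
(define (escape-non-ascii str)
  (apply string-append
         (map (lambda (c)
                (let ((n (char->integer c)))
                  (if (< n 128)
                      (string c)
                      (percent-escape-byte n))))
              (string->list str))))

(print (escape-non-ascii "http://example.com/삼계탕"))
```

Under the byte-string assumption, the three Korean characters are encoded in UTF-8 as EC 82 BC, EA B3 84 and ED 83 95, so the call should print http://example.com/%EC%82%BC%EA%B3%84%ED%83%95. The result contains only ASCII, so it can then be handed to a URI parser that follows RFC 3986.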
On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun <address@hidden> wrote:

As far as I know, the revised RFC permits UTF-8 characters in the URL without encoding. Am I wrong here? Even Solr (the search engine) permits them.

On Mon, Jan 14, 2013 at 1:26 PM, Alex Shinn <address@hidden> wrote:

> Hi,
>
> First, I might have found the wrong place, but it seems that the main source of my problem is related to this part of uri-generic.scm:
>
> (define char-set:uri-unreserved
>   (char-set-union char-set:letter+digit (string->char-set "-_.~")))
>
> If I change this part to:
>
> (define char-set:uri-unreserved
>   (char-set-union char-set:letter+digit (string->char-set "-_.~") char-set:hangul))
>
> then URIs/URLs with Korean characters work. How can I make that part more generic?

I believe the ASCII definition is correct even for Unicode URLs. You need to represent the URL in UTF-8 and then use percent escapes on the UTF-8 bytes, which is what would happen naturally here.

--
Alex

_______________________________________________
Chicken-users mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/chicken-users