Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

From: Alex Shinn
Date: Tue, 15 Jan 2013 19:30:07 +0900

On Tue, Jan 15, 2013 at 6:23 PM, Peter Bex <address@hidden> wrote:

On Tue, Jan 15, 2013 at 06:07:06PM +0900, Alex Shinn wrote:
> On Tue, Jan 15, 2013 at 3:03 PM, Ivan Raikov <address@hidden> wrote:
>
> >
> > Percent-encoded sequences of more than one octet will not get touched by
> > pct-decode in the current implementation, so you will not get double
> > escaping. Percent-encoded sequences of one octet will get decoded if they
> > fall in the "unstructured" char-set, as per RFC 3986.
> >
>
> OK, now I'm thoroughly confused. The percent-encoding is context sensitive?
> How can this not be broken?
>
> We need to make the design clear:
>
> * What can be constructed directly with make-uri.
> * What can be parsed, and how this is passed to make-uri.
> * How URIs are represented internally.
> * How URIs are encoded on output.
>
> It sounds like uri-common and uri-generic are doing different things here.

uri-generic is agnostic about specific encodings and types.
uri-common is designed to make life simpler in the case of "common" URIs
like HTTP where we know what types of characters are to be decoded.

RFC3986 "special characters" cannot be decoded unless we know they have
no special meaning. uri-common just decodes everything fully because
there is generally no deeper nested encoding involved. It's also smart
enough to know that port 80 belongs to http, so it can be omitted,
whereas uri-generic can't make such assumptions.

uri-common also makes the assumption that query args are
x-www-form-urlencoded. This is the main reason to prefer it for web
programming; uri-generic doesn't know about form-encoding because that
is really only used in the context of HTML (it's strictly not even a
HTTP thing), so this messy stuff should stay out of the generic URI
library.
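
To illustrate the two points above (reserved characters and
form-encoding), here is a minimal sketch in Python rather than Chicken.
It assumes Python's urllib.parse as a stand-in and says nothing about
how uri-common or uri-generic are actually implemented.

```python
from urllib.parse import quote, unquote, unquote_plus, parse_qsl, urlencode

# Generic percent-encoding: a space becomes %20 and a literal "+" becomes %2B.
generic = quote("a b+c", safe="")

# Generic decoding leaves "+" alone; form decoding turns it into a space.
plain_decoded = unquote("a+b%20c")
form_decoded = unquote_plus("a+b%20c")

# Form-encoded query strings round-trip through key/value pairs.
pairs = parse_qsl("q=caf%C3%A9+au+lait&lang=fr")
encoded_query = urlencode([("q", "café au lait")])

# Reserved characters: decoding %2F before splitting the path changes its
# structure, which is why a generic library cannot decode it blindly.
decoded_too_early = unquote("a%2Fb/c").split("/")
split_then_decoded = [unquote(s) for s in "a%2Fb/c".split("/")]

print(generic, plain_decoded, form_decoded)
print(pairs, encoded_query)
print(decoded_too_early, split_then_decoded)
```

Decoding too early yields three path segments where the original URI
had two, while splitting first keeps the encoded slash inside its
segment.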

Yes, the web is evil and must die.

Right, I'm familiar with the evil standards :) I'm also hoping that we can
have some basic compatibility between Chicken's uri module and Chibi's
(and whatever R7RS WG2 comes up with).

It seems to me the sane thing to do is represent URIs unencoded
internally, which can be generated directly with make-uri or decoded
on parsing. The decoding might be scheme-specific, although
really the only difference is the space-to-+ and query args encoding.
Then, on output we would encode as needed.
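
A rough sketch of that design, in Python rather than Scheme. The names
SimpleURI, parse_uri and uri_to_string are hypothetical and do not
correspond to anything in uri-common, uri-generic or Chibi; the query
handling assumes form-encoding, as for HTTP.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, quote, unquote, parse_qsl, urlencode


@dataclass
class SimpleURI:
    """Hypothetical record holding fully decoded components."""
    scheme: str
    host: str
    path: list   # decoded path segments
    query: list  # decoded (key, value) pairs


def parse_uri(text):
    """Decode on parsing: split the structure first, then decode each part."""
    parts = urlsplit(text)
    # Split on "/" before decoding so an encoded %2F stays inside its segment.
    segments = [unquote(s) for s in parts.path.split("/")]
    # Assumption: the query is form-encoded, so "+" means space.
    query = parse_qsl(parts.query, keep_blank_values=True)
    return SimpleURI(parts.scheme, parts.hostname or "", segments, query)


def uri_to_string(uri):
    """Encode only on output."""
    path = "/".join(quote(s, safe="") for s in uri.path)
    text = f"{uri.scheme}://{uri.host}{path}"
    if uri.query:
        text += "?" + urlencode(uri.query)
    return text


parsed = parse_uri("http://example.com/caf%C3%A9/a%2Fb?q=hello+world")
print(parsed.path, parsed.query)
print(uri_to_string(parsed))

# Constructed directly from unencoded data, with no parsing step.
built = SimpleURI("http", "example.com", ["", "日本語"], [])
print(uri_to_string(built))
```

With this layout the record never holds a percent sign that came from
encoding, so there is nothing to double-escape on output.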

I was confused because the uri-generic change Ivan suggests
seems to be putting encoded characters directly in the representation,
whereas uri-common is encoding only on output.

[It also looks like the uri-common encoding is broken - why were bytes
getting lost?]

Finally, regarding parsing I still don't understand why %AB is decoded
into the corresponding octet but %AB%CD is not?
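
To make the question concrete, here is what I would expect from an
octet-level decoder, sketched in Python with urllib.parse as a
stand-in. This only shows the behaviour I have in mind; it is not a
description of what pct-decode does.

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding works one escape at a time, so the length of a run of
# escapes should make no difference at the octet level.
one_octet = unquote_to_bytes("%AB")
two_octets = unquote_to_bytes("%AB%CD")

# Interpreting the octets as characters is a separate, later step.
e_acute = unquote_to_bytes("%C3%A9").decode("utf-8")

# 0xAB 0xCD is not valid UTF-8, so only the character step can fail here.
try:
    two_octets.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(one_octet, two_octets, e_acute, valid_utf8)
```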

Alex