Re: wget2 | URL parser does unwanted transformations of URL (#598)

wget-dev


From:	Nikita Ofitserov (@himikof)
Subject:	Re: wget2 | URL parser does unwanted transformations of URL (#598)
Date:	Tue, 23 Aug 2022 18:51:02 +0000



Nikita Ofitserov commented:


There is an even more interesting consequence of this: Metalink files whose 
URLs contain more than one query parameter cause wrong (mangled) URLs to be 
downloaded, and the saved file names can be mangled too.

Here is a simple example (the raw URI is 
`https://example.com/a&b.txt?apikey=foo&log=1`):
```xml
<?xml version='1.0' encoding='utf-8'?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="a&amp;b.txt">
    <size>42</size>
    <url>https://example.com/a&amp;b.txt?apikey=foo&amp;log=1</url>
  </file>
</metalink>
```

The metalink code just calls `wget_iri_parse` on the text content of the `url` 
element. That function calls `wget_iri_unescape_inline` a few times internally, 
but [this 
code](https://gitlab.com/gnuwget/wget2/-/blob/ed80255d/libwget/iri.c#L603) 
explicitly refuses to unescape the query part:
> `/* do not unescape query else we get ambiguity for chars like &, =, +, ... */`

So while the ampersand in the URL path is unescaped, the ones in the file name 
and in the URL query part are not. As a result, a wrong URL is downloaded and 
the result is saved under a wrong file name.
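
To make the mismatch concrete, here is a minimal, self-contained sketch in C. 
It is not wget2 code and not a proposed patch: `decode_range`, `decode_all` and 
`decode_skip_query` are hypothetical names made up for this illustration, and 
only the five predefined XML entities are handled (no numeric character 
references, no percent-decoding). `decode_all` models what a conforming XML 
parser would hand over as element text, while `decode_skip_query` is a 
simplified model of the behaviour described above, where everything from the 
first `?` on is left untouched.

```c
#include <stddef.h>
#include <string.h>

/* The five predefined XML entities (table is local to this sketch). */
static const struct { const char *name; char ch; } xml_entities[] = {
    { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' },
    { "&quot;", '"' }, { "&apos;", '\'' },
};

/*
 * Decode predefined XML entities in place, but only those that start
 * before offset `end`. Bytes at or after `end` are copied unchanged.
 * Decoding never grows the string, so in-place rewriting is safe.
 */
static char *decode_range(char *s, size_t end)
{
    const size_t n = sizeof(xml_entities) / sizeof(xml_entities[0]);
    const char *src = s;
    const char *stop = s + end;
    char *dst = s;

    while (*src) {
        int matched = 0;

        if (src < stop && *src == '&') {
            for (size_t i = 0; i < n; i++) {
                size_t len = strlen(xml_entities[i].name);

                if (strncmp(src, xml_entities[i].name, len) == 0) {
                    *dst++ = xml_entities[i].ch;
                    src += len;
                    matched = 1;
                    break;
                }
            }
        }
        if (!matched)
            *dst++ = *src++; /* ordinary byte, or outside the decoded range */
    }
    *dst = '\0';
    return s;
}

/* Decode the whole element text uniformly (path and query alike). */
static char *decode_all(char *s)
{
    return decode_range(s, strlen(s));
}

/* Simplified model of the reported behaviour: stop decoding at the first '?'. */
static char *decode_skip_query(char *s)
{
    const char *q = strchr(s, '?');

    return decode_range(s, q ? (size_t)(q - s) : strlen(s));
}
```

Applied to the `url` text from the example, `decode_all` yields the raw URI 
`https://example.com/a&b.txt?apikey=foo&log=1`, whereas `decode_skip_query` 
yields `https://example.com/a&b.txt?apikey=foo&amp;log=1`, i.e. the query still 
carries the literal `&amp;`.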

Also, while reading the metalink code I realized that it silently (and wrongly) 
assumes that the metalink XML describes only a single file, though that is 
probably a separate issue.

-- 
Reply to this email directly or view it on GitLab: 
https://gitlab.com/gnuwget/wget2/-/issues/598#note_1074999600