Re: wget | wget should save directory listings as index.html (#11)
From: |
Yaroslav Nikitenko (@ynikitenko) |
Date: |
Thu, 26 May 2022 17:28:08 +0000 |
Yaroslav Nikitenko commented:
According to
[w3techs](https://w3techs.com/technologies/details/ws-microsoftiis),
"Microsoft-IIS is used by 6.0% of all the websites whose web server we know",
so you are probably right.
I agree that saving several versions is workable with
*--convert-links*. If I understand correctly, `directory` and `directory/` will
be stored as `directory.1` and `directory.2` if `directory/x` is found (because
you didn't write that `wget2` would save `directory/` as
`directory/index.html`). Here the order of .1/.2 is arbitrary as well, but
that is probably not important because of link conversion. So I'll write about
the default mirroring (without *--convert-links*).
Mirroring a site so that we save as many of its pages as we can looks
important, but doing it correctly in every case is in general impossible.
`wget` saves to the file system. For `directory`, `directory/` and
`directory/index.html` it can save only one correct file (`index.html`, which
will directly correspond to the original site); the two other files (if they
exist and are all different) will have to be saved under new names, which can
always:
1) conflict with another path from the site (as it is with `index.html`;
however, the site can also have `directory.1`! Like
https://docs.djangoproject.com/en/4.1/)
2) falsely represent a non-existent site page. What if we never had
`directory.1` on our site and are surprised to see it appear in our
downloaded files? This may be pretty minor, but I explicitly forbid my server
to serve `dir/index.html` as a separate path just to avoid content duplication
with `dir`.
This process of picking new names can be endless (if we have `directory.1`, we
save to `directory.2`, etc.; and the same with `index` and `index.1`, etc.), and
in the end there is probably no real difference between `directory.1` and
`index.1.html`. Ignoring clashing names completely may be a no less justified
option.
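The renaming loop described above can be sketched in Python. This is only an illustration, not wget's actual code: `next_free_name` and `taken` are hypothetical names, and `taken` is assumed to hold every name already used, whether it came from the site itself or from an earlier fallback.

```python
def next_free_name(base, taken):
    """Return the first `base.N` (N = 1, 2, ...) that is not in `taken`.

    `taken` is assumed to contain all names already in use, both real
    site paths and names invented by earlier fallbacks.
    """
    n = 1
    while f"{base}.{n}" in taken:
        n += 1
    return f"{base}.{n}"


# The site itself serves `directory.1`, so the fallback has to move on.
taken = {"directory", "directory.1"}
fallback = next_free_name("directory", taken)
```

The loop always terminates for a finite `taken`, but the name it returns is still an invented one: nothing in the result says whether `directory.2` existed on the site or not, which is problem 2) above.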
This is why I think that this can be solved
a) through complicated options, with which the user can describe the exact
algorithm they want for their site,
b) as a "default" solution for most cases. This is not a solution in a strict
sense if we want to browse the site locally; but as I wrote, such a solution
does not exist in general. Whether "difficult" paths are saved at all, and
under what names, is optional (not very important).
For the default algorithm, I will start with the preferences.
- `directory/index.html` should have the highest priority, because
`directory`/`directory/` can be auto-generated listings, and are thus less
important. I don't know what `wget2` tracks, but if it knows that `directory`
was saved as `index.html` before, it should replace that with the actual
`index.html` (what happens to that version of `directory` is optional).
- `directory`/`directory/` are more or less the same. If there is no
`directory/index.html` saved (and once we learn that it is a directory, by
seeing a slash after it), save it as `index.html`, because this is the "native"
(most basic) representation of a web directory. If `index.html` already exists,
this can be because of 1) a real `index.html` or 2) a previous save of the
directory under the other slash variant. In that case the new version is saved
under an optional name (because in case 1 it has a lower priority than
`index.html`, and in case 2 users typically navigate from the root of the site,
so if the previous link was found earlier, it should probably have a higher
weight; if both `directory` and `directory/` are found on one page, this may
still hold, because more important links are closer to the top). It seems that
this algorithm slightly contradicts what I wrote about `index` above (the site
maintainer could forget that they have an old `index.html` and use just
directory paths on the main page); maybe you have some concrete examples which
we could look at to select better priorities?
In case we see `directory/whatever`, we should rename `directory` to
`directory/index.html` (unless that file already exists, which I discussed
above).
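The preferences above can be put together in a small Python sketch. It is an illustration under stated assumptions, not a description of what `wget` or `wget2` does: `files` is a hypothetical in-memory map from local name to the site path it was saved from, the helper names are invented, only the immediate parent directory is handled, and the fallback name (`index.html.N`) is arbitrary, since the text leaves it optional.

```python
def free_name(base, files):
    """First `base.N` not yet used as a local name (arbitrary fallback)."""
    n = 1
    while f"{base}.{n}" in files:
        n += 1
    return f"{base}.{n}"


def store(files, local, url_path):
    """Save url_path at `local`, or at a numbered fallback if it is taken."""
    if local in files:
        local = free_name(local, files)
    files[local] = url_path
    return local


def promote(files, directory):
    """`directory` is now known to be a directory: if it was saved as a
    plain file, move that page to `directory/index.html` (or a fallback)."""
    if directory in files:
        store(files, directory + "/index.html", files.pop(directory))


def save(files, url_path):
    """Return the local name chosen for url_path, updating `files`."""
    if url_path.endswith("/"):
        directory = url_path[:-1]
        promote(files, directory)
        return store(files, directory + "/index.html", url_path)
    directory, _, name = url_path.rpartition("/")
    if directory:
        promote(files, directory)  # seeing `directory/whatever` reveals it
    if name == "index.html":
        previous = files.get(url_path)
        if previous is not None and previous != url_path:
            # A listing took this slot earlier; the real index.html wins.
            files[free_name(url_path, files)] = previous
        files[url_path] = url_path
        return url_path
    return store(files, url_path, url_path)


files = {}
chosen = [save(files, p) for p in
          ("directory", "directory/x", "directory/", "directory/index.html")]
```

In this run `directory` is first stored as a plain file, moved to `directory/index.html` when `directory/x` appears, and finally displaced to a numbered name when the real `directory/index.html` arrives. The numbered names still carry both risks listed earlier (clashing with a real path, and suggesting a page that the site never had).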
--
Reply to this email directly or view it on GitLab:
https://gitlab.com/gnuwget/wget/-/issues/11#note_961446277