wget-dev

Re: wget | wget should save directory listings as index.html (#11)


From: Yaroslav Nikitenko (@ynikitenko)
Subject: Re: wget | wget should save directory listings as index.html (#11)
Date: Wed, 25 May 2022 15:19:53 +0000



Yaroslav Nikitenko commented:


You are right that this comes exactly from the site vs. filesystem distinction. 
To decide on the desired behaviour, I would consider the different user aims.

- create a static version of the site.

This is my case. The site has many pages and I'd like to download automatically 
as many of them as possible. 

The static version is meant to be used by other people, so I'd like every page 
to work as it is (if I wanted complex behaviour, I would have kept the CMS). 
In this case `directory.1` would simply not work, because the simplest file 
server will return `index.html` for a directory, but not some `directory.1` 
(neither users nor site links will know anything about `directory.1`).

- save as much information from the site as possible.

This might be useful for the site owner. In that case it would be reasonable 
to handle all corner cases and save `directory`, `directory/` and 
`directory/index.html` as distinct files (maybe on file systems that allow 
slashes in file names, though I'm not sure that would still work with most 
file servers).
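To illustrate the static-site case above, here is a minimal sketch of how a very simple static file server might look up a request path. It is not taken from any particular server or from `wget`; the function name `resolve` and the lookup rules are assumptions chosen for illustration, and it does no path sanitising.

```python
from pathlib import Path
from typing import Optional


def resolve(root: Path, url_path: str) -> Optional[Path]:
    """Return the file a minimal static server would send, or None (a 404).

    Hypothetical helper: a directory is served through its index.html,
    anything else must exist as a regular file under root.
    """
    candidate = root / url_path.strip("/")
    if candidate.is_dir():
        index = candidate / "index.html"
        return index if index.is_file() else None
    return candidate if candidate.is_file() else None
```

Under these assumptions a request for `/directory` finds `directory/index.html` if it exists, while a listing saved as `directory.1` is only reachable through the literal path `/directory.1`, which no existing link points to.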

Honestly, I don't think that having different content for `directory` and 
`directory/` is a good idea. This is pretty obvious, and e.g. [Google 
says](https://developers.google.com/search/blog/2010/04/to-slash-or-not-to-slash)
 "Your users, however, may find this configuration horribly confusing". I think 
similarly about `index.html`: if that file is present, then there is probably 
no need to create different content for the page of its parent directory. So 
my final preference would be to handle most sites as automatically as possible, 
and to support a separate option for overly complicated / wrong URLs (I don't 
think they should be preferred over the cleaner structure).

Maybe you are right that "directories" should end in a slash. This was not the 
case for my site (mostly created by other people), and it is not a uniformly 
accepted guideline (the Google post linked above, from 12 years ago, says that 
both versions are possible), so IMHO it would be nice for `wget` to handle 
these cases uniformly.
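
As a rough sketch of what "uniformly" could mean here, the function below maps `directory`, `directory/` and `directory/index.html` to the same local file. This is not how `wget` is implemented; `local_path_for` is a hypothetical name, and treating an extensionless last path segment as a directory is a heuristic assumption that a real tool could not rely on in every case.

```python
import posixpath
from urllib.parse import urlsplit


def local_path_for(url: str) -> str:
    """Map a URL to a relative local path, folding directory URLs together."""
    path = urlsplit(url).path
    last = posixpath.basename(path)
    # Assumption: a trailing slash, an empty path, or a last segment
    # without a dot means "this is a directory page".
    if path in ("", "/") or path.endswith("/") or "." not in last:
        return posixpath.join(path.lstrip("/"), "index.html")
    return path.lstrip("/")


# All three spellings end up in one file on disk.
for u in ("https://example.org/directory",
          "https://example.org/directory/",
          "https://example.org/directory/index.html"):
    print(local_path_for(u))
```

A site that really serves different content for `directory` and `directory/` would lose one of the two pages under this mapping, which is why a separate option for such cases would still be needed.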

-- 
Reply to this email directly or view it on GitLab: 
https://gitlab.com/gnuwget/wget/-/issues/11#note_959991717



