From: Ángel González
Subject: Re: [Bug-wget] wget mirror site failing due to file / directory name clashes
Date: Sat, 13 Oct 2012 15:44:46 +0200
User-agent: Thunderbird

On 12/10/12 15:38, Paul Beckett (ITCS) wrote:
> I am attempting to use wget to create a mirrored copy of a CMS (Liferay)
> website. I want to be able to failover to this static copy in case the
> application server goes offline. I therefore need the URLs to remain
> absolutely identical. The problem I have is that I cannot figure out how I
> can configure wget in a way that will cope with:
> http://www.example.com/about
> http://www.example.com/about/something
>
> In this case either the file or directory 'about' already exists and prevents
> the second from being created.
>
> Initially I thought the most obvious solution was to rely on Apache's
> DirectoryIndex, and save the files as:
> /about/index.html
> /about/something/index.html
>
> But, currently I can't figure out how I can do this in a way that doesn't
> break either the relative path to other pages or create links to the
> index.html rather than the original location. I need the links (a href etc.)
> to still go to /about and not explicitly call /index.html - as this will mean
> people may bookmark things that won't exist when the CMS comes back.
>
> If anyone can offer me any advice on how I can achieve this (either the
> correct options, or how I could patch the source code), I would be
> extremely grateful.
>
> Thanks,
> Paul
>
>
>
> /usr/local/bin/wget --background --append-output=/tmp/wget-log --no-verbose
> --tries=20 --waitretry=10 --retry-connrefused --limit-rate=100m
> --quota=10000m --timestamping
> --directory-prefix=/usr/local/apache2/content/uk.ac.uea.www_flat2
> --protocol-directories --user-agent="UEA WebSite Flattener"
> --backup-converted -e robots=off --page-requisites --convert-links
> --recursive --level=inf --trust-server-names --domains example.com
> www.example.com

Download with --adjust-extension. This way, you will get:

/about.html
/about/something.html
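
To see why this avoids the clash, here is a small sketch that only imitates the two layouts on disk. It does not run wget; the temporary directory and the "plain" and "adjusted" names are made up for the illustration.

```shell
#!/bin/sh
# Illustration only: imitates the files wget would write for
# /about and /about/something. No download takes place.
root=$(mktemp -d)

# Without --adjust-extension, /about is saved as a file named "about",
# so the directory "about" needed for /about/something cannot be created.
mkdir "$root/plain"
touch "$root/plain/about"
if mkdir "$root/plain/about" 2>/dev/null; then
    plain_result="no clash"
else
    plain_result="clash"
fi

# With --adjust-extension, the page is saved as about.html, which leaves
# the name "about" free for the directory.
mkdir "$root/adjusted"
touch "$root/adjusted/about.html"
mkdir "$root/adjusted/about"
touch "$root/adjusted/about/something.html"

echo "plain layout: $plain_result"
find "$root/adjusted" -type f | sort
```

The file about.html and the directory about/ have different names, so both can live side by side in the same parent directory.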

Then add these rules to the configuration for the root of the static copy:

RewriteEngine On
RewriteCond %{SCRIPT_FILENAME} !\.html$
RewriteRule ^(.*[^/])/?$ $1.html

They append the .html extension to the requested URLs.
If your CMS returns non-HTML content on some URLs, you
will need to adjust the rules to exclude those URLs from the rewrite.
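
One possible way to make that exclusion, as an untested sketch: rewrite only when the request does not already name a saved file and a matching .html file exists. It assumes the rules live in a .htaccess file or <Directory> block at the DocumentRoot of the static copy, and that images, stylesheets and similar files were saved under their original names.

```apache
RewriteEngine On
# Leave the request alone if it already names a saved file (image, CSS, ...).
RewriteCond %{REQUEST_FILENAME} !-f
# Rewrite only when the mirrored page exists as <path>.html.
# Assumption: the static copy is served from DocumentRoot itself.
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.*[^/])/?$ $1.html [L]
```

The $1 in the second condition refers to the group captured by the RewriteRule pattern, so /about and /about/ both check for about.html.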

Also, I'd remove --convert-links from the command line. It rewrites the links
inside the downloaded pages, and you want the same page contents as the real pages.