[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Bug-wget] [PATCH] Make wget capable of starting download from a spe
From: |
Yousong Zhou |
Subject: |
Re: [Bug-wget] [PATCH] Make wget capable of starting download from a specified position. |
Date: |
Sat, 21 Dec 2013 17:24:47 +0800 |
User-agent: |
Mutt/1.5.20 (2009-06-14) |
On Sat, Dec 21, 2013 at 01:51:04PM +0530, Darshit Shah wrote:
> I have a few comments on the patch. Commenting inline.
Thank you.
>
> On Sat, Dec 21, 2013 at 12:32 PM, Yousong Zhou <address@hidden>wrote:
>
> > This patch adds an option `--start-pos' for specifying starting position
> > of a download, both for HTTP and FTP. When specified, the newly added
> > option would override `--continue'. Apart from that, no existing code
> > should be affected.
> >
> > Signed-off-by: Yousong Zhou <address@hidden>
> > ---
> > Hi,
> >
> > I found myself needed this feature when I was trying to tunnel the
> > download of
> > big file (several gigabytes) from a remote machine back to local through a
> > somewhat flaky connection. It's a pain both for the server and local
> > network
> > users if we have to repeat the previously already downloaded part in case
> > that
> > the connection hangs or breaks. Specifying 'Range: ' header is not an
> > option
> > for wget (integrity check in the code would fail), and curl is not fast
> > enough.
> > So I decided to make this patch in hope that this can also be useful to
> > someone
> > else.
> >
> > What integrity check would fail on using the Range Header? And if you
> already have a partially downloaded file why is using the --continue switch
> on an option?
`--continue` only works if there is already a partially downloaded file
on disk. Otherwise, specifying `-c' will only tell wget to start from
scratch.
By 'Range: ' header I mean headers specified by `--header'. If the
server sends back a 'Content-Range: ' header in the response, wget would
think that it's unexpected or not matching what's already on the disk
(would be zero if there is no file on disk). If I get the code right,
the check is at `http.c:gethttp()':
2744 if ((contrange != 0 && contrange != hs->restval)
2745 || (H_PARTIAL (statcode) && !contrange))
2746 {
2747 /* The Range request was somehow misunderstood by the server.
2748 Bail out. */
2749 xfree_null (type);
2750 CLOSE_INVALIDATE (sock);
2751 xfree (head);
2752 return RANGEERR;
2753 }
In my situation, wget was trigger on the remote machine like the
following:
wget -O - --start-pos "$OFFSET" "$URL" | nc -lp 7193
Then on local machine, I would download with:
nc localhost 7193
Before these, a local forwarding tunnel has been setup with ssh to make
this possible. So in this case, there was no local file on the machine
where wget was triggerred and `--continue' will not work. I am sure
there are other cases `--start-pos' would be useful and that
`--start-pos' would make wget more complete.
>
> yousong
> >
> > doc/ChangeLog | 4 ++++
> > doc/wget.texi | 14 ++++++++++++++
> > src/ChangeLog | 9 +++++++++
> > src/ftp.c | 2 ++
> > src/http.c | 2 ++
> > src/init.c | 1 +
> > src/main.c | 1 +
> > src/options.h | 1 +
> > 8 files changed, 34 insertions(+), 0 deletions(-)
> >
> > diff --git a/doc/ChangeLog b/doc/ChangeLog
> > index 3b05756..df103c8 100644
> > --- a/doc/ChangeLog
> > +++ b/doc/ChangeLog
> > @@ -1,3 +1,7 @@
> > +2013-12-21 Yousong Zhou <address@hidden>
> > +
> > + * wget.texi: Add documentation for --start-pos.
> > +
> > 2013-10-06 Tim Ruehsen <address@hidden>
> >
> > * wget.texi: add/explain quoting of wildcard patterns
> > diff --git a/doc/wget.texi b/doc/wget.texi
> > index 4a1f7f1..166ea08 100644
> > --- a/doc/wget.texi
> > +++ b/doc/wget.texi
> > @@ -701,6 +701,20 @@ Another instance where you'll get a garbled file if
> > you try to use
> > Note that @samp{-c} only works with @sc{ftp} servers and with @sc{http}
> > servers that support the @code{Range} header.
> >
> > address@hidden offset
> > address@hidden continue retrieval
> > address@hidden incomplete downloads
> > address@hidden resume download
> > address@hidden start position
> > address@hidden address@hidden
> > +Start the download at position @var{OFFSET}. Offset may be expressed in
> > bytes,
> > +kilobytes with the `k' suffix, or megabytes with the `m' suffix.
> > +
> > +When specified, it would override the behavior of @samp{--continue}. When
> > +using this option, you may also want to explicitly specify an output
> > filename
> > +with @samp{-O FILE} in order to not overwrite an existing partially
> > downloaded
> > +file.
> > +
> > @cindex progress indicator
> > @cindex dot style
> > @item address@hidden
> > diff --git a/src/ChangeLog b/src/ChangeLog
> > index 42ce3e4..ab8a496 100644
> > --- a/src/ChangeLog
> > +++ b/src/ChangeLog
> > @@ -1,3 +1,12 @@
> > +2013-12-21 Yousong Zhou <address@hidden>
> > +
> > + * options.h: Add option --start-pos to specify start position of
> > + a download.
> > + * main.c: Same purpose as above.
> > + * init.c: Same purpose as above.
> > + * http.c: Utilize opt.start_pos for HTTP download.
> > + * ftp.c: Utilize opt.start_pos for FTP retrieval.
> > +
> > 2013-11-02 Giuseppe Scrivano <address@hidden>
> >
> > * http.c (gethttp): Increase max header value length to 512.
> > diff --git a/src/ftp.c b/src/ftp.c
> > index c2522ca..c7ab6ef 100644
> > --- a/src/ftp.c
> > +++ b/src/ftp.c
> > @@ -1632,6 +1632,8 @@ ftp_loop_internal (struct url *u, struct fileinfo
> > *f, ccon *con, char **local_fi
> > /* Decide whether or not to restart. */
> > if (con->cmd & DO_LIST)
> > restval = 0;
> > + else if (opt.start_pos)
> > + restval = opt.start_pos;
> > else if (opt.always_rest
> > && stat (locf, &st) == 0
> > && S_ISREG (st.st_mode))
> > diff --git a/src/http.c b/src/http.c
> > index 754b7ec..a354c6b 100644
> > --- a/src/http.c
> > +++ b/src/http.c
> > @@ -3098,6 +3098,8 @@ Spider mode enabled. Check if remote file
> > exists.\n"));
> > /* Decide whether or not to restart. */
> > if (force_full_retrieve)
> > hstat.restval = hstat.len;
> > + else if (opt.start_pos)
> > + hstat.restval = opt.start_pos;
> > else if (opt.always_rest
> > && got_name
> > && stat (hstat.local_file, &st) == 0
> > diff --git a/src/init.c b/src/init.c
> > index 84ae654..7f7a34e 100644
> > --- a/src/init.c
> > +++ b/src/init.c
> > @@ -271,6 +271,7 @@ static const struct {
> > { "showalldnsentries", &opt.show_all_dns_entries, cmd_boolean },
> > { "spanhosts", &opt.spanhost, cmd_boolean },
> > { "spider", &opt.spider, cmd_boolean },
> > + { "startpos", &opt.start_pos, cmd_bytes },
> > { "strictcomments", &opt.strict_comments, cmd_boolean },
> > { "timeout", NULL, cmd_spec_timeout },
> > { "timestamping", &opt.timestamping, cmd_boolean },
> > diff --git a/src/main.c b/src/main.c
> > index 19d7253..4fbfaee 100644
> > --- a/src/main.c
> > +++ b/src/main.c
> > @@ -281,6 +281,7 @@ static struct cmdline_option option_data[] =
> > { "server-response", 'S', OPT_BOOLEAN, "serverresponse", -1 },
> > { "span-hosts", 'H', OPT_BOOLEAN, "spanhosts", -1 },
> > { "spider", 0, OPT_BOOLEAN, "spider", -1 },
> > + { "start-pos", 0, OPT_VALUE, "startpos", -1 },
> > { "strict-comments", 0, OPT_BOOLEAN, "strictcomments", -1 },
> > { "timeout", 'T', OPT_VALUE, "timeout", -1 },
> > { "timestamping", 'N', OPT_BOOLEAN, "timestamping", -1 },
> > diff --git a/src/options.h b/src/options.h
> > index ad89627..9ba1760 100644
> > --- a/src/options.h
> > +++ b/src/options.h
> > @@ -115,6 +115,7 @@ struct options
> > bool ask_passwd; /* Ask for password? */
> >
> > bool always_rest; /* Always use REST. */
> > + wgint start_pos; /* Start position of a download. */
> > char *ftp_user; /* FTP username */
> > char *ftp_passwd; /* FTP password */
> > bool netrc; /* Whether to read .netrc. */
>
>
> Simply setting the startpos and restval values in Wget will not help you if
> the server will keep sending the file from scratch on every request. The
> only way to partially download a file is by using the Range header. Because
> then, the server will send only the requested parts of the file. If a
Setting the value will let wget send an appropriate 'Range: ' header or 'REST'
command to the server. Below is two demo for this patch.
address@hidden:~/wget/src$ ./wget -O - --start-pos=1213055
ftp://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz | wc -c
--2013-12-21 17:15:05-- ftp://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz
=> '-'
Resolving ftp.gnu.org... 2001:4830:134:3::b, 208.118.235.20
Connecting to ftp.gnu.org|2001:4830:134:3::b|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /gnu/wget ... done.
==> SIZE wget-1.10.2.tar.gz ... 1213056
==> EPSV ... done. ==> REST 1213055 ... done.
==> RETR wget-1.10.2.tar.gz ... done.
Length: 1213056 (1.2M), 1 remaining (unauthoritative)
100%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>]
1,213,056 --.-K/s in 0s
2013-12-21 17:15:10 (136 KB/s) - written to stdout [1213056]
1
address@hidden:~/wget/src$ ./wget -O - --start-pos=1213055
http://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz | wc -c
--2013-12-21 17:18:45-- http://ftp.gnu.org/gnu/wget/wget-1.10.2.tar.gz
Resolving ftp.gnu.org... 2001:4830:134:3::b, 208.118.235.20
Connecting to ftp.gnu.org|2001:4830:134:3::b|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1213056 (1.2M), 1 remaining [application/x-gzip]
Saving to: 'STDOUT'
100%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>]
1,213,056 --.-K/s in 0s
2013-12-21 17:18:47 (112 KB/s) - written to stdout [1213056/1213056]
1
address@hidden:~/wget/src$
> server does not support the range header, you really have no other option
> but need to resume downloading the file from scratch. This is a limitation
> of the HTTP protocol and no client can help you get around it.
Yes. I will add this to the doc.
>
>
> Also, please send patches as git format-patch attachments. Email clients
> often mangle with spaces and newlines and cause problems when applying them.
The patch was formatted with git-format-patch then sent with git-send-email.
I will send them as attachments if they are preferred here.
Thank you for reviewing this. More suggestions?
yousong
>
> --
> Thanking You,
> Darshit Shah