wget-dev

CRITICAL BUG: wget -N is leaving corrupted files


From: Romain Morotti (London)
Subject: CRITICAL BUG: wget -N is leaving corrupted files
Date: Fri, 24 May 2024 11:53:47 +0000

Hello,

Apologies for the long email; this issue was quite difficult to debug. I hope 
you can roll a fix.
There are previous bug reports related to this issue, but none of them reached 
a repro or an explanation.


TL;DR: critical bug in wget. With the -N flag, wget leaves corrupted (partial) 
files in place and never re-downloads them.

ROOT CAUSE: A change of behaviour in or around v1.17. The wget -N code was 
rewritten and a new flag, --no-if-modified-since, was added; it is off by 
default. Unfortunately the new default behaviour is incorrect and leaves 
corrupted downloads.

FIX: -N must always behave as if --no-if-modified-since were set, otherwise 
wget will leave corrupted files.
The --no-if-modified-since behaviour should be the default when -N is used.

WORKAROUND: Pass “-N --no-if-modified-since” together on the command line. 
However, the flag does not exist in older versions of wget, which will fail on 
the unknown option. You may have to detect the wget version and pass the 
relevant flags if you deploy on multiple systems with different wget versions.
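
A minimal sketch of that detection, assuming the option appears in the output of `wget --help` on versions that support it (this matches the builds I looked at, but treat it as an assumption). The function name `wget_timestamp_flags` is made up for this example.

```shell
#!/bin/sh
# Sketch: pick timestamping flags based on what the installed wget supports.
# Assumption: supported builds list --no-if-modified-since in `wget --help`.

# wget_timestamp_flags HELP_TEXT
# Prints the flags to use for a timestamped download.
wget_timestamp_flags() {
    case $1 in
        *--no-if-modified-since*)
            # Newer wget: force the HEAD-then-GET behaviour.
            printf '%s\n' '-N --no-if-modified-since'
            ;;
        *)
            # Older wget: the flag is unknown and -N alone already sends HEAD.
            printf '%s\n' '-N'
            ;;
    esac
}

# Usage (URL is the placeholder used elsewhere in this email):
#   flags=$(wget_timestamp_flags "$(wget --help 2>/dev/null)")
#   wget $flags https://mycompany.com/myarchive.tar.lz
```

`$flags` is left unquoted in the usage line on purpose so that it splits into two separate arguments.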


CONTEXT:

We use wget to download archives and large files to deploy. We started getting 
regular issues with corrupted archives after moving to Ubuntu 22 and its newer 
version of wget (1.21).


```
$ wget -N https://mycompany.com/myarchive.tar.gz
$ tar -xf myarchive.tar.gz
(stdin): File ends unexpectedly at pos 94479367
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
```

It took me a long time to get to the bottom of it: wget leaves partial, 
corrupted downloads in place. It is a bug in wget itself.
The -N flag is meant to (re)download a file only when the timestamp or the 
size of the file has changed. It stopped working as expected in recent 
versions, such as the one shipped with Ubuntu 22.


We see the issue regularly in production.
It triggers after wget is interrupted once. Interruptions can happen for any 
reason: the user can Ctrl+C a script, a deployment can be cancelled, the 
process can be killed, or the machine can be rebooted at any moment.
When wget is interrupted, it leaves a partially downloaded file. The timestamp 
is newer but the size doesn't match the expected file size.

* In older versions of wget, wget sent a HEAD request to get the file size and 
the timestamp, then downloaded the file if the date or the size had changed. 
wget worked as expected.
* In recent versions of wget, wget does not detect that the file size is 
incorrect. wget is stuck with a bad file and can never recover.
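
To make the older behaviour concrete, here is a simplified model of the decision as I understand it from the logs. It is not wget's actual code, and the function name `needs_download` is invented for illustration. Timestamps are seconds since the epoch.

```shell
#!/bin/sh
# Simplified model of the old "wget -N" decision (not wget's real code).

# needs_download LOCAL_SIZE LOCAL_MTIME REMOTE_SIZE REMOTE_MTIME
# Returns 0 (true) when the file should be fetched again.
needs_download() {
    local_size=$1
    local_mtime=$2
    remote_size=$3
    remote_mtime=$4

    # The server copy is newer than the local one.
    if [ "$remote_mtime" -gt "$local_mtime" ]; then
        return 0
    fi
    # Same or older timestamp, but the sizes differ: the local file is
    # partial or otherwise wrong. This is the check a bare 304 cannot do.
    if [ "$remote_size" -ne "$local_size" ]; then
        return 0
    fi
    return 1
}
```

A truncated file has a newer local timestamp, so only the size comparison can catch it, and that comparison needs the Content-Length from the server.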

Recovery requires a developer or SRE to log onto the affected machine and 
delete the bad files left over by wget.


REPRO:

You can Ctrl+C to interrupt wget or you can run “truncate” to simulate a 
partial download.

```
wget --version
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
truncate --size 1 myarchive.tar.lz
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
```


DEBUGGING: see the logs below for the last call to wget, after truncate.

Notice that recent versions of wget send a single GET request with an 
If-Modified-Since header, and the server replies with a 304 response to say 
the content did not change.
The 304 response has no Content-Length header and no body.

This is an edge case of the HTTP spec. The Content-Length header is not 
required on a 304 response. The header may be set but it is not required.
Looking at the web server response (Artifactory/Tomcat), Content-Length is not 
set.
See HTTP RFC https://datatracker.ietf.org/doc/html/rfc7232#section-4.1

This is an interesting interaction between the HTTP spec and the real world: 
with a bare 304, wget learns neither the file size nor the content.
It turns out that knowing the file size is critical for "wget -N" to operate 
as expected. Without it, wget gets into a bad state where the file on disk is 
bad but wget can’t detect the problem and can’t re-download.

I think wget must always send a HEAD request first.
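
Until that happens, a script can do the HEAD check itself. The sketch below assumes that `wget --spider --server-response` prints the response headers on stderr and that the server includes Content-Length in that response (ours does for HEAD, per the 1.14 log below). The helper names `parse_content_length` and `size_mismatch` are invented for this example.

```shell
#!/bin/sh
# Sketch: detect a partial local file before calling "wget -N".

# parse_content_length: reads response headers on stdin and prints the
# value of the last Content-Length header, if any.
parse_content_length() {
    tr -d '\r' |
        awk 'tolower($1) == "content-length:" { v = $2 }
             END { if (v != "") print v }'
}

# size_mismatch FILE EXPECTED_SIZE
# Returns 0 (true) when FILE exists and its size differs from EXPECTED_SIZE.
size_mismatch() {
    [ -f "$1" ] || return 1
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" -ne "$2" ]
}

# Usage (url and file are placeholders):
#   expected=$(wget --spider --server-response "$url" 2>&1 | parse_content_length)
#   if [ -n "$expected" ] && size_mismatch "$file" "$expected"; then
#       rm -f -- "$file"    # drop the partial file so wget -N fetches it again
#   fi
#   wget -N "$url"
```

If the request is redirected, several header blocks are printed; taking the last Content-Length is a guess at the final response and may need adjusting.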


```
wget 1.14 on CentOS 7
Works as expected: sends a HEAD request, detects that the size has changed, 
then re-downloads.

wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:52--  https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 24 May 2024 09:42:52 GMT
  Content-Type: application/octet-stream
  Content-Length: 185751081
  Connection: keep-alive
  Server: Artifactory
  X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
  X-Artifactory-Node-Id: dc09bebb5d42
  Last-Modified: Thu, 23 May 2024 10:46:13 GMT
  ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
  X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
  Accept-Ranges: bytes
  X-Artifactory-Filename: myarchive.tar.lz
  Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
The sizes do not match (local 1) -- retrieving.

--2024-05-24 10:42:52--  https://mycompany.com/myarchive.tar.lz
Reusing existing connection to mycompany.com:443.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Fri, 24 May 2024 09:42:52 GMT
  Content-Type: application/octet-stream
  Content-Length: 185751081
  Connection: keep-alive
  Server: Artifactory
  X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
  X-Artifactory-Node-Id: dc09bebb5d42
  Last-Modified: Thu, 23 May 2024 10:46:13 GMT
  ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
  X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
  Accept-Ranges: bytes
  X-Artifactory-Filename: myarchive.tar.lz
  Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
Saving to: ‘myarchive.tar.lz’

100%[==============================================================================>] 185,751,081  277MB/s   in 0.6s

2024-05-24 10:42:53 (277 MB/s) - ‘myarchive.tar.lz’ saved [185751081/185751081]
```


```
wget 1.21 on Ubuntu 22
Doesn’t work: wget incorrectly thinks there is nothing to download.

wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:11--  https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 304 Not Modified
  Date: Fri, 24 May 2024 09:42:11 GMT
  Connection: keep-alive
  Server: Artifactory
  X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
  X-Artifactory-Node-Id: dc09bebb5d42
  Last-Modified: Thu, 23 May 2024 10:46:13 GMT
  ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
  X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
  X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
  Accept-Ranges: bytes
  X-Artifactory-Filename: myarchive.tar.lz
  Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
File ‘myarchive.tar.lz’ not modified on server. Omitting download.
```



Regards.




