Time-Stamping

One of the most important aspects of mirroring information from the Internet is updating your archives.

Downloading the whole archive again and again just to replace a few changed files is expensive, both in wasted bandwidth and money and in the time the update takes. This is why all the mirroring tools offer the option of incremental updating.

Such an updating mechanism means that the remote server is scanned in search of new files. Only those new files will be downloaded, in place of the old ones.

A file is considered new if one of these two conditions is met:

  1. A file of that name does not already exist locally.
  2. A file of that name does exist, but the remote file was modified more recently than the local file.

To implement this, the program needs to be aware of the time of last modification of both the remote and the local files. This information is called a time-stamp.

Time-stamping in GNU Wget is turned on with the `--timestamping' (`-N') option, or through the `timestamping = on' directive in `.wgetrc'. With this option, for each file it intends to download, Wget will check whether a local file of the same name exists. If it does, and the remote file is older, Wget will not download it.

If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the time-stamps say.
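
The rules above can be sketched as a small decision function. This is a minimal illustration in Python, not Wget's actual code: the names FileInfo and needs_download are hypothetical, and time-stamps are assumed to be expressed in seconds since the epoch.

```python
from typing import NamedTuple, Optional


class FileInfo(NamedTuple):
    """What is known about one copy of a file (hypothetical helper type)."""
    mtime: float          # time of last modification, seconds since the epoch
    size: Optional[int]   # size in bytes, or None if unknown


def needs_download(local: Optional[FileInfo], remote: FileInfo) -> bool:
    """Decide whether the remote file should be fetched.

    `local` is None when no file of that name exists locally.
    """
    if local is None:
        return True   # no local copy: download unconditionally
    if remote.size is not None and local.size != remote.size:
        return True   # sizes differ: download whatever the time-stamps say
    # Otherwise download only if the remote file was modified more recently.
    return remote.mtime > local.mtime
```

For example, needs_download(None, FileInfo(1000, 5)) is true because there is no local copy, while a local copy with the same size and a later time-stamp than the remote one is left alone.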

Time-Stamping Usage

The usage of time-stamping is simple. Say you would like to download a file so that it keeps its date of modification.

wget -S http://www.gnu.ai.mit.edu/

A simple ls -l shows that the time-stamp on the local file matches the value of the Last-Modified header returned by the server. As you can see, the time-stamping info is preserved locally, even without `-N'.
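
As a rough illustration of how a Last-Modified value can become the time-stamp of a local file, the following Python sketch parses the header and applies it with os.utime. The function name apply_last_modified is hypothetical, and the sketch only shows the general idea, not how Wget itself does it.

```python
import os
from email.utils import parsedate_to_datetime


def apply_last_modified(path, last_modified):
    """Set the modification time of `path` from a Last-Modified header value.

    Returns the new modification time in seconds since the epoch.
    """
    mtime = parsedate_to_datetime(last_modified).timestamp()
    atime = os.stat(path).st_atime   # keep the access time unchanged
    os.utime(path, (atime, mtime))
    return mtime
```

After calling apply_last_modified("index.html", "Tue, 15 Nov 1994 12:45:26 GMT"), ls -l would report that date for index.html (the file name is only an example).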

Several days later, you would like Wget to check if the remote file has changed, and download it if it has.

wget -N http://www.gnu.ai.mit.edu/

Wget will ask the server for the last-modified date. If the local file is newer, the remote file will not be re-fetched. However, if the remote file is more recent, Wget will proceed to fetch it normally.

The same goes for FTP. For example:

wget ftp://ftp.ifi.uio.no/pub/emacs/gnus/*

ls will show that the time-stamps are set according to the state on the remote server. Reissuing the command with `-N' will make Wget re-fetch only the files that have been modified.

In both HTTP and FTP retrieval, Wget will time-stamp the local file correctly (with or without `-N') if it gets the stamps, i.e. gets the directory listing for FTP or the Last-Modified header for HTTP.

If you wished to mirror the GNU archive, you would run the following command once a week:

wget --timestamping -r ftp://prep.ai.mit.edu/pub/gnu/

HTTP Time-Stamping Internals

Time-stamping in HTTP is implemented by checking the Last-Modified header. If you wish to retrieve the file `foo.html' through HTTP, Wget will check whether `foo.html' exists locally. If it doesn't, `foo.html' will be retrieved unconditionally.

If the file does exist locally, Wget will first check its local time-stamp (similar to the way ls -l checks it), and then send a HEAD request to the remote server, asking for information about the remote file.

The Last-Modified header is examined to find which file was modified more recently (which makes it "newer"). If the remote file is newer, it will be downloaded; if it is older, Wget will give up.

Arguably, HTTP time-stamping should be implemented using the If-Modified-Since request header.
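
For comparison, a conditional request of that kind could be built as follows. This Python sketch uses the standard urllib.request and email.utils modules; the function name conditional_request is hypothetical, and the code only constructs the request without sending it. It illustrates the alternative mentioned above, not the HEAD-based method Wget uses.

```python
import urllib.request
from email.utils import formatdate


def conditional_request(url, local_mtime):
    """Build a GET request that asks for `url` only if it changed.

    `local_mtime` is the local file's modification time in seconds
    since the epoch; it is sent as an HTTP date in GMT.
    """
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since",
                       formatdate(local_mtime, usegmt=True))
    return request
```

A server that honors the header answers 304 Not Modified when the file has not changed since that date, so the body is transferred only when needed. Note that urllib.request.urlopen reports a 304 reply by raising HTTPError, which the caller would have to catch.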

FTP Time-Stamping Internals

In theory, FTP time-stamping works much the same as HTTP, except that FTP has no headers; time-stamps must be extracted from the directory listings.

For each directory from which files must be retrieved, Wget will use the LIST command to get the listing. It will try to analyze the listing, assuming that it is a Unix ls -l listing, and extract the time-stamps. The rest is exactly the same as for HTTP.
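
To give an idea of what analyzing such a listing involves, here is a minimal Python sketch that extracts the name, size, and time-stamp from one line. It is not Wget's parser: the function name parse_ls_line and the exact column layout (mode, link count, owner, group, size, date, name) are assumptions, and symbolic links and other listing variants are not handled. Recent files are listed with a time of day instead of a year, so the year has to be inferred from the current date.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# mode, link count, owner, group, size, month, day, year-or-time, name
LS_LINE = re.compile(
    r"^[-dlbcps]\S{9,10}\s+\d+\s+\S+\s+\S+\s+"
    r"(?P<size>\d+)\s+(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<when>\d{4}|\d{1,2}:\d{2})\s+(?P<name>.+)$"
)


def parse_ls_line(line, now):
    """Return (name, size, time-stamp) for one listing line, or None.

    `now` is the current time as a naive datetime; the result is also
    naive, because the listing carries no time zone information.
    """
    match = LS_LINE.match(line.rstrip("\r\n"))
    if match is None or match.group("month") not in MONTHS:
        return None
    month = MONTHS[match.group("month")]
    day = int(match.group("day"))
    when = match.group("when")
    stamp = None
    if ":" not in when:
        # Older files: the listing gives the year but no time of day.
        try:
            stamp = datetime(int(when), month, day)
        except ValueError:
            return None
    else:
        # Recent files: no year is given; pick the latest year that
        # does not put the time-stamp in the future.
        hour, minute = (int(part) for part in when.split(":"))
        for year in (now.year, now.year - 1):
            try:
                candidate = datetime(year, month, day, hour, minute)
            except ValueError:
                continue
            if candidate <= now:
                stamp = candidate
                break
        if stamp is None:
            return None
    return match.group("name"), int(match.group("size")), stamp
```

Lines that do not look like file entries, such as the leading "total" line, yield None and can simply be skipped.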

The assumption that every directory listing is a Unix-style listing may sound extremely constraining, but in practice it is not, as many non-Unix FTP servers use the Unixoid listing format because most (all?) of the clients understand it. Bear in mind that RFC 959 defines no standard way to get a file list, let alone the time-stamps. We can only hope that a future standard will define this.

Another non-standard solution is the MDTM command, supported by some FTP servers (including the popular wu-ftpd), which returns the exact modification time of the specified file. Wget may support this command in the future.
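
A reply to MDTM is much easier to interpret than a directory listing. The Python sketch below parses one, assuming the reply has the form "213 YYYYMMDDHHMMSS" with the time in UTC, which is the form later standardized in RFC 3659. The function name parse_mdtm_reply is hypothetical.

```python
from datetime import datetime, timezone


def parse_mdtm_reply(reply):
    """Turn an MDTM reply such as '213 19941115124526' into a datetime.

    An optional fractional part after the seconds is ignored.
    Raises ValueError if the reply is not a well-formed 213 reply.
    """
    code, _, value = reply.strip().partition(" ")
    if code != "213":
        raise ValueError("unexpected MDTM reply: %r" % reply)
    stamp = datetime.strptime(value[:14], "%Y%m%d%H%M%S")
    return stamp.replace(tzinfo=timezone.utc)
```

With Python's ftplib, such a reply could be obtained from a connected FTP object with ftp.sendcmd("MDTM " + filename), provided the server supports the command.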

