
Following Links

When retrieving recursively, you do not want to retrieve loads of unnecessary data. Most of the time users know exactly what they want to download, and want Wget to follow only specific links.

For example, if you wish to download the music archive from `fly.cc.fer.hr', you will not want to download all the home pages that happen to be referenced by an obscure part of the archive.

Wget provides several mechanisms that allow you to fine-tune which links it will follow.

Relative Links

When only relative links are followed (option `-L'), recursive retrieval will never span hosts. No time-expensive DNS lookups will be performed, and the process will be very fast, with minimal strain on the network. This will often suit your needs, especially when mirroring the output of various x2html converters, since they generally output relative links.
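The distinction between relative and absolute links can be sketched in a few lines of Python. This is only an illustration, not Wget's own code: the function name `is_relative_link` is hypothetical, and treating every link without a scheme and without a host as relative is an assumption that may differ from Wget's exact test.

```python
from urllib.parse import urlsplit


def is_relative_link(href):
    """Return True if href names no scheme and no host.

    Hypothetical helper: it only models the idea that a relative link
    cannot point at another host, so following it needs no DNS lookup.
    """
    parts = urlsplit(href)
    return not parts.scheme and not parts.netloc


# Links as they might appear in an HTML page (made-up examples).
for href in ["../index.html", "chapter2.html#sec1", "http://fly.cc.fer.hr/"]:
    kind = "relative" if is_relative_link(href) else "absolute"
    print(href, "->", kind)
```

Under this model, a retrieval restricted to relative links would keep the first two links and skip the third.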

Host Checking

The drawback of following only relative links is that people often mix them with absolute links to the very same host, and even to the very same page. In host-checking mode (which is the default mode for following links), all URLs that refer to the same host will be retrieved.

The problem with this mode is host and domain aliases. Wget has no way of knowing from the names alone that `regoc.srce.hr' and `www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the same as `fly.cc.etf.hr'. Whenever an absolute link is encountered, the host is looked up in DNS with gethostbyname to check whether it is in fact the same host. Although the results of gethostbyname are cached, this is still a great slowdown, e.g. when dealing with large indices of home pages on different hosts (because each of the hosts must be DNS-resolved to see whether it might be an alias of the starting host).

To avoid the overhead you may use `-nh', which turns off DNS resolving and makes Wget compare hosts literally. This makes things run much faster, but also much less reliably (e.g. `www.srce.hr' and `regoc.srce.hr' will be flagged as different hosts).

Note that HTTP/1.1 allows one IP address to support several virtual servers, each of them with its own root; this feature is also used by many HTTP/1.0 servers. Such "servers" are then distinguished by their hostnames (all of which point to the same IP address); for this to work, a client must send a Host header, which is what Wget does. However, in that case Wget must not try to divine a host's "real" address, nor try to use the same hostname for each access, i.e. `-nh' must be turned on.

In other words, the `-nh' option must be used to enable retrieval from virtual servers distinguished by their hostnames. As the number of such server setups grows, the behavior of `-nh' may become the default in the future.
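The difference between resolved and literal host comparison can be modeled as follows. This is a minimal sketch under stated assumptions, not Wget's implementation: `make_host_checker` and its `resolve` parameter are hypothetical names, and the addresses in the example are made up (taken from the 192.0.2.0/24 documentation range).

```python
def make_host_checker(start_host, resolve=None):
    """Return a function that tells whether a host equals start_host.

    resolve is a callable mapping a hostname to an address. When it is
    None, hostnames are compared literally, which models `-nh'.
    """
    cache = {}  # each hostname is resolved at most once

    def address(host):
        if host not in cache:
            cache[host] = resolve(host)
        return cache[host]

    def same_host(host):
        if host.lower() == start_host.lower():
            return True
        if resolve is None:
            return False  # literal comparison only
        return address(host) == address(start_host)

    return same_host


# Made-up alias table standing in for DNS.
FAKE_DNS = {"www.srce.hr": "192.0.2.10", "regoc.srce.hr": "192.0.2.10"}

resolved = make_host_checker("www.srce.hr", resolve=FAKE_DNS.__getitem__)
literal = make_host_checker("www.srce.hr")

print(resolved("regoc.srce.hr"))  # True: both names share one address
print(literal("regoc.srce.hr"))   # False: the names differ literally
```

Passing `socket.gethostbyname` as `resolve` would perform real lookups. Note that the resolved variant would also treat two virtual servers on one IP address as the same host, which is the case where the text says `-nh' is required.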

Domain Acceptance

With the `-D' option you may specify the domains that will be followed. Hosts whose domain is not in this list will not be DNS-resolved. Thus you can specify `-Dmit.edu' just to make sure that nothing outside of MIT gets looked up. This also means that `-D' does not imply `-H' (span all hosts), which must be specified explicitly. Feel free to use this option, since it will speed things up while keeping almost all the reliability of checking every host. Thus you could invoke

wget -r -D.hr http://fly.cc.fer.hr/

to make sure that only the hosts in the `.hr' domain get looked up in DNS to see whether they are the same as `fly.cc.fer.hr'. So `fly.cc.etf.hr' will be checked (only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be checked.

Of course, domain acceptance can also be used to limit the retrieval to particular domains while spanning hosts within them, but then you must specify `-H' explicitly. E.g.:

wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/

will start with `http://www.mit.edu/', following links across MIT and Stanford.

If there are domains you specifically want to exclude, you can do so with `--exclude-domains', which accepts the same type of arguments as `-D', but excludes all the listed domains. For example, if you want to download from all the hosts in the `foo.edu' domain, with the exception of `sunsite.foo.edu', you can do it like this:

wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
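The interplay of `-D' and `--exclude-domains' can be sketched as a simple filter on hostnames. The function names below are hypothetical, and the sketch assumes that a domain matches when it is a case-insensitive suffix of the hostname and that an exclusion overrides an acceptance; Wget's actual matching may differ in detail.

```python
def host_in_domains(host, domains):
    """True if host ends with any of the listed domains (suffix test)."""
    host = host.lower()
    return any(host.endswith(domain.lower()) for domain in domains)


def domain_allowed(host, accept=(), exclude=()):
    """Model of `-D accept' combined with `--exclude-domains exclude'.

    Assumption: an empty accept list means no domain restriction.
    """
    if exclude and host_in_domains(host, exclude):
        return False
    if accept and not host_in_domains(host, accept):
        return False
    return True


# Mirrors: wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu ...
for host in ["www.foo.edu", "sunsite.foo.edu", "www.gnu.ai.mit.edu"]:
    allowed = domain_allowed(host, accept=["foo.edu"],
                             exclude=["sunsite.foo.edu"])
    print(host, "->", "follow" if allowed else "skip")
```

In this model only `www.foo.edu' is followed: `sunsite.foo.edu' is excluded explicitly, and `www.gnu.ai.mit.edu' is outside the accepted domain.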

All Hosts

When `-H' is specified without `-D', all hosts are freely spanned. There are no restrictions whatsoever on what part of the net Wget will go to fetch documents, other than the maximum retrieval depth. If a page references `www.yahoo.com', so be it. Such an option is rarely useful by itself.

Types of Files

When downloading material from the web, you will often want to restrict the retrieval to only certain file types. For example, if you are interested in downloading GIFs, you will not be overjoyed to get loads of PostScript documents, and vice versa.

Wget offers two options to deal with this problem. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.

`-A acclist'
`--accept acclist'
`accept = acclist'
The argument to the `--accept' option is a list of file suffixes or patterns that Wget will download during recursive retrieval. A suffix is the ending part of a file name, and consists of "normal" letters, e.g. `gif' or `.jpg'. A matching pattern contains shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'. So, specifying `wget -A gif,jpg' will make Wget download only the files ending with `gif' or `jpg', i.e. GIFs and JPEGs. On the other hand, `wget -A "zelazny*196[0-9]*"' will download only files whose names begin with `zelazny' and contain a number from 1960 to 1969 anywhere after that. Look up the manual of your shell for a description of how pattern matching works. Of course, any number of suffixes and patterns can be combined into a comma-separated list and given as an argument to `-A'.
`-R rejlist'
`--reject rejlist'
`reject = rejlist'
The `--reject' option works the same way as `--accept', only its logic is the reverse; Wget will download all files except the ones matching the suffixes (or patterns) in the list. So, if you want to download a whole page except for the cumbersome MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'. Analogously, to download all files except the ones beginning with `bjork', use `wget -R "bjork*"'. The quotes are to prevent expansion by the shell.

The `-A' and `-R' options may be combined to fine-tune even further which files are retrieved. E.g. `wget -A "*zelazny*" -R .ps' will download all the files having `zelazny' as a part of their name, but not the PostScript files.

Note that these two options do not affect the downloading of HTML files; Wget must load all the HTML files to know where to go at all. Recursive retrieval would make no sense otherwise.
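The accept and reject rules described above can be modeled with Python's `fnmatch' module. This is a sketch, not Wget's code: the function names are hypothetical, an entry is assumed to be a pattern when it contains any of the characters `*', `?', `[' or `]' and a plain suffix otherwise, and the special treatment of HTML files is not modeled.

```python
from fnmatch import fnmatchcase

WILDCARDS = set("*?[]")


def matches(name, item):
    """Match a file name against one list entry (suffix or pattern)."""
    if any(ch in WILDCARDS for ch in item):
        return fnmatchcase(name, item)  # shell-like wildcard pattern
    return name.endswith(item)          # plain suffix such as gif or .jpg


def wanted(name, accept=(), reject=()):
    """Model of `-A accept' and `-R reject' applied to a file name."""
    if accept and not any(matches(name, item) for item in accept):
        return False
    if any(matches(name, item) for item in reject):
        return False
    return True


print(wanted("cover.gif", accept=["gif", "jpg"]))             # True
print(wanted("paper.ps", accept=["gif", "jpg"]))              # False
print(wanted("zelazny.ps", accept=["*zelazny*"], reject=[".ps"]))  # False
```

The last call mirrors `wget -A "*zelazny*" -R .ps': the name is accepted by the pattern but then rejected by the suffix.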

Directory-Based Limits

Regardless of other link-following facilities, it is often useful to restrict which files are retrieved based on the directories those files are placed in. There can be many reasons for this: the home pages may be organized in a reasonable directory structure, or some directories may contain useless information, e.g. `/cgi-bin' or `/dev' directories.

Wget offers three different options to deal with this requirement. Each option description lists a short name, a long name, and the equivalent command in `.wgetrc'.

`-I list'
`--include list'
`include_directories = list'
The `-I' option accepts a comma-separated list of directories to include in the retrieval. Any other directories will simply be ignored. The directories are absolute paths. So, if you wish to download from `http://host/people/bozo/' following only links to bozo's colleagues in the `/people' directory and the bogus scripts in `/cgi-bin', you can specify:
wget -I /people,/cgi-bin http://host/people/bozo/
`-X list'
`--exclude list'
`exclude_directories = list'
The `-X' option is exactly the reverse of `-I': it takes a list of directories to exclude from the download. E.g. if you do not want Wget to download things from the `/cgi-bin' directory, specify `-X /cgi-bin' on the command line. As with `-A'/`-R', these two options can be combined to fine-tune which subdirectories are downloaded. E.g. if you want to load all the files from the `/pub' hierarchy except for `/pub/worthless', specify `-I/pub -X/pub/worthless'.
`-np'
`--no-parent'
`no_parent = on'
The simplest, and often very useful, way of limiting directories is to disallow retrieval of links that refer to the hierarchy above the starting directory, i.e. to disallow ascent to the parent directory or directories. The `--no-parent' option (short `-np') is useful in this case. Using it guarantees that you will never leave the existing hierarchy. Suppose you invoke Wget with:
wget -r --no-parent http://somehost/~luzer/my-archive/
You may rest assured that none of the references to `/~his-girls-homepage/' or `/~luzer/all-my-mpegs/' will be followed. Only the archive you are interested in will be downloaded. Essentially, `--no-parent' is similar to `-I/~luzer/my-archive', only it handles redirections in a more intelligent fashion.
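The three directory limits can be combined into one check on the path of a link. The sketch below is illustrative only: the names `under', `directory_allowed' and `no_parent_of' are hypothetical, exclusions are assumed to win over inclusions, and the redirection handling mentioned for `--no-parent' is not modeled.

```python
def under(path, directory):
    """True if path lies inside directory (both absolute paths)."""
    directory = directory.rstrip("/")
    return path == directory or path.startswith(directory + "/")


def directory_allowed(path, include=(), exclude=(), no_parent_of=None):
    """Model of `-I include', `-X exclude' and `--no-parent'.

    no_parent_of is the directory the retrieval started in, or None
    when ascent to parent directories is permitted.
    """
    if no_parent_of is not None and not under(path, no_parent_of):
        return False
    if include and not any(under(path, d) for d in include):
        return False
    if any(under(path, d) for d in exclude):
        return False
    return True


# Mirrors: -I/pub -X/pub/worthless
print(directory_allowed("/pub/tools/readme.txt",
                        include=["/pub"], exclude=["/pub/worthless"]))  # True
print(directory_allowed("/pub/worthless/junk.txt",
                        include=["/pub"], exclude=["/pub/worthless"]))  # False
# Mirrors: wget -r --no-parent http://somehost/~luzer/my-archive/
print(directory_allowed("/~luzer/all-my-mpegs/clip.mpg",
                        no_parent_of="/~luzer/my-archive/"))            # False
```

Comparing whole path components, rather than raw string prefixes, keeps a directory such as `/pubs' from being mistaken for part of `/pub'.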

Following FTP Links

The rules for FTP are somewhat specific, as they need to be. FTP links in HTML documents are often included for purposes of reference, and it is often inconvenient to download them by default.

To have FTP links followed from HTML documents, you need to specify the `--follow-ftp' option. Once you do, FTP links will span hosts regardless of the `-H' setting. This is logical, as FTP links rarely point to the same host where the HTTP server resides. For similar reasons, the `-L' option has no effect on such downloads. On the other hand, domain acceptance (`-D') and suffix rules (`-A' and `-R') apply normally.

Also note that FTP directories reached through followed links will not be retrieved recursively any further.
