

Appendices

This chapter contains some references I consider useful, like the Robots Exclusion Standard specification, as well as a list of contributors to GNU Wget.

Robots

Since Wget is able to traverse the web, it counts as one of the Web robots. Thus Wget understands the Robots Exclusion Standard (RES), that is, the contents of `/robots.txt', which server administrators use to shield parts of their systems from the wanderings of Wget.

Norobots support is turned on only when retrieving recursively, and never for the first page. Thus, you may issue:

wget -r http://fly.cc.fer.hr/

First the index of fly.cc.fer.hr will be downloaded. If Wget finds anything worth downloading on the same host, only then will it load `/robots.txt' and decide whether or not to load the links after all. `/robots.txt' is loaded only once per host. Wget does not support the robots META tag.
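The once-per-host behaviour can be pictured as a small cache keyed by host name. The following Python sketch is only an illustration and is not Wget's actual code; the `RobotsCache' class and the `fetch' callable are hypothetical names introduced here.

```python
from urllib.parse import urlsplit


class RobotsCache:
    """Remember the robots.txt text for each host (illustrative only)."""

    def __init__(self, fetch):
        # `fetch` is a hypothetical callable: host name -> robots.txt text.
        self._fetch = fetch
        self._by_host = {}

    def rules_for(self, url):
        # Host names are case insensitive, so normalize before lookup.
        host = urlsplit(url).netloc.lower()
        if host not in self._by_host:
            # First URL seen on this host: load /robots.txt exactly once.
            self._by_host[host] = self._fetch(host)
        return self._by_host[host]
```

Every later URL on the same host reuses the stored text, so the server is asked for `/robots.txt' only once during a recursive retrieval.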

The description of the norobots standard was written, and is maintained, by Martijn Koster <m.koster@webcrawler.com>. With his permission, I contribute a (slightly modified) texified version of the RES.

Introduction to RES

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

This document represents a consensus on 30 June 1994 on the robots mailing list (robots@webcrawler.com), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title.

It is not an official standard backed by a standards body, or owned by any commercial organization. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.

The latest version of this document can be found at:

http://info.webcrawler.com/mak/projects/robots/norobots.html

RES Format

The format and semantics of the `/robots.txt' file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form:

<field>:<optionalspace><value><optionalspace>

The field name is case insensitive. Comments can be included in the file using UNIX Bourne shell conventions: the `#' character indicates that the preceding space (if any) and the remainder of the line up to the line termination are discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognized headers are ignored.

The presence of an empty `/robots.txt' file has no explicit associated semantics; it will be treated as if it were not present, i.e. all robots will consider themselves welcome.
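The format rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Wget's parser: the function name `parse_robots' is hypothetical, and lines that do not contain a colon are simply skipped.

```python
def parse_robots(text):
    """Split robots.txt text into records, each a list of (field, value) pairs."""
    records, current = [], []
    for raw in text.splitlines():  # handles CR, CR/NL and NL terminators
        if raw.lstrip().startswith("#"):
            # Comment-only line: discarded completely, not a record boundary.
            continue
        # Drop a trailing comment together with the space preceding it.
        line = raw.split("#", 1)[0].strip()
        if not line:
            # A blank line closes the record being built, if any.
            if current:
                records.append(current)
                current = []
            continue
        field, sep, value = line.partition(":")
        if not sep:
            continue  # assumption: ignore lines without a "field:" part
        # Field names are case insensitive, so store them in lower case.
        current.append((field.strip().lower(), value.strip()))
    if current:
        records.append(current)
    return records
```

An empty file yields an empty list of records, which matches the rule that an empty `/robots.txt' is treated as if it were not present.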

User-Agent Field

The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is `*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the `/robots.txt' file.
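One way to picture the selection of a record is the following Python sketch. It rests on assumptions that the text above does not spell out: version information is taken to be whatever follows a `/' in the robot's name (as in `Wget/1.5'), the field value is looked for inside the robot's name, and the function names are hypothetical.

```python
def matches_agent(field_value, robot_name):
    """Case-insensitive substring match, ignoring version information."""
    # Assumption: the version follows a "/" in the robot's name.
    name = robot_name.split("/", 1)[0].lower()
    value = field_value.strip().lower()
    return bool(value) and value in name


def select_record(records, robot_name):
    """Return the Disallow values that apply to robot_name, or None.

    `records` is a list of (user_agents, disallow_values) pairs.
    """
    default = None
    for agents, disallows in records:
        if any(a != "*" and matches_agent(a, robot_name) for a in agents):
            return disallows  # a record naming this robot wins
        if "*" in agents and default is None:
            default = disallows  # remember the default policy
    return default  # None means no record applies at all
```

A specific record is preferred over the `*' record regardless of the order in which they appear in the file.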

Disallow Field

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, `Disallow: /help' disallows both `/help.html' and `/help/index.html', whereas `Disallow: /help/' would disallow `/help/index.html' but allow `/help.html'.

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
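The prefix rule can be checked with a short Python sketch. The function name `is_allowed' is hypothetical, and the sketch compares plain path strings without any URL decoding.

```python
def is_allowed(path, disallow_values):
    """Return False if path starts with any non-empty Disallow value."""
    for value in disallow_values:
        # An empty value disallows nothing, so it is skipped.
        if value and path.startswith(value):
            return False
    return True


# The `/help' versus `/help/' distinction described above:
print(is_allowed("/help.html", ["/help"]))         # False
print(is_allowed("/help/index.html", ["/help"]))   # False
print(is_allowed("/help.html", ["/help/"]))        # True
print(is_allowed("/help/index.html", ["/help/"]))  # False
```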

Norobots Examples

The following example `/robots.txt' file specifies that no robots should visit any URL starting with `/cyberworld/map/' or `/tmp/':

# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear

This example `/robots.txt' file specifies that no robots should visit any URL starting with `/cyberworld/map/', except the robot called `cybermapper':

# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:

This example indicates that no robots should visit this site further:

# go away
User-agent: *
Disallow: /
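The examples can be tried out with the `urllib.robotparser' module from the Python standard library. That module is an independent implementation and not the code Wget uses, so take this as a sketch of the expected outcomes; the helper name `build_parser' and the sample URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
# robots.txt for http://www.site.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""


def build_parser(text):
    """Feed robots.txt text to the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(EXAMPLE)
# An ordinary robot falls under the `*' record and is kept out of the map.
print(parser.can_fetch("Wget", "http://www.site.com/cyberworld/map/a.html"))
# cybermapper has its own record with an empty Disallow value.
print(parser.can_fetch("cybermapper", "http://www.site.com/cyberworld/map/a.html"))
# Paths outside the disallowed prefix stay open to everyone.
print(parser.can_fetch("Wget", "http://www.site.com/index.html"))
```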

Security Considerations

When using Wget, you must be aware that it sends unencrypted passwords through the network, which may present a security problem. Here are the main issues, and some solutions.

  1. The passwords on the command line are visible using ps. If this is a problem, avoid putting passwords on the command line; for example, you can use `.netrc' for this.
  2. With the insecure basic authentication scheme, unencrypted passwords are transmitted through the network routers and gateways.
  3. The FTP passwords are also in no way encrypted. There is no good solution for this at the moment.
  4. Although the "normal" output of Wget tries to hide the passwords, debugging logs show them, in all forms. This problem is avoided by being careful when you send debug logs (yes, even when you send them to me).
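To see why basic authentication counts as unencrypted, note that the credentials are only Base64-encoded, which anyone who sees the header can reverse. The Python sketch below illustrates this; the function names are hypothetical, and the user name and password are sample values.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP `Authorization: Basic ...' header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def recover_credentials(header_value):
    """Reverse the encoding, as any observer on the network path could."""
    _scheme, token = header_value.split(" ", 1)
    decoded = base64.b64decode(token).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(recover_credentials(header))
```

No key is involved at any point, so the encoding offers no protection against a gateway or router that records the traffic.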

Contributors

GNU Wget was written by Hrvoje Nikšić <hniksic@srce.hr>. However, its development could never have gone as far as it has, were it not for the help of many people, either with bug reports, feature proposals, patches, or letters saying "Thanks!".

Special thanks goes to the following people (no particular order):

The following people have provided patches, bug/build reports, useful suggestions, beta testing services, fan mail and all the other things that make maintenance so much fun:

Tim Adam, Martin Baehr, Dieter Baron, Roger Beeman and the Gurus at Cisco, Mark Boyns, John Burden, Wanderlei Cavassin, Gilles Cedoc, Noel Cragg, Kristijan Čonkaš, Damir Džeko, Andrew Davison, Ulrich Drepper, Marc Duponcheel, Aleksandar Erkalović, Andy Eskilsson, Masashi Fujita, Marcel Gerrits, Karl Heuer, Gregor Hoffleit, Erik Magnus Hulthen, Richard Huveneers, Simon Josefsson, Mario Jurić, Goran Kezunović, Robert Kleine, Fila Kolodny, Martin Kraemer, Tage Stabell-Kulo, Hrvoje Lacko, Jordan Mendelson, Charlie Negyesi, Francois Pinard, Andrew Pollock, Steve Pothier, Marin Purgar, Jan Prikryl, Keith Refson, Tobias Ringstrom, Robert Schmidt, Sven Sternberger, Markus Strasser, Mike Thomas, Russell Vincent, Douglas E. Wegscheid, Jasmin Zainul, Bojan Ždrnja, Kristijan Zimmer.

Thanks everyone; I couldn't have done it without you. Apologies to all whom I accidentally left out. Also thanks to all the subscribers of the Wget mailing list.

