To mirror a site with wget, you need the session cookies and you also need to exclude the /logout/ directory.
wget can read a cookies.txt file, but there is no really easy way to export one from Firefox. The workaround is to copy the cookie out of Firefox's preferences; the cookie to look for is the site's session key, in this case e3scenterSessionKey.
Then create a cookies.txt file in the Netscape cookie format, with tab-separated fields, for example:
.www.e3s-center.org TRUE / FALSE 1603937999 e3scenterSessionKey be3b829af7acf08418cfcf71fbb8d92d
Note that the 5th field is the expiration time (a Unix epoch timestamp), which should be set far in the future.
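As a sketch of the above, the cookie line can be written with printf so the seven fields are genuinely tab-separated; the domain, cookie name, and value are the ones from the example, and the expiration is computed ten years out with GNU date (an assumption about the platform; BSD/macOS date uses different flags):

```shell
# far-future expiration as a Unix epoch (GNU date syntax)
EXPIRES=$(date -d '+10 years' +%s)

# Netscape cookie-file fields: domain, include-subdomains flag, path,
# secure flag, expiration, name, value -- separated by real TAB characters
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  '.www.e3s-center.org' 'TRUE' '/' 'FALSE' "$EXPIRES" \
  'e3scenterSessionKey' 'be3b829af7acf08418cfcf71fbb8d92d' > cookies.txt
```

If the fields are separated by spaces instead of tabs, wget will silently ignore the cookie, so writing the line programmatically is safer than pasting it by hand.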
Then run the following command on moog:
wget -X '/logout/' --load-cookies cookies.txt -m https://www.e3s-center.org >& e3s.out &
Look in /export/home1/tmp/php.err for lines like:
[28-Aug-2017 16:09:51] [client 128.32.48.150] pubs.php: download pub html_refuse: , E3S_Apr72011_XZhao&delAlamo.pdf, only logged in users
In the above, 128.32.48.150 is moog's address.
Here, the problem was that the /logout/ link was hit.
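To see which files the server refused, the refusal lines can be filtered out of the log. A minimal sketch, using a local copy of a log line in the format shown above (the real file is /export/home1/tmp/php.err, and the grep/sed pattern assumes the log keeps exactly that "html_refuse: , filename," layout):

```shell
# sample log line copied from the format above, for illustration
LOG=php.err.sample
cat > "$LOG" <<'EOF'
[28-Aug-2017 16:09:51] [client 128.32.48.150] pubs.php: download pub html_refuse: , E3S_Apr72011_XZhao&delAlamo.pdf, only logged in users
EOF

# pull out just the refused filenames
grep -o 'html_refuse: , [^,]*' "$LOG" | sed 's/^html_refuse: , //'
```

If this prints anything after a mirror run, the session cookie was missing or had been invalidated (for example, by crawling /logout/).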
export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="www.example.com"
export SPECIFIC_HOSTNAMES_TO_INCLUDE="example1.com,example2.com,images.example2.com"
export FILES_AND_PATHS_TO_EXCLUDE="/path/to/ignore"
export WARC_NAME="example.com-20130810-panicgrab"
(use this for grabbing single domain names:)
wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"
(use this for grabbing single domain names recursively, and have the spider follow links up to 10 levels deep:)
wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"
(use this for grabbing single domain names recursively, and have the spider follow links up to 20 levels deep:)
wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=20 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"
(use this for grabbing single domain names recursively, and have the spider follow links up to 10 levels deep, but EXCLUDE a certain file or path:)
wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" -X "$FILES_AND_PATHS_TO_EXCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"
(use this for grabbing single domain names recursively, and have the spider follow links up to 10 levels deep, but do NOT crawl upwards and grab stuff from the parent directory:)
wget -e robots=off --mirror --page-requisites --save-headers --keep-session-cookies --save-cookies "cookies.txt" --recursive --level=10 --no-parent --wait 2 --waitretry 3 --timeout 60 --tries 3 --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" -U "$USER_AGENT" "$DOMAIN_NAME_TO_SAVE"
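Every variant above depends on the exported variables, and a crawl that runs for hours with an empty $WARC_NAME or $USER_AGENT is annoying to redo. A small guard function (a hypothetical helper, using bash-specific indirect expansion) can catch a forgotten export before starting:

```shell
# guard: verify the variables the wget commands above rely on are all set
# ($FILES_AND_PATHS_TO_EXCLUDE is only needed for the -X variant, so it
# is not checked here); call check_vars before kicking off a long crawl
check_vars() {
  local v missing=0
  for v in USER_AGENT DOMAIN_NAME_TO_SAVE SPECIFIC_HOSTNAMES_TO_INCLUDE WARC_NAME; do
    if [ -z "${!v}" ]; then
      echo "missing: $v" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use: `check_vars || exit 1` at the top of whatever wrapper script invokes wget.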
Note that all of these commands explicitly ignore the website's robots.txt file; the ethics of doing so are left to your discretion.