Task

The task here is to archive the publications listed on the sites below:

Site	Archive	Count	Status
http://www.musyc.org/pubs	archive.org	about 472 pubs, few of which would be public	In progress 10/20/2017
https://ptolemy.eecs.berkeley.edu/publications/	archive.org	Hundreds of pubs, many of which are on the sites below	Done 10/20/2017
https://ptolemy.eecs.berkeley.edu/presentations/	archive.org	A hundred?	Done 10/20/2017
https://chess.eecs.berkeley.edu/pubs	archive.org	about 1200 pubs, most of which would be public, 801 pdfs in /home/www/wwwdata/chess.eecs.berkeley.edu	Done 10/20/2017
https://www.icyphy.org/pubs	archive.org	tens of pubs, most of which would not be public	Done 10/20/2017
https://www.terraswarm.org/pubs	archive.org	about 1000 pubs, not all of which should be public, 964 pdfs in /home/www/wwwdata/terraswarm.org	Done 10/21/2017
https://www.truststc.org/pubs	archive.org	910 pubs, not all of which would be public	Done 10/21/2017
https://www2.eecs.berkeley.edu/Pubs/TechRpts/	archive.org	Some of the above refer to pdfs on this site, only need to add 2016 and 2017	Done 10/21/2017

We could get a better estimate by mirroring the sites, but a rough estimate would be 3000 pdfs.
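A minimal sketch of how that better estimate could be obtained, assuming a site has already been mirrored into a local directory. The helper name `count_pdfs` and the throwaway demo directory are illustrative and not part of the original notes.

```shell
# Count the PDF files under a mirrored site directory.
# count_pdfs is a hypothetical helper; pass it the top of a local mirror.
count_pdfs() {
  find "$1" -type f -iname '*.pdf' | wc -l | tr -d ' '
}

# Demo against a temporary directory so the sketch runs without a real mirror.
demo=$(mktemp -d)
mkdir -p "$demo/pubs/1"
touch "$demo/pubs/1/a.pdf" "$demo/pubs/1/b.PDF" "$demo/pubs/1/index.html"
count_pdfs "$demo"   # prints 2
rm -rf "$demo"
```

Running `count_pdfs` on each mirrored host directory and summing the results would replace the rough figure of 3000.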

archive.org

https://archive.org has various snapshots of the target sites, but does not have complete copies.

To verify this:

Go to https://web.archive.org/web/*/https://ptolemy.eecs.berkeley.edu/publications
Start browsing pubs and try to download publications
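The same check could probably be scripted with the Wayback Machine availability API instead of browsing by hand. The sketch below only builds the query URL. The helper names `availability_url` and `check_archived` are made up here, and the URL is passed through unencoded, which is an assumption that only holds for simple URLs without query strings.

```shell
# Build a query URL for the Wayback Machine availability API.
# availability_url is a hypothetical helper; the URL is not percent-encoded.
availability_url() {
  printf 'https://archive.org/wayback/available?url=%s\n' "$1"
}

# Query the API and print the JSON reply (needs network access and curl).
# Defined for illustration only; it is not called in this sketch.
check_archived() {
  curl -s "$(availability_url "$1")"
}

availability_url 'https://ptolemy.eecs.berkeley.edu/publications'
```

A reply whose `archived_snapshots` object is empty would suggest that no snapshot exists for that URL.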

Uploading to archive.org

https://archive.org/about/faqs.php#1 says:

How can I get my site included in the Wayback Machine?

Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Neither organization has a "crawl my site now!" submission process. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.

Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.

Regardless of who is crawling the site, you should ensure that your site's 'robots.txt' rules and in-page META robots directives do not tell crawlers to avoid your site.
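As a rough illustration of the robots.txt advice, the sketch below looks for a blanket `Disallow: /` rule in a robots.txt file read from standard input. The helper name `has_blanket_disallow` is hypothetical, and the check is deliberately simplistic: it ignores `User-agent` grouping and assumes Unix line endings.

```shell
# Succeed if the robots.txt on standard input contains a blanket "Disallow: /".
# has_blanket_disallow is a hypothetical helper and ignores User-agent sections.
has_blanket_disallow() {
  grep -Eiq '^[[:space:]]*Disallow:[[:space:]]*/[[:space:]]*$'
}

# Two inline samples instead of a real download.
printf 'User-agent: *\nDisallow: /\n' | has_blanket_disallow && echo "blocks crawlers"
printf 'User-agent: *\nDisallow: /private/\n' | has_blanket_disallow || echo "no blanket block"
```

For a real site, the file could be fetched first and piped into the function.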

Manual Updating of archive.org

If, while browsing a site on archive.org, the page is not archived, it is possible to upload the link:

One possibility would be to go through the sites and do this by hand for 3000 pdfs. Assuming 6 pubs a minute, this would be just over 8 hours.

Or, we could mirror the public sites, get a list of pdfs that should be archived and create URLs like

https://web.archive.org/save/https://chess.eecs.berkeley.edu/pubs/1187/KimEtAl_SST_IoTDI2017.pdf

and then go to those pages.
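A sketch of that second approach, assuming the mirror was made with `wget -r` so that the files sit in a directory named after the host. The helper name `pdf_save_urls` and the demo directory are illustrative only.

```shell
# List the PDFs in a wget mirror and print a Wayback "save" URL for each.
# pdf_save_urls is a hypothetical helper; run it from the directory that
# contains the host directory created by wget -r.
pdf_save_urls() {
  scheme=$1   # http or https, as used by the original site
  host=$2     # host directory, for example chess.eecs.berkeley.edu
  find "$host" -type f -name '*.pdf' |
    sed "s|^|https://web.archive.org/save/$scheme://|"
}

# Demo with a throwaway mirror layout so no real download is needed.
demo=$(mktemp -d)
mkdir -p "$demo/chess.eecs.berkeley.edu/pubs/1187"
touch "$demo/chess.eecs.berkeley.edu/pubs/1187/KimEtAl_SST_IoTDI2017.pdf"
(cd "$demo" && pdf_save_urls https chess.eecs.berkeley.edu)
rm -rf "$demo"
```

The demo prints the same save URL as the example above. This variant only finds files that were actually saved to disk, so directory-style pages would still need the log-based approach.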

Archive a site:

  wget -r -N -l inf --no-remove-listing -np https://www.terraswarm.org/pubs/ &>terra.out

Create a script to save pages. We take the URLs from the wget log rather than from the mirrored files, because some pages on the sites (the FAQs, for example) are directories that wget saves as index.html.

  egrep -e '^--2017.* https://www.terraswarm.org' terra.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > terra2.out

Then run the script:

  nohup sh terra2.out >& terra3.out &
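To see what the generated script looks like before running it, the filter and awk step can be tried on a sample log line. This is only a sketch: the helper name `make_save_cmds`, the timestamp and the URL are made up, and the random `sleep` lines are left out so the output is reproducible.

```shell
# Turn wget log lines into "wget '<save URL>'" commands.
# make_save_cmds is a hypothetical helper; it reads a wget log on stdin.
make_save_cmds() {
  grep -E '^--2017.* https://www.terraswarm.org' |
    awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39}'
}

# Sample input: one request line (kept) and one status line (dropped).
printf '%s\n' \
  '--2017-10-21 09:15:02--  https://www.terraswarm.org/pubs/faq.html' \
  'Resolving www.terraswarm.org... done.' | make_save_cmds
# prints: wget 'https://web.archive.org/save/https://www.terraswarm.org/pubs/faq.html'
```

The `39` arguments are the character code for a single quote, which keeps each URL quoted in the generated script.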

MuSyC

wget -r -N -l inf --no-remove-listing -np http://www.musyc.org/pubs/ >& musyc.out
egrep -e '^--2017.* http://www.musyc.org' musyc.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > musyc2.out


PtolemyPubs

Need to do presentations, too. Some of the pub directories are referred to with URLs that end in /, but the index.html file does not exist.

So, we search the output of wget for URLs.

wget -r -N -l inf --no-remove-listing -np https://ptolemy.eecs.berkeley.edu/publications/ &>ptolemy.out
egrep -e '^--2017.* https://ptolemy.eecs.berkeley.edu/publications' ptolemy.out | awk '{print $3}' | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $1, 39; print "sleep " int (rand() * 10)}' > p2
sh p2


TerraSwarm

wget -r -N -l inf --no-remove-listing -np https://www.terraswarm.org/pubs/ >& terra.out
egrep -e '^--2017.* https://www.terraswarm.org' terra.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > terra2.out
nohup sh terra2.out >& terra3.out &


truststc.org

wget -r -N -l inf --no-remove-listing -np https://www.truststc.org/pubs/ >& trust.out
egrep -e '^--2017.* https://www.truststc.org' trust.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > trust2.out


www2.eecs

Just get 2016 and 2017, the other years seem fine.

wget -r -N -l inf --no-remove-listing -np https://www2.eecs.berkeley.edu/Pubs/TechRpts/ >& eecs.out
egrep -e '^--2017.* https://www2.eecs.berkeley.edu/Pubs/TechRpts/201[67]/' eecs.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > eecs2.out


Wayback Machine Resources

https://web.archive.org/save/https://www.icyphy.org/pubs/53.htm - How to mirror the sites
https://github.com/buren/wayback_archiver - "Ruby gem to send URLs to Wayback Machine"
https://github.com/scribblemaniac/wayback-save - "Wayback Save is simple command-line utility save URLs to the Wayback Machine."

Archive-it

https://archive-it.org/ is a for-fee subscription service that will create collections of web pages.

This seems like overkill for our efforts.

Open Access/eScholarship

https://osc.universityofcalifornia.edu/open-access-policy/

"The Academic Senate of the University of California adopted an Open Access Policy on July 24, 2013, ensuring that future research articles authored by faculty at all 10 campuses of UC will be made available to the public at no charge. A precursor to this policy was adopted by the UCSF Academic Senate on May 21, 2012."

"On October 23, 2015, a Presidential Open Access Policy expanded open access rights and responsibilities to all other authors who write scholarly articles while employed at UC, including non-senate researchers, lecturers, post-doctoral scholars, administrative staff, librarians, and graduate students."

There are two portals:

"Senate faculty"

"Senate faculty are contacted via email to verify their list of articles within UC’s new publication management system and to upload a copy or provide a link to an open access version of their publications. Faculty can also log in to the system at any time by visiting oapolicy.universityofcalifornia.edu."

"Non-senate employees"

"Deposit for these authors is managed through eScholarship, UC’s open access repository and publishing platform. Select your campus below to begin the deposit process – or to notify us that your publication is already available in another open access repository. (Note: you will be asked to log in. If you don’t yet have an eScholarship account, you will have the opportunity to create one.)"

Delegation: Can I delegate someone else to manage my publications?

"You can delegate someone else to manage your publications by filling out the UC Publications Management delegation form."

Uploading each publication by hand would be very labor intensive.
