ArchivingPublications

Task

The task here is to archive the publications listed on the following sites:

Site | Archive | Count | Status
http://www.musyc.org/pubs | archive.org | about 472 pubs, few of which would be public | In progress 10/20/2017
https://ptolemy.eecs.berkeley.edu/publications/ | archive.org | Hundreds of pubs, many of which are on the sites below | Done 10/20/2017
https://ptolemy.eecs.berkeley.edu/presentations/ | archive.org | A hundred? | Done 10/20/2017
https://chess.eecs.berkeley.edu/pubs | archive.org | about 1200 pubs, most of which would be public, 801 pdfs in /home/www/wwwdata/chess.eecs.berkeley.edu | Done 10/20/2017
https://www.icyphy.org/pubs | archive.org | tens of pubs, most of which would not be public | Done 10/20/2017
https://www.terraswarm.org/pubs | archive.org | about 1000 pubs, not all of which should be public, 964 pdfs in /home/www/wwwdata/terraswarm.org | Done 10/21/2017
https://www.truststc.org/pubs | archive.org | 910 pubs, not all of which would be public | Done 10/21/2017
https://www2.eecs.berkeley.edu/Pubs/TechRpts/ | archive.org | Some of the above refer to pdfs on this site, only need to add 2016 and 2017 | Done 10/21/2017

We could get a better estimate by mirroring the sites, but a rough estimate would be 3000 pdfs.
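
Once a site has been mirrored, the count can be taken directly from the mirror. This is a minimal sketch; the function name count_pdfs is made up for illustration, and it assumes the mirror is a local directory tree such as the one wget -r creates.

```shell
# Count the PDFs under a mirrored directory tree.
# $1: top of the mirror, for example the www.terraswarm.org directory
#     that a recursive wget creates (assumed layout).
# -iname matches .pdf and .PDF alike (supported by GNU and BSD find).
count_pdfs() {
  find "$1" -type f -iname '*.pdf' | wc -l | tr -d ' '
}
```

For example, count_pdfs www.terraswarm.org prints the number of PDFs in that mirror.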

archive.org

https://archive.org has various snapshots of the target sites, but does not have complete copies.

To verify this:

  1. Go to https://web.archive.org/web/*/https://ptolemy.eecs.berkeley.edu/publications
  2. Start browsing pubs and try to download publications
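
A scripted spot check is also possible. The sketch below assumes the Wayback Machine availability endpoint at https://archive.org/wayback/available and an installed curl; neither is mentioned above. The function names are made up for illustration, and the "available" field name reflects the JSON that endpoint returns as far as we know.

```shell
# Succeeds if the availability JSON on stdin reports an archived snapshot.
has_snapshot() {
  grep -q '"available": *true'
}

# Ask the availability endpoint about one URL (needs network access and curl).
# $1: the original URL, for example a PDF on one of the sites in the table.
check_archived() {
  curl -s -G 'https://archive.org/wayback/available' \
       --data-urlencode "url=$1" | has_snapshot
}
```

A call such as check_archived https://ptolemy.eecs.berkeley.edu/publications && echo archived || echo missing would then report the status of one URL.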

Uploading to archive.org

https://archive.org/about/faqs.php#1 says:

How can I get my site included in the Wayback Machine?
Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Neither organization has a "crawl my site now!" submission process. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.
Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.
Regardless of who is crawling the site, you should ensure that your site's 'robots.txt' rules and in-page META robots directives do not tell crawlers to avoid your site.

See also If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine (2017-01-25)

Manual Updating of archive.org

If a page turns out not to be archived while browsing a site on archive.org, it is possible to submit its link for archiving.

One possibility would be to go through the sites and do this by hand for 3000 pdfs. Assuming 6 pubs a minute, this would be just over 8 hours.
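
The arithmetic behind that estimate, as a quick check. Both inputs are the rough guesses from the text, not measurements.

```shell
pdfs=3000       # rough estimate of the total number of pdfs
per_minute=6    # assumed rate of saving pubs by hand
minutes=$((pdfs / per_minute))
echo "$((minutes / 60)) hours $((minutes % 60)) minutes"   # prints "8 hours 20 minutes"
```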

Or, we could mirror the public sites, get a list of pdfs that should be archived and create URLs like

https://web.archive.org/save/https://chess.eecs.berkeley.edu/pubs/1187/KimEtAl_SST_IoTDI2017.pdf

and then go to those pages.
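
A minimal sketch of building that list from a mirror. It assumes the mirror was made with a recursive wget using its default layout, so the top-level directory is named after the host; the function name save_urls_for_pdfs is hypothetical.

```shell
# Print one Wayback Machine save URL per PDF found under a wget mirror.
# $1: scheme of the original site (http or https)
# $2: mirror directory named after the host; run this from the directory
#     that contains it and pass the bare host name
save_urls_for_pdfs() {
  find "$2" -type f -iname '*.pdf' | sort |
    sed "s|^|https://web.archive.org/save/$1://|"
}
```

For example, save_urls_for_pdfs https chess.eecs.berkeley.edu prints URLs of the form shown above, one per line.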

Archive a site:

  wget -r -N -l inf --no-remove-listing -np https://www.terraswarm.org/pubs/ &>terra.out

Create a script to save the pages. We take the URLs from the wget log rather than from the mirrored file names, because the faqs on the sites are directories that wget saves as index.html.

  egrep -e '^--2017.* https://www.terraswarm.org' terra.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > terra2.out

Then run the script:

  nohup sh terra2.out >& terra3.out &
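
The log-to-script step can also be wrapped in a function so that only the URL prefix changes from site to site. This is a sketch, not the command that was actually run: the name log_to_save_script is made up, it accepts log lines from any year rather than only 2017, it matches the prefix literally instead of as a regular expression, and it skips URLs that appear in the log more than once.

```shell
# Turn a wget log on stdin into a script of Wayback Machine save requests.
# $1: URL prefix to keep, for example https://www.terraswarm.org
# Request lines in the wget log look like
#   --2017-10-21 09:00:00--  https://www.terraswarm.org/pubs/
# so the URL is the last field. Each URL is emitted once, followed by
# a sleep of 0 to 9 seconds.
log_to_save_script() {
  awk -v prefix="$1" '
    /^--[0-9][0-9][0-9][0-9]-/ && index($NF, prefix) == 1 && !seen[$NF]++ {
      printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39
      print "sleep " int(rand() * 10)
    }'
}
```

Usage would be along the lines of log_to_save_script https://www.terraswarm.org < terra.out > terra2.out.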

MuSyC

  wget -r -N -l inf --no-remove-listing -np http://www.musyc.org/pubs/ >& musyc.out
  egrep -e '^--2017.* http://www.musyc.org' musyc.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > musyc2.out

PtolemyPubs

Need to do presentations, too. Some of the pub directories are referred to with URLs that end in /, but the index.html file does not exist.

So, we search the output of wget for URLs.

  wget -r -N -l inf --no-remove-listing -np https://ptolemy.eecs.berkeley.edu/publications/ &>ptolemy.out
  egrep -e '^--2017.* https://ptolemy.eecs.berkeley.edu/publications' ptolemy.out | awk '{print $3}' | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $1, 39; print "sleep " int (rand() * 10)}' > p2
  sh p2

TerraSwarm

  wget -r -N -l inf --no-remove-listing -np https://www.terraswarm.org/pubs/ >& terra.out
  egrep -e '^--2017.* https://www.terraswarm.org' terra.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > terra2.out
  nohup sh terra2.out >& terra3.out &

truststc.org

  wget -r -N -l inf --no-remove-listing -np https://www.truststc.org/pubs/ >& trust.out
  egrep -e '^--2017.* https://www.truststc.org' trust.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > trust2.out

www2.eecs

Just get 2016 and 2017; the other years seem fine.

  wget -r -N -l inf --no-remove-listing -np https://www2.eecs.berkeley.edu/Pubs/TechRpts/ >& eecs.out
  egrep -e '^--2017.* https://www2.eecs.berkeley.edu/Pubs/TechRpts/201[67]/' eecs.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int (rand() * 10)}' > eecs2.out

Wayback Machine Resources

Archive-it

https://archive-it.org/ is a for-fee subscription service that will create collections of web pages.

  • This seems like overkill for our efforts.

Open Access/eScholarship

"The Academic Senate of the University of California adopted an Open Access Policy on July 24, 2013, ensuring that future research articles authored by faculty at all 10 campuses of UC will be made available to the public at no charge. A precursor to this policy was adopted by the UCSF Academic Senate on May 21, 2012."
"On October 23, 2015, a Presidential Open Access Policy expanded open access rights and responsibilities to all other authors who write scholarly articles while employed at UC, including non-senate researchers, lecturers, post-doctoral scholars, administrative staff, librarians, and graduate students."

There are two portals:

"Senate faculty"
"Senate faculty are contacted via email to verify their list of articles within UC’s new publication management system and to upload a copy or provide a link to an open access version of their publications. Faculty can also log in to the system at any time by visiting oapolicy.universityofcalifornia.edu."
"Non-senate employees"
"Deposit for these authors is managed through eScholarship, UC’s open access repository and publishing platform. Select your campus below to begin the deposit process – or to notify us that your publication is already available in another open access repository. (Note: you will be asked to log in. If you don’t yet have an eScholarship account, you will have the opportunity to create one.)"

Delegation: Can I delegate someone else to manage my publications?

"You can delegate someone else to manage your publications by filling out the UC Publications Management delegation form."

Uploading each publication by hand would be very labor intensive.

Page last modified on October 22, 2017, at 12:45 am