ArchivingPublicationsTask

The task here is to archive the publications listed on the project publication sites covered in the sections below.
We could get a better estimate by mirroring the sites, but a rough estimate is 3000 PDFs.

archive.org (https://archive.org) has various snapshots of the target sites, but does not have complete copies. To verify this:
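A hedged sketch of one way to do this check (the Wayback Machine availability API and the example paths below are not from the original notes; wget saves a mirror under a directory named after the host):

# Rough count of PDFs in a local mirror of one site:
find www.terraswarm.org -name '*.pdf' | wc -l

# Ask the Wayback Machine whether a given publication URL already has a snapshot;
# an empty "archived_snapshots" object in the JSON reply means it is not archived yet:
curl -s 'https://archive.org/wayback/available?url=https://www.terraswarm.org/pubs/'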
Uploading to archive.org

https://archive.org/about/faqs.php#1 answers the question "How can I get my site included in the Wayback Machine?":
Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Neither organization has a "crawl my site now!" submission process. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.
Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.
Regardless of who is crawling the site, you should ensure that your site's 'robots.txt' rules and in-page META robots directives do not tell crawlers to avoid your site.
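As a quick, hedged aside (not part of the original notes), the robots rules for one of the target sites can be checked from the command line; the TerraSwarm URLs are just examples:

# Check the site-wide robots.txt for Disallow rules that would block crawlers:
curl -s https://www.terraswarm.org/robots.txt

# Rough check of a publications page for in-page META robots directives (e.g. noindex, noarchive):
curl -s https://www.terraswarm.org/pubs/ | grep -i 'meta name="robots"'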
See also: "If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine" (2017-01-25).

Manual Updating of archive.org

If, while browsing a site on archive.org, a page is not archived, it is possible to submit the link to be saved.

One possibility would be to go through the sites and do this by hand for 3000 PDFs. Assuming 6 pubs a minute, this would be just over 8 hours. Or, we could mirror the public sites, get a list of PDFs that should be archived, create URLs like https://web.archive.org/save/https://chess.eecs.berkeley.edu/pubs/1187/KimEtAl_SST_IoTDI2017.pdf, and then go to those pages.

Archive a site:

wget -r -N -l inf --no-remove-listing -np https://www.terraswarm.org/pubs/ &> terra.out

Create a script to save pages. We take the URLs from the wget output because the pub entries on the sites are directories, which wget saves locally as index.html. (The 39 arguments to printf print ASCII single quotes around the saved URL.)

egrep -e '^--2017.* https://www.terraswarm.org' terra.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int(rand() * 10)}' > terra2.out

Then run the script:

nohup sh terra2.out >& terra3.out

MuSyC

wget -r -N -l inf --no-remove-listing -np http://www.musyc.org/pubs/ >& musyc.out

egrep -e '^--2017.* http://www.musyc.org' musyc.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int(rand() * 10)}' > musyc2.out
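The same mirror / extract / save pattern repeats for every site in the sections below, so as a hedged sketch (not something the original notes contain) it could be wrapped in a small helper; the script name, its arguments, and the hard-coded 2017 prefix in the grep pattern are assumptions carried over from the commands above:

#!/bin/sh
# save_site.sh (hypothetical helper): mirror a publications directory, then
# generate a script of https://web.archive.org/save/ requests for every URL
# that wget fetched.
# Usage: sh save_site.sh https://www.terraswarm.org/pubs/ terra
site="$1"    # publications URL to mirror
prefix="$2"  # prefix for the wget log and the generated save script

wget -r -N -l inf --no-remove-listing -np "$site" > "$prefix.out" 2>&1

# Pull the fetched URLs out of the wget log and wrap each one in a
# web.archive.org/save request, sleeping a random 0-9 seconds between requests.
egrep -e "^--2017.* $site" "$prefix.out" | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int(rand() * 10)}' > "$prefix-save.sh"

echo "Now run: nohup sh $prefix-save.sh >& $prefix-save.out &"

Each per-site block below corresponds to one invocation of this helper.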
PtolemyPubs

Need to do presentations, too.

Some of the pub directories are referred to with URLs that end in a slash, so we search the output of wget for the URLs.

wget -r -N -l inf --no-remove-listing -np https://ptolemy.eecs.berkeley.edu/publications/ &> ptolemy.out

egrep -e '^--2017.* https://ptolemy.eecs.berkeley.edu/publications' ptolemy.out | awk '{print $3}' | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $1, 39; print "sleep " int(rand() * 10)}' > p2

sh p2

TerraSwarm

wget -r -N -l inf --no-remove-listing -np https://www.terraswarm.org/pubs/ >& terra.out

egrep -e '^--2017.* https://www.terraswarm.org' terra.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int(rand() * 10)}' > terra2.out

nohup sh terra2.out >& terra3.out &

truststc.org

wget -r -N -l inf --no-remove-listing -np https://www.truststc.org/pubs/ >& trust.out

egrep -e '^--2017.* https://www.truststc.org' trust.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int(rand() * 10)}' > trust2.out

www2.eecs

Just get 2016 and 2017; the other years seem fine.

wget -r -N -l inf --no-remove-listing -np https://www2.eecs.berkeley.edu/Pubs/TechRpts/ >& eecs.out

egrep -e '^--2017.* https://www2.eecs.berkeley.edu/Pubs/TechRpts/201[67]/' eecs.out | awk '{printf "wget %chttps://web.archive.org/save/%s%c\n", 39, $NF, 39; print "sleep " int(rand() * 10)}' > eecs2.out
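After one of the generated scripts has run, a rough success check can be made from its nohup log; this is a hedged sketch that assumes wget's usual status lines and the terra3.out log name used above:

# Count save requests that were answered with HTTP 200:
grep -c '200 OK' terra3.out

# List any responses other than 200 (e.g. 403s or throttling errors):
grep 'HTTP request sent' terra3.out | grep -v '200 OK'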
Wayback Machine Resources

Archive-it (https://archive-it.org/) is a for-fee subscription service that will create collections of web pages.
Open Access/eScholarship"The Academic Senate of the University of California adopted an Open Access Policy on July 24, 2013, ensuring that future research articles authored by faculty at all 10 campuses of UC will be made available to the public at no charge. A precursor to this policy was adopted by the UCSF Academic Senate on May 21, 2012."
"On October 23, 2015, a Presidential Open Access Policy expanded open access rights and responsibilities to all other authors who write scholarly articles while employed at UC, including non-senate researchers, lecturers, post-doctoral scholars, administrative staff, librarians, and graduate students."
There are two portals:

"Senate faculty"
"Senate faculty are contacted via email to verify their list of articles within UC’s new publication management system and to upload a copy or provide a link to an open access version of their publications. Faculty can also log in to the system at any time by visiting oapolicy.universityofcalifornia.edu."
"Non-senate employees"
"Deposit for these authors is managed through eScholarship, UC’s open access repository and publishing platform. Select your campus below to begin the deposit process – or to notify us that your publication is already available in another open access repository. (Note: you will be asked to log in. If you don’t yet have an eScholarship account, you will have the opportunity to create one.)"
Delegation: Can I delegate someone else to manage my publications?

"You can delegate someone else to manage your publications by filling out the UC Publications Management delegation form."
Uploading each publication by hand would be very labor intensive.