Published March 6, 2024 | Version v2
Software Open

Code from "Measuring data rot: an analysis of the continued availability of shared data from a single university"

  • California Institute of Technology

Description

R scripts and files from the article "Measuring Data Rot: An Analysis of the Continued Availability of Shared Data from a Single University" by Kristin Briney.

This research collected supplemental data links from publications in CaltechAUTHORS and tested whether each link was still available on the web, using web scraping and manual testing in the Chrome browser.
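As a rough illustration of the automated part of that availability test, the sketch below requests only the headers for a URL and sorts the result into a coarse outcome. It is not the code from 4-LinkDecay.R: the function names `classify_status` and `check_link`, the outcome labels, and the choice of base R's `curlGetHeaders()` are all assumptions made for this example.

```r
# Classify an HTTP status code into a coarse availability outcome.
# The three labels are hypothetical, chosen for this sketch only.
classify_status <- function(status) {
  if (is.null(status) || is.na(status)) {
    "error"      # the request failed before any status came back
  } else if (status >= 200 && status < 400) {
    "resolves"   # success, or a redirect that was followed
  } else {
    "broken"     # client or server error, e.g. 404 or 500
  }
}

# Request only the headers for a URL and classify the final status.
# curlGetHeaders() is part of base R and follows redirects by default.
check_link <- function(url) {
  status <- tryCatch(
    attr(curlGetHeaders(url), "status"),
    error = function(e) NA_integer_
  )
  classify_status(status)
}

# Usage (needs network access):
# check_link("https://example.org/")
```

A successful status only shows that a page answered, not that the data is still there, which is one reason a manual check in a browser may still be needed for some links.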

Project file:

  • SupplementaryData.Rproj: R project covering all of the R scripts.

Data input:

  • supp-data.csv: raw data from CaltechAUTHORS, downloaded in May 2023.

Scripts, to be run in order:

  • 1-DataParsing.R: Cleans up the input data for processing and calculates some summary statistics.
  • 2-DOIScraping.R: Scrapes and processes DOI prefixes.
  • 3-DOIAnalysis.R: Adds DOI prefix information back into the larger dataset.
  • 4-LinkDecay.R: Scrapes URLs and DOIs to test whether the data is still available and writes out the results.
  • 5-dataRot.R: Measures error rate (based on sampling), creates figures, and fits a Poisson regression to data loss over time.
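The final step mentions a Poisson regression of data loss over time. A minimal sketch of that kind of fit is shown below. The data frame is synthetic and purely illustrative (it does not contain values from the study), and the column names and the use of an offset are assumptions, not details confirmed by the record.

```r
# Synthetic, illustrative counts only; these are NOT values from the study.
toy <- data.frame(
  age_years = 1:10,                                  # years since publication
  n_links   = rep(100, 10),                          # links tested per age group
  n_broken  = c(1, 2, 2, 4, 5, 7, 8, 11, 13, 17)     # links that no longer resolve
)

# Poisson regression of broken-link counts on age. The offset makes the
# model describe a loss rate per link rather than a raw count.
fit <- glm(
  n_broken ~ age_years + offset(log(n_links)),
  family = poisson(link = "log"),
  data   = toy
)

# exp(coefficient) is the multiplicative change in the loss rate per year.
rate_ratio <- exp(coef(fit)[["age_years"]])
summary(fit)
```

With a log link, a rate ratio above 1 means the estimated loss rate grows with the age of the link.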

Data outputs:

  • 4-linkDecay.csv: Full dataset of supplemental data links, including the outcomes of web scraping.
  • 4-linksToCheck.csv: Information on every supplemental data link that failed to scrape or could not be web scraped, that is, every link that needed to be checked by hand.
  • 5-resolves.csv: Final dataset recording whether each link resolves, which was used for generating figures and fits.
  • 5-sampling.csv: Results of comparing a sample from CaltechAUTHORS versus journal articles to check for errors in recording supplemental data links in CaltechAUTHORS.

DOI files:

  • doi-data.zip: contains information harvested from DataCite and CrossRef when looking up DOI prefixes.
  • 2-DOI_APIs.txt: not used for computation, but contains information on the DataCite and CrossRef APIs.
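Looking up DOI prefixes first requires pulling the prefix out of each link. The sketch below shows one way that could be done with a regular expression in base R. The function name `doi_prefix` and the example links are hypothetical, and the scripts in this record may approach it differently.

```r
# Extract the registrant prefix (e.g. "10.1234") from a DOI or DOI URL.
# Returns NA for strings that do not contain a DOI-like pattern.
doi_prefix <- function(x) {
  # "10." followed by 4 to 9 digits, immediately before a "/"
  hit <- regexpr("10\\.[0-9]{4,9}(?=/)", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)
  out
}

# Made-up links for illustration only.
links <- c(
  "https://doi.org/10.1234/abcd",
  "10.5678/xyz.1",
  "https://example.org/data.zip"
)

prefixes <- doi_prefix(links)

# Count how often each prefix occurs, ignoring non-DOI links.
prefix_counts <- table(prefixes, useNA = "no")
```

The distinct prefixes could then be looked up once each, rather than once per link, which keeps the number of API requests small.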

Other files:

  • Other files are labelled 1-4 at the beginning of the file name to note which part of the analysis workflow they were created during. Most capture all or parts of the data at mid-points in the analysis.
  • Files labelled with 5 at the beginning of the file name are used for final calculations and figures.
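Because the leading number in a file name marks the workflow step, files can be grouped by step with a few lines of base R. The sketch below uses a handful of file names from this record. The variable names are assumptions, and the commented loop at the end is only one possible way to run the scripts in order.

```r
# A few file names taken from the lists above.
files <- c(
  "1-DataParsing.R", "2-DOIScraping.R", "3-DOIAnalysis.R",
  "4-LinkDecay.R", "4-linkDecay.csv", "4-linksToCheck.csv",
  "5-dataRot.R", "5-resolves.csv", "5-sampling.csv",
  "supp-data.csv"
)

# Read the numeric prefix; names without one (e.g. supp-data.csv) become NA.
stage <- suppressWarnings(as.integer(sub("^([0-9]+)-.*$", "\\1", files)))

# Group files by workflow step; split() drops the NA entries.
by_stage <- split(files, stage)

# Possible way to run the numbered scripts in order from the project folder:
# for (s in sort(list.files(pattern = "^[1-5]-.*\\.R$"))) source(s)
```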

Files

README.md
Files (18.9 MB)
Checksum (size):

  • md5:8a48c9930eae6a76ee41955f4c018e55 (3.0 kB)
  • md5:e1fe1c91873ba18c5034ab6cff4f751c (277.1 kB)
  • md5:a892ee8d532384842862eb1bdc5cdd7b (17.0 MB)
  • md5:6e6f5f0a8fdfc9c14096cae521db98ed (2.3 kB)
  • md5:341bff7c21884fef50a0058a217e8dd2 (203 Bytes)
  • md5:584b2434d675b192dc1614bd62b5b93e (218 Bytes)
  • md5:66441ded92807bbe89065cade546aa2e (1.2 kB)
  • md5:06eaf49e15005bcb09029ecf47b32589 (3.1 kB)
  • md5:ac47d70b80df7d990bb165b1f3078b98 (96 Bytes)
  • md5:8c64b0793885149b54a0db0c947de091 (2.8 kB)
  • md5:82937f87006b7b85a6aefbe934cb14f2 (36.8 kB)
  • md5:57ebfe062dac448c627e1a3649de6d7c (547.8 kB)
  • md5:8eb63af8a7109520424753ea1d7e2ea8 (57.6 kB)
  • md5:e1caf1e24bcd1539622ce7c7849e3fa5 (6.9 kB)
  • md5:dac478f5a1e507f2b6bc570d16763324 (1.5 kB)
  • md5:5bb2c485f57ad0e05d13c22f13a1aaf9 (1.9 kB)
  • md5:35d608c7879123f9ff0b6c03c8434d1c (293.1 kB)
  • md5:0d2007d777527d1be4eb534f772233bc (6.8 kB)
  • md5:a435a648a2964deff5b88667ab1febbd (596.2 kB)
  • md5:1245d1e9597d16aed64fb888b7330332 (63.5 kB)
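The MD5 checksums listed above can be used to confirm that a downloaded file is intact. A minimal sketch using `tools::md5sum()`, which ships with R, follows. The helper name `verify_md5` is hypothetical, and the demonstration uses a temporary file rather than a file from this record.

```r
# Compare a file's MD5 checksum with an expected value.
# Accepts the expected value with or without the "md5:" prefix.
verify_md5 <- function(path, expected) {
  expected <- tolower(sub("^md5:", "", expected))
  actual   <- unname(tools::md5sum(path))   # NA if the file does not exist
  identical(actual, expected)
}

# Demonstration on a temporary file (not a file from this record).
tmp <- tempfile(fileext = ".txt")
writeLines("example", tmp)
known <- paste0("md5:", tools::md5sum(tmp))

verify_md5(tmp, known)                                    # matches
verify_md5(tmp, "md5:00000000000000000000000000000000")   # does not match
```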

Additional details

Created: March 6, 2024
Modified: March 6, 2024