Published March 6, 2024 | Version v2 | Software | Open
Code from "Measuring data rot: an analysis of the continued availability of shared data from a single university"
Description
R scripts and files from the article "Measuring Data Rot: An Analysis of the Continued Availability of Shared Data from a Single University" by Kristin Briney.
This research took supplemental data links from publications in CaltechAUTHORS and tested their availability on the web, using web scraping and manual testing in the Chrome browser.
Project file:
- SupplementaryData.Rproj: R project covering all of the R scripts.
Data input:
- supp-data.csv: raw data from CaltechAUTHORS, downloaded in May 2023.
Scripts, to be run in order:
- 1-DataParsing.R: Cleans up the input data for processing and calculates some summary statistics.
- 2-DOIScraping.R: Scrapes and processes DOI prefixes.
- 3-DOIAnalysis.R: Adds DOI prefix information back into the larger dataset.
- 4-LinkDecay.R: Scrapes URLs and DOIs to check whether the data are still available, and writes out the results.
- 5-dataRot.R: Measures error rate (based on sampling), creates figures, and fits a Poisson regression to data loss over time.
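The link-checking step can be illustrated with a minimal sketch. It is written in Python for illustration only; the deposited scripts are in R, and this is not the code in 4-LinkDecay.R. The function names and the outcome categories (`resolves`, `broken`, `check_by_hand`, `unreachable`) are assumptions, loosely modelled on the split between links that could be scraped and links that had to be checked by hand.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code (or None for no response) to a coarse outcome."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "resolves"
    if status in (401, 403, 429):
        # Blocked or rate-limited: a browser may still reach the data.
        return "check_by_hand"
    return "broken"


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or None if no response was received."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def check_links(urls, fetch=fetch_status):
    """Classify every URL; `fetch` can be swapped out for offline testing."""
    return {url: classify_status(fetch(url)) for url in urls}
```

Passing a stand-in for `fetch` makes it possible to exercise the classification without touching the network. Some servers reject HEAD requests, so a real run would likely need a GET fallback as well.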
Data outputs:
- 4-linkDecay.csv: Full dataset of supplemental data links, including outcomes of web scraping.
- 4-linksToCheck.csv: Information on all supplemental data links that failed to scrape or could not be web scraped, that is, everything that needed to be checked by hand.
- 5-resolves.csv: Final dataset denoting if links resolve, which was used for generating figures and fits.
- 5-sampling.csv: Results of comparing a sample from CaltechAUTHORS versus journal articles to check for errors in recording supplemental data links in CaltechAUTHORS.
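For readers unfamiliar with the Poisson regression mentioned for 5-dataRot.R, the following sketch shows what such a fit involves for a single predictor. It is a Python illustration under stated assumptions, not the author's R code: the model form log(mean count) = b0 + b1 * x, the choice of x (for example, age of the link in years) and the function name `fit_poisson` are all hypothetical.

```python
import math


def fit_poisson(xs, ys, max_iter=50, tol=1e-10):
    """Fit log(mean of y) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    if len(xs) != len(ys) or len(set(xs)) < 2:
        raise ValueError("need paired observations with at least two distinct x")
    mean_y = sum(ys) / len(ys)
    if mean_y <= 0:
        raise ValueError("counts must not all be zero")
    # Start from the intercept-only model.
    b0, b1 = math.log(mean_y), 0.0
    for _ in range(max_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum(x * (y - mu) for x, y, mu in zip(xs, ys, mus))
        # Entries of the 2x2 information matrix.
        h00 = sum(mus)
        h01 = sum(x * mu for x, mu in zip(xs, mus))
        h11 = sum(x * x * mu for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1
```

Under this model, `math.exp(b1)` is the multiplicative change in the expected count per unit of x. In R the equivalent fit is usually a single `glm(..., family = poisson)` call; the hand-written solver here only serves to make the underlying steps visible.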
DOI files:
- doi-data.zip: contains information harvested from DataCite and CrossRef looking up various DOI prefixes.
- 2-DOI_APIs.txt: not used for computation, but contains information on the DataCite and CrossRef APIs.
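As an illustration of what "DOI prefix" means in these files, the sketch below pulls the prefix (the `10.NNNN` part before the first slash) out of a link and tallies prefixes across a list of links. It is a Python sketch, not the logic of 2-DOIScraping.R; the regular expression and the function names are assumptions, and the example DOIs in the usage note are made up.

```python
import re
from collections import Counter
from urllib.parse import unquote

# A DOI starts with "10.", a registrant code of 4 to 9 digits, then "/".
DOI_PATTERN = re.compile(r"(?<![\d.])(10\.\d{4,9})/\S+")


def doi_prefix(link):
    """Return the DOI prefix (e.g. '10.1234') found in link, or None."""
    # Decode percent-escapes first so "10.1234%2Fabc" is recognised.
    match = DOI_PATTERN.search(unquote(link))
    return match.group(1) if match else None


def count_prefixes(links):
    """Count how often each DOI prefix occurs; links without a DOI are skipped."""
    prefixes = (doi_prefix(link) for link in links)
    return Counter(p for p in prefixes if p is not None)
```

For example, `doi_prefix("https://doi.org/10.1234/abcd.5678")` gives `"10.1234"`, while a plain URL with no DOI gives `None`. The resulting prefixes are the kind of value that can then be looked up against the DataCite and CrossRef APIs.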
Other files:
- Other files have a number from 1 to 4 at the start of the file name, indicating the step of the analysis workflow that created them. Most capture all or part of the data at intermediate points in the analysis.
- Files whose names start with 5 are used for the final calculations and figures.
Additional details
- Updated 2024-03-06: Version 2 of code, reflecting updates to the final analysis of the data.