SupplementaryData

Analysis of supplementary data links in CaltechAUTHORS repository

Author: Kristin A. Briney

Date: 2022-01

Latest dataset downloaded: 2023-05-16 by Tom Morrell, @tmorrell

Where are Caltech authors sharing their research data?
What data is no longer available?
- How does this map over time?
- Are there specific sites where data is more likely to disappear?
Can we conduct outreach to gather the missing data into the CaltechDATA repository?

1-DataParsing.R: Cleans up the input data "supp-data.csv" for processing and calculates some summary statistics.
2-DOIScraping.R: Scrapes and processes DOI prefixes.
3-DOIAnalysis.R: Adds DOI prefix information back into larger dataset.
4-LinkDecay.R: Scrapes URLs and DOIs to see if the data is still available and outputs data as "4-linkDecay.csv".
5-dataRot.R: Pulls from "5-sampling.csv" to determine the error rate and fits the final data, "5-resolves.csv", to measure loss over time.
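
The DOI prefix steps in scripts 2 and 3 can be pictured with a minimal base R sketch. It is not the code in those scripts: the example records, the `extract_doi_prefix()` helper, and the hand-typed `prefix_lookup` table (standing in for the scraped prefix information) are all assumptions made for illustration.

```r
# Hypothetical example records; the real input is "supp-data.csv".
links <- data.frame(
  record_id = c(1, 2, 3),
  doi = c("10.22002/D1.1234",
          "https://doi.org/10.5061/dryad.abc123",
          "10.5281/zenodo.1000"),
  stringsAsFactors = FALSE
)

# A DOI prefix is "10." plus a registrant code, ending at the first slash.
# Returns NA where no prefix is found.
extract_doi_prefix <- function(doi) {
  m <- regexpr("10\\.[0-9]{4,9}(?=/)", doi, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(doi))
  out[hit] <- regmatches(doi, m)
  out
}

# Hand-typed stand-in for the scraped prefix information (assumption).
prefix_lookup <- data.frame(
  doi_prefix = c("10.22002", "10.5061", "10.5281"),
  repository = c("CaltechDATA", "Dryad", "Zenodo"),
  stringsAsFactors = FALSE
)

# Add the prefix to each record, then join the lookup back onto the data.
links$doi_prefix <- extract_doi_prefix(links$doi)
links <- merge(links, prefix_lookup, by = "doi_prefix", all.x = TRUE)
```

Using `all.x = TRUE` keeps records whose prefix is missing from the lookup, so unknown prefixes show up as `NA` rather than silently dropping out of the dataset.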

Other files are labelled 1-5 at the beginning of the file name to note which part of the analysis workflow they correspond to.
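
As a rough picture of what "loss over time" could look like, the sketch below computes the share of links that still resolve per publication year and fits a simple logistic trend. This is an assumed approach, not necessarily the fit used in 5-dataRot.R; the column names `pub_year` and `resolves` and all values are hypothetical, and the error-rate correction from "5-sampling.csv" is not modelled here.

```r
# Hypothetical data: one row per link, with publication year and
# whether the link still resolved when checked.
link_checks <- data.frame(
  pub_year = c(2012, 2012, 2015, 2015, 2015, 2020, 2020, 2020),
  resolves = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Share of links that still resolve, by publication year.
share_resolving <- tapply(link_checks$resolves, link_checks$pub_year, mean)

# Logistic regression of "still resolves" on publication year.
# A positive slope means newer links are more likely to resolve.
fit <- glm(resolves ~ pub_year, data = link_checks, family = binomial)
year_slope <- coef(fit)[["pub_year"]]
```

With real data, the per-year shares are a useful sanity check on the fitted trend before reading anything into the slope.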

Fix regex to match with base URL domain in script 1
Scraping DOI prefixes
- need to handle 404's
- clean and process scraped data
- analyze URL and DOI domains
Identify 404's
- dedupe publications w/ multiple links? (not necessary)
- count URLs that match base domain
- webscrape
- count 404's
Parse publication date information
Analyze 404's
- Verify that 404's are actually dead
- Check file-type URLs by hand
- Analyze with respect to date
- Analyze with respect to site
Outreach
- Check to see if we already have supplemental data files in CaltechAUTHORS corresponding to dead links
- Create spreadsheet of articles, links, and contacts
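
The domain-matching and 404-counting items above could be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the regex used in script 1: `extract_domain()` is a hypothetical helper, the URLs and status codes are made up, and a 404 is only treated as a candidate dead link that still needs verification.

```r
# Reduce a URL to its base domain: drop the scheme, then everything from the
# first "/", "?", "#" or ":" (port) onward, then a leading "www.".
extract_domain <- function(url) {
  host <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
  host <- sub("[/?#:].*$", "", host)
  host <- sub("^www\\.", "", host)
  tolower(host)
}

# Hypothetical results of a link check: one row per URL with its HTTP status.
checked <- data.frame(
  url = c("https://www.example.org/a.zip",
          "http://example.org/b",
          "https://data.example.edu/c"),
  status = c(404L, 200L, 404L),
  stringsAsFactors = FALSE
)

checked$domain <- extract_domain(checked$url)
checked$dead <- checked$status == 404L

# Count candidate dead links per base domain.
dead_by_domain <- aggregate(dead ~ domain, data = checked, FUN = sum)
```

Stripping "www." means "www.example.org" and "example.org" are counted as one site, which may or may not be the grouping wanted for the per-site analysis.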

Are there publications with multiple data links?
- Yes, that is calculated in 1-DataParsing.R as dataLinks_perRecord and dataLinks_avg
I'm assuming one link per description (either URL or DOI), so it should not be possible to have both a URL and a DOI for one related link record?
- Verified with George on 2022-03-23 that there are no duplicates
- Either have URL or DOI, DOI preferred
How thorough is the collection of related links?
- Verified with George on 2022-04-28 that we've been collecting supplemental files and links for about a decade
- Thoroughness is hard to estimate
Do I want to clean and analyze related link descriptions?
- Not as part of this project, could be a later effort
Do I want to reconcile URL and DOI domains (e.g. merge CaltechDATA URLs with DOIs), or keep them separate?
- Going to keep separate
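
Two of the answers above, links per record and "either URL or DOI", can be checked with a short base R sketch. The names `dataLinks_perRecord` and `dataLinks_avg` come from 1-DataParsing.R, but how that script computes them is not shown here; the data frame, its column names, and the calculation below are assumptions for illustration.

```r
# Hypothetical related-link records: each row has a URL or a DOI, not both.
related <- data.frame(
  record_id = c("a", "a", "b", "c", "c", "c"),
  url = c("https://example.org/1", NA, "https://example.org/2", NA, NA,
          "https://example.org/3"),
  doi = c(NA, "10.22002/D1.1", NA, "10.5061/dryad.x", "10.5281/zenodo.1", NA),
  stringsAsFactors = FALSE
)

# Number of data links per publication record, and the average across records.
dataLinks_perRecord <- table(related$record_id)
dataLinks_avg <- mean(as.vector(dataLinks_perRecord))

# Check the "either URL or DOI" assumption: exactly one of the two is present.
has_exactly_one <- xor(!is.na(related$url), !is.na(related$doi))

# Single link column, taking the DOI when present and the URL otherwise.
related$link <- ifelse(!is.na(related$doi), related$doi, related$url)
```

Any row where `has_exactly_one` is `FALSE` would contradict the one-link-per-description assumption and is worth inspecting by hand.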

Aydin, O. (2018). R Web Scraping Quick Start Guide: Techniques and Tools to Crawl and Scrape Data from Websites. United Kingdom: Packt Publishing.