SupplementaryData
Analysis of supplementary data links in CaltechAUTHORS repository
Author: Kristin A. Briney
Date: 2022-01
Latest dataset downloaded: 2023-05-16 by Tom Morrell, @tmorrell
Research Questions
- Where are Caltech authors sharing their research data?
- What data is no longer available?
- How does this map over time?
- Are there specific sites where data is more likely to disappear?
- Can we conduct outreach to gather the missing data into the CaltechDATA repository?
Code Organization
- 1-DataParsing.R: Cleans up the input data "supp-data.csv" for processing and calculates some summary statistics.
- 2-DOIScraping.R: Scrapes and processes DOI prefixes.
- 3-DOIAnalysis.R: Adds DOI prefix information back into larger dataset.
- 4-LinkDecay.R: Scrapes URLs and DOIs to see if the data is still available and outputs data as "4-linkDecay.csv".
- 5-dataRot.R: Pulls from "5-sampling.csv" to determine the error rate and fits the final data, "5-resolves.csv", to measure loss over time.
Other files are labelled 1-5 at the beginning of the file name to note which part of the analysis workflow they correspond to.
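As an illustration of the kind of parsing the first three scripts deal with, here is a minimal sketch in base R of pulling the host out of a URL and the prefix out of a DOI. It is not taken from the repository's scripts: the function names `base_domain` and `doi_prefix`, the regular expressions, and the example values are all assumptions made for this sketch, and "base domain" is taken here to mean the host name without a leading "www.".

```r
# Hypothetical helpers for illustration only; not the code used in scripts 1-3.

# Return the host part of a URL, lower-cased and without a leading "www.".
base_domain <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                     # drop path, query, fragment
  host <- sub("^[^@]*@", "", host)                      # drop any user info
  host <- sub(":[0-9]+$", "", host)                     # drop the port
  sub("^www\\.", "", tolower(host))
}

# Return the DOI prefix (the "10.xxxx" part before the first slash),
# or NA when the string does not contain one.
doi_prefix <- function(doi) {
  pattern <- "^.*?(10\\.[0-9]{4,9})/.*$"
  ifelse(grepl(pattern, doi, perl = TRUE),
         sub(pattern, "\\1", doi, perl = TRUE),
         NA_character_)
}

# Made-up example values
base_domain("https://www.Example.org:8080/path?x=1")  # "example.org"
doi_prefix("https://doi.org/10.1234/abcd.5678")       # "10.1234"
doi_prefix("no doi here")                             # NA
```

Both helpers are vectorized, so they could be applied to a whole column of links at once. Edge cases such as IPv6 hosts or malformed URLs are not handled.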
To-Do List
- Fix regex to match with base URL domain in script 1
- Scraping DOI prefixes
- need to handle 404's
- clean and process scraped data
- analyze URL and DOI domains
- Identify 404's
- dedupe publications w/ multiple links? (not necessary)
- count URLs that match base domain
- webscrape
- count 404's
- Parse publication date information
- Analyze 404's
- Verify that 404's are actually dead
- Check file-type URLs by hand
- Analyze with respect to date
- Analyze with respect to site
- Outreach
- Check to see if we already have supplemental data files in CaltechAUTHORS corresponding to dead links
- Create spreadsheet of articles, links, and contacts
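For the "Analyze with respect to date" item above, one simple starting point is the share of links that still resolve, grouped by publication year. The sketch below uses base R and is only an assumption about how this could be done: the function name `resolve_rate_by_year`, the column names `year` and `resolves`, and the toy data are hypothetical and do not come from "5-resolves.csv".

```r
# Hypothetical sketch; column names and data are made up for illustration.

# Summarise, for each publication year, how many links were checked
# and what share of them still resolve.
resolve_rate_by_year <- function(links) {
  share <- tapply(links$resolves, links$year, mean)    # mean of TRUE/FALSE = share TRUE
  n     <- tapply(links$resolves, links$year, length)  # number of links per year
  data.frame(
    year            = as.integer(names(share)),
    n_links         = as.vector(n),
    share_resolving = as.vector(share)
  )
}

# Toy input: one row per link, with its publication year and check result
toy_links <- data.frame(
  year     = c(2012, 2012, 2018, 2018, 2018, 2018),
  resolves = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

rates <- resolve_rate_by_year(toy_links)
print(rates)
```

The same grouping could be done by site instead of year by swapping the grouping column. This sketch assumes no missing values in `resolves`.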
Issues that have been Resolved
- Are there publications with multiple data links?
- Yes, that is calculated in 1-DataParsing.R as dataLinks_perRecord and dataLinks_avg
- I'm assuming one link per description (either URL or DOI), so it should not be possible to have both a URL and a DOI for one related link record?
- Verified with George on 2022-03-23 that there are no duplicates
- Either have URL or DOI, DOI preferred
- How thorough is the collection of related links?
- Verified with George on 2022-04-28 that we've been collecting supplemental files and links for about a decade
- Thoroughness is hard to estimate
- Do I want to clean and analyze related link descriptions?
- Not as part of this project, could be a later effort
- Do I want to rectify URL and DOI domains (e.g. merge CaltechDATA URLs with DOIs), or keep separate?
Resources
- Aydin, O. (2018). R Web Scraping Quick Start Guide: Techniques and Tools to Crawl and Scrape Data from Websites. United Kingdom: Packt Publishing.