
dataset

dataset is a command line tool for working with JSON (object) documents stored as collections. It supports basic storage actions (e.g. CRUD operations, filtering and extraction) as well as indexing and searching. A project goal of dataset is to “play nice” with shell scripts and other Unix tools (e.g. it respects standard in, out and error with minimal side effects). This means it is easily scriptable via Bash, POSIX shell or interpreted languages like R.

dataset also includes an implementation as a Python 3 module, which replicates the functionality of the command line tool. The module requires Python 3.6 or later.

Finally, dataset is a Go package for managing JSON documents and their attachments on disk or in cloud storage (e.g. Amazon S3, Google Cloud Storage). The command line utilities exercise this package extensively.

The inspiration for creating dataset was the desire to process metadata as JSON document collections using Unix shell utilities and pipelines. While it has grown in capabilities, that remains a core use case.

dataset organizes JSON documents by unique names in collections. Collections are represented as an index into a series of buckets. The buckets are subdirectories (or paths under cloud storage services). Buckets hold individual JSON documents and their attachments. A JSON document is assigned to a bucket automatically (and the bucket is created if necessary) when it is added to a collection. Assigning documents to buckets avoids placing too many documents under a single path (e.g. some Unix systems limit how many entries a single directory can hold). In addition to using the dataset command, you can list and manipulate the JSON documents directly with common Unix commands like ls, find, grep or their cloud counterparts.
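Because a collection is a directory tree, ordinary Unix tools can inspect it. The sketch below builds a throwaway stand-in for a collection, then lists and searches it with find and grep. The bucket names (aa, ab), the one-file-per-key naming (freda.json) and the second record (mojo) are assumptions made for illustration; the layout that dataset actually writes may differ.

```shell
#!/bin/sh
# Build a throwaway directory that mimics a collection.
# The bucket names and file names here are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/mystuff.ds/aa" "$demo/mystuff.ds/ab"
printf '%s\n' '{"name":"freda","email":"freda@inverness.example.org"}' \
    > "$demo/mystuff.ds/aa/freda.json"
printf '%s\n' '{"name":"mojo","email":"mojo@zbs.example.org"}' \
    > "$demo/mystuff.ds/ab/mojo.json"

# List every JSON document, whichever bucket it landed in.
documents=$(cd "$demo/mystuff.ds" && find . -type f -name '*.json' | sort)
printf '%s\n' "$documents"

# List the documents whose name field is "freda".
matches=$(cd "$demo/mystuff.ds" && grep -rl '"name":"freda"' .)
printf '%s\n' "$matches"

# Clean up the throwaway directory.
rm -rf "$demo"
```

The same find and grep invocations should work against a real collection directory, as long as you adjust the file name pattern to whatever the collection actually contains.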

See getting-started-with-datataset.md for a tour of functionality.

Limitations of dataset

dataset has many limitations; some are listed below.

Operations

The basic operations supported by dataset are listed below, organized by collection and JSON document level.

Collection Level

JSON Document level

JSON Document Attachments

Search

Example

Common operations using the dataset command line tool

    # Create a collection "mystuff.ds". The ".ds" suffix lets the bin/dataset
    # command know that's the collection to use.
    bin/dataset mystuff.ds init
    # If successful you should see an OK, otherwise an error message

    # Create a JSON document
    bin/dataset mystuff.ds create freda '{"name":"freda","email":"freda@inverness.example.org"}'
    # If successful you should see an OK, otherwise an error message

    # Read a JSON document
    bin/dataset mystuff.ds read freda
    
    # Path to JSON document
    bin/dataset mystuff.ds path freda

    # Update a JSON document
    bin/dataset mystuff.ds update freda '{"name":"freda","email":"freda@zbs.example.org", "count": 2}'
    # If successful you should see an OK, otherwise an error message

    # List the keys in the collection
    bin/dataset mystuff.ds keys

    # Get keys filtered for the name "freda"
    bin/dataset mystuff.ds keys '(eq .name "freda")'

    # Join freda-profile.json with "freda" adding unique key/value pairs
    bin/dataset mystuff.ds join append freda freda-profile.json

    # Join freda-profile.json with "freda", overwriting the key/value pairs they
    # have in common and adding the unique key/value pairs from freda-profile.json
    bin/dataset mystuff.ds join overwrite freda freda-profile.json

    # Delete a JSON document
    bin/dataset mystuff.ds delete freda

    # Import data from a CSV file using column 1 as key
    bin/dataset -quiet -nl=false mystuff.ds import-csv my-data.csv 1

    # To remove the collection just use the Unix shell command
    rm -fR mystuff.ds
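Because the commands above write their results to standard out, they compose inside shell functions and pipelines. The following is a minimal sketch, not part of dataset itself: it combines the keys and read actions shown above to print every document in a collection. It assumes that keys prints one key per line; the function name dump_collection and the DATASET_BIN variable are hypothetical names introduced for this example.

```shell
#!/bin/sh
# DATASET_BIN is a hypothetical override; it defaults to the path used above.
DATASET_BIN="${DATASET_BIN:-bin/dataset}"

# dump_collection COLLECTION
# Print every JSON document in COLLECTION, one after another.
# Assumes `keys` prints one key per line. No error handling is attempted.
dump_collection() {
    "$DATASET_BIN" "$1" keys | while IFS= read -r key; do
        # Skip blank lines in the key listing.
        [ -n "$key" ] || continue
        # Redirect stdin so the read action cannot consume the key list.
        "$DATASET_BIN" "$1" read "$key" < /dev/null
    done
}
```

Calling dump_collection mystuff.ds would then stream the documents to standard out, where they can be piped on to other Unix tools.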

Releases

Compiled versions are provided for Linux (amd64), Mac OS X (amd64), Windows 10 (amd64) and Raspbian (ARM7). See https://github.com/caltechlibrary/dataset/releases.