Use dataset with S3

dataset now supports integration with S3 storage. To store dataset content on AWS S3, first download and install the AWS CLI/SDK, set up your buckets, and configure permissions, access keys, etc. dataset uses your local SDK’s configuration (e.g. $HOME/.aws) to configure the connection. You then need only set one environment variable, run the dataset init command, and set the environment variable it suggests for working with your dataset stored on S3.
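
Before pointing dataset at S3, it can help to confirm that the prerequisites described above are in place. The following is a minimal sketch and not part of dataset itself. It assumes the AWS CLI is installed under the name `aws` and that the SDK configuration lives in the default `$HOME/.aws` directory; the function name `check_aws_setup` is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of dataset.
# Assumes the AWS CLI binary is named "aws" and that the SDK
# configuration lives in the default $HOME/.aws directory.
check_aws_setup() {
    # Report whether the AWS CLI is on the PATH.
    if command -v aws >/dev/null 2>&1; then
        echo "aws cli: found"
    else
        echo "aws cli: not found on PATH"
    fi
    # Report whether the default configuration directory exists.
    if [ -d "${HOME}/.aws" ]; then
        echo "aws config: ${HOME}/.aws exists"
    else
        echo "aws config: ${HOME}/.aws is missing"
    fi
}

check_aws_setup
```

The check only reports what it finds. It does not verify that your access keys or bucket permissions are valid.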

Basic steps

  1. Set the AWS_SDK_LOAD_CONFIG environment variable
  2. Invoke the dataset init command with your “s3://” URL followed by your collection name
  3. Set the DATASET environment variable

In the following shell example, our bucket is called “dataset.library.example.edu” and our dataset collection is called “mycollection”.

    export AWS_SDK_LOAD_CONFIG=1
    dataset init s3://dataset.library.example.edu/mycollection
    export DATASET=s3://dataset.library.example.edu/mycollection
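
Since every later command depends on these two variables, a small guard can catch a missing or malformed value early. This is a sketch under stated assumptions and not something dataset provides. The function name `check_dataset_env` is hypothetical, and the check only confirms that DATASET has the shape `s3://BUCKET/COLLECTION`; it does not contact S3.

```shell
#!/bin/sh
# Hypothetical helper; not part of dataset.
# Returns 0 when the two environment variables look usable,
# otherwise prints a message to stderr and returns 1.
check_dataset_env() {
    if [ -z "${AWS_SDK_LOAD_CONFIG:-}" ]; then
        echo "AWS_SDK_LOAD_CONFIG is not set" >&2
        return 1
    fi
    case "${DATASET:-}" in
        s3://?*/?*) ;;  # a bucket name followed by a collection name
        *)
            echo "DATASET should look like s3://BUCKET/COLLECTION" >&2
            return 1
            ;;
    esac
    echo "environment ok: ${DATASET}"
}

# Example values taken from the steps above.
export AWS_SDK_LOAD_CONFIG=1
export DATASET=s3://dataset.library.example.edu/mycollection
check_dataset_env
```

With the example values, the function prints `environment ok: s3://dataset.library.example.edu/mycollection`.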

We can now create a JSON record called “waldo-reading” and add it to our collection.

    cat <<EOT >waldo-reading.json
    {
        "reader":"Waldo",
        "author":"Robert Louis Stevenson",
        "title":"The Black Arrow",
        "url":"https://www.gutenberg.org/ebooks/848"
    }
    EOT
    cat waldo-reading.json | dataset create waldo-reading

List the keys in our dataset

    dataset list keys

Now let’s download a copy of what Waldo is reading and attach it to our “waldo-reading” record.

    curl -L -O https://www.gutenberg.org/ebooks/848.txt.utf-8
    dataset attach waldo-reading 848.txt.utf-8

To list the attachments on our record

    dataset attachments waldo-reading