dataset stores JSON objects and can store a form of data frame based on grids of data. This document outlines the ideas behind grid and frame support in dataset.
Collections are at the core of the dataset tool. A collection is a bucketed directory structure storing JSON objects in plain text with optional attachments. The root folder of the collection contains a collection.json file whose metadata associates each object’s name with the bucket where its JSON object is stored. One of the guiding ideas behind dataset has been to keep everything as plain text whenever reasonable. The dataset project provides a Go package for working with dataset collections, a Python package (based on a shared library built from the Go package) and a command line tool.
Dataset collections are typically stored on your local disk but may easily be stored in Amazon’s S3 (or a compatible platform) or Google’s cloud storage. Dataset can also import from and export to a Google Sheet or Excel file.
Dataset isn’t a database (there are plenty of JSON oriented databases out there, e.g. CouchDB, MongoDB). Rather the focus is on providing a mechanism to manage JSON objects, group them and provide alternative data shapes for viewing the collection (e.g. frames, grids).
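To make that concrete, here is a minimal, hypothetical sketch of working with objects from Python. It assumes the Python package exposes create() and read() functions mirroring the command line tool’s verbs of the same names; the exact signatures may differ.

import dataset

# Hypothetical sketch: create() and read() are assumed to mirror the
# command line verbs of the same names; signatures are illustrative.
err = dataset.create('Pubs.ds', 'verne-20k', {
    'title': '20,000 Leagues under the Sea',
    'authors': 'Verne, Jules'
})
if err != '':
    print(err)
(obj, err) = dataset.read('Pubs.ds', 'verne-20k')
if err == '':
    print(obj['title'])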
A grid is a 2D JSON array built by combining a set of keys (rows) and a list of dot paths (columns). It is similar to the data shape you’d use in a spreadsheet. It is a convenient shape for building indexes, filtering and sorting. Grid support is also available in dataset’s Python 3 package.
Here’s an example of a possible grid for titles and authors.
[
["title", "authors"],
["20,000 Leagues under the Sea", "Verne, Jules"],
["All Around the Moon", "Verne, Jules"],
["The Short Reign of Pippin IV", "Steinbeck, John"]
]
If a column is missing a value then you should see a “null” for that cell. Here is an expanded example where we’ve added a link to Project Gutenberg as a third column.
[
["title", "authors", "gutenberg_href"],
["20,000 Leagues under the Sea", "Verne, Jules", "http://www.gutenberg.org/ebooks/6538"],
["All Around the Moon", "Verne, Jules", "http://www.gutenberg.org/ebooks/16457"],
["The Short Reign of Pippin IV", "Steinbeck, John", null]
]
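Because the grid is plain JSON, a missing value becomes None once the grid is loaded into Python. Here is a minimal sketch, using the grid above written out as a literal, that filters out the rows without a Gutenberg link.

# The grid above as a Python literal; the JSON null becomes None.
grid = [
    ["title", "authors", "gutenberg_href"],
    ["20,000 Leagues under the Sea", "Verne, Jules", "http://www.gutenberg.org/ebooks/6538"],
    ["All Around the Moon", "Verne, Jules", "http://www.gutenberg.org/ebooks/16457"],
    ["The Short Reign of Pippin IV", "Steinbeck, John", None]
]
header, rows = grid[0], grid[1:]
# keep only the rows whose third cell (gutenberg_href) is present
linked = [row for row in rows if row[2] is not None]
for title, authors, href in linked:
    print(f'{title} ({authors}): {href}')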
This example creates a two column grid of DOIs and titles from a dataset collection called Pubs.ds using the dataset command. Step one generates a list of keys, saving them to a file; step two uses that file of keys to generate the grid from the “.doi” and “.title” fields found in the JSON objects stored in the Pubs.ds collection. If either “.doi” or “.title” is missing in a JSON object then a “null” value is used in its place. This way the grid rows retain the same number of cells.
dataset Pubs.ds keys >pubs.keys
dataset Pubs.ds grid pubs.keys .doi .title
The 2D JSON array is easy to process in programming languages like Python. Below is an example of using a grid to sort across an entire collection, leveraging Python’s standard sort method for lists.
import sys
import dataset
from operator import itemgetter

keys = dataset.keys("Pubs.ds")
(g, err) = dataset.grid("Pubs.ds", keys, [".doi", ".title"])
# g holds the 2D array
if err != '':
    print(f'{err}')
    sys.exit(1)
# sort by title
g.sort(key=itemgetter(1))
for row in g:
    (doi, title) = row
    print(f'{doi} {title}')
Implementing the grid verb started me thinking about the similarity to data frames in Python, Julia and Octave. A frame could be defined as a grid plus metadata about the grid. In this context dataset could operate on *grid*s stored as one or more *frame*s. This in turn could lead to interesting processing pipelines, e.g. object(s) to collections, collections to grids, grids to frames which can then be stored back as objects in collections.
To make a frame from a grid we add the missing bits of useful metadata. At a glance, the dot paths that define the grid’s columns and the record keys used to create its rows are useful metadata; both could be captured when the grid is created. It would also be useful to include labels for exporting to spreadsheets. A column’s label could be automatically generated by running its dot path through a translation function. You could include the time the grid was generated as well as the name of the collection it came from. It would be handy if you could name the frame too.
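Such a translation function is easy to sketch. The helper below is hypothetical (not something dataset provides); it just turns a dot path into a spreadsheet friendly label.

# Hypothetical helper: derive a human readable label from a dot path.
def path_to_label(dot_path):
    name = dot_path.lstrip('.')
    return ' '.join(part.capitalize() for part in name.split('_'))

print(path_to_label('.gutenberg_href'))  # Gutenberg Href
print(path_to_label('.title'))           # Title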
The frame_name is the only information missing from the grid command and is easy enough to add to its syntax.
Defining the frame’s metadata manually could get cumbersome. It feels like a similar problem to defining search indexes. If we start with a rich context at grid creation, fleshing out the frame definition is just a matter of adding the frame’s name.
A system of *frame*s could be stored alongside dataset’s collection.json file. This allows frame definitions to travel with the collection for later reuse or to be automatically refreshed.
{
"frame_name": ...,
"collection_name": ...,
"updated": TIMESTAMP,
"created": TIMESTAMP,
"labels": [ ... ],
"dot_paths": [ ... ],
"column_types": [ ... ],
"grid": [[ ... ], ... ]
}
A sketch of the frame structure.
Frame definitions plus column type detection may allow automatic generation of index definitions for use by Bleve, Solr, and Lunr based search engines.
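What column type detection might look like is sketched below. The helper is hypothetical and the heuristics deliberately naive, but it shows the idea of inferring a frame’s column_types from sample cell values.

from datetime import datetime

# Hypothetical helper: guess a frame column type from a sample cell value.
def detect_column_type(value):
    if isinstance(value, (int, float)):
        return 'number'
    if isinstance(value, list):
        return 'list'
    if isinstance(value, dict):
        return 'object'
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # crude ISO 8601 date check
            return 'datetime'
        except ValueError:
            pass
        if ' ' not in value and len(value) < 64:
            return 'keyword'  # short token-like strings, e.g. DOIs
    return 'string'

print(detect_column_type('10.1000/xyz123'))                # keyword
print(detect_column_type('The Short Reign of Pippin IV'))  # string
print(detect_column_type(1956))                            # number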
Example creating a frame named “titles-and-dois”
dataset Pubs.ds keys >pubs.keys
dataset Pubs.ds frame titles-and-dois pubs.keys .doi .title
Or in Python
keys = dataset.keys('Pubs.ds')
frame = dataset.frame('Pubs.ds', 'titles-and-dois', keys, ['.doi', '.title'])
Example of getting the contents of an existing frame.
dataset Pubs.ds frame titles-and-dois
Or in Python
frame = dataset.frame('Pubs.ds', 'titles-and-dois')
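Assuming the returned frame follows the sketch above (a JSON object carrying its grid along with metadata; the field names below come from that sketch and are assumptions, not a documented API), using it from Python might look like this.

import dataset

frame = dataset.frame('Pubs.ds', 'titles-and-dois')
# field names below assume the frame structure sketched earlier
print(frame.get('updated'))
for row in frame.get('grid', []):
    (doi, title) = row
    print(f'{doi} {title}')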
Regenerating “titles-and-dois”.
dataset Pubs.ds reframe titles-and-dois
Or in Python
frame = dataset.reframe('Pubs.ds', 'titles-and-dois')
Regenerating “titles-and-dois” with an updated list of keys.
dataset Pubs.ds keys >updated.keys
dataset Pubs.ds reframe titles-and-dois updated.keys
In Python (with updated_keys holding the new list of keys)
frame = dataset.reframe('Pubs.ds', 'titles-and-dois', updated_keys)
Labels are represented as a JSON array. When we set the labels explicitly we’re replacing the entire array at once. In this example the frame’s grid has two columns.
dataset Pubs.ds frame-labels titles-and-dois '["Column 1", "Column 2"]'
In Python
err = dataset.frame_labels('Pubs.ds', 'titles-and-dois', ["Column 1", "Column 2"])
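Labels could also be derived from the frame’s dot paths rather than typed by hand, e.g. with the kind of translation helper sketched earlier, and then applied in a single frame_labels call.

import dataset

dot_paths = ['.doi', '.title']
# derive "Doi" and "Title" from the dot paths, adjusting by hand as needed
labels = [p.lstrip('.').replace('_', ' ').title() for p in dot_paths]
err = dataset.frame_labels('Pubs.ds', 'titles-and-dois', labels)
if err != '':
    print(err)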
Column types are represented as a JSON array. Column types provide hints to the indexer when indexing a collection or frame. The standard JSON types are supported (e.g. string, number, object, list) plus keyword, datetime, and geolocation. In this example we will change the column types from [“string”, “string”] to [“string”, “keyword”].
dataset Pubs.ds frame-types titles-and-dois '["string", "keyword"]'
In Python
err = dataset.frame_types('Pubs.ds', 'titles-and-dois', ["string","keyword"])
Removing the frame “titles-and-dois”.
dataset Pubs.ds remove-frame titles-and-dois
Or in Python
err = dataset.remove_frame('Pubs.ds', 'titles-and-dois')
Listing the frames defined in a collection.
dataset Pubs.ds frames
Or in Python
frame_names = dataset.frames('Pubs.ds')
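Listing frame names also makes it easy to refresh every frame in a collection after its objects change. A small maintenance sketch using only the verbs shown above:

import dataset

# refresh every frame defined in the collection
for frame_name in dataset.frames('Pubs.ds'):
    frame = dataset.reframe('Pubs.ds', frame_name)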
One possible use of frames would be in rendering search indexes like those used by Bleve or Lunrjs. A frame provides all the information needed to transform the frame’s value (i.e. its grid) into minimalist documents for indexing. One workflow for creating a searchable collection might be