dataset stores JSON objects and can store a form of data frame based on grids of data. This document outlines the ideas behind grid and frame support in dataset.
Collections are at the core of the dataset tool. A collection is a bucketed directory structure storing JSON objects in plain text with optional attachments. The root folder for the collection contains a collection.json file with the metadata associating a name to the bucket where each JSON object is stored. One of the guiding ideas behind dataset was to keep everything as plain text whenever reasonable. The dataset project provides a Go package for working with dataset collections, a Python package (based on a shared library with the Go package) and a command line tool.
Dataset collections are typically stored on your local disk but may easily be stored in Amazon’s S3 (or a compatible platform) or Google’s cloud storage. Dataset can also import and export to/from a Google Sheet or Excel file.
Dataset isn’t a database (there are plenty of JSON oriented databases out there, e.g. CouchDB, MongoDB). Rather the focus is on providing a mechanism to manage JSON objects, group them, and provide alternative data shapes for viewing the collection (e.g. frames, grids).
A grid is a 2D JSON array built by combining a set of keys (rows) and a list of dot paths (columns). It is similar to the data shape you’d use in spreadsheets. It is a convenient data shape for building indexes, filtering and sorting. Grid support is also available in dataset’s Python 3 package.
Here’s an example of a possible grid for titles and authors.
[
["title", "authors"],
["20,000 Leagues under the Sea", "Verne, Jules"],
["All Around the Moon", "Verne, Jules"],
["The Short Reign of Pippin IV", "Steinbeck, John"]
]
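Since a grid is just a JSON 2D array, once parsed it can be worked with using ordinary list operations, no dataset-specific code required. A minimal sketch using the grid above:

```python
# The grid literal below copies the example above: row 0 is the header,
# the remaining rows hold one record each.
grid = [
    ["title", "authors"],
    ["20,000 Leagues under the Sea", "Verne, Jules"],
    ["All Around the Moon", "Verne, Jules"],
    ["The Short Reign of Pippin IV", "Steinbeck, John"],
]

header, rows = grid[0], grid[1:]

# Filter: keep only the rows whose authors column matches.
verne = [row for row in rows if row[1] == "Verne, Jules"]
print(len(verne))  # 2
```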
If a column is missing a value then you should see a “null” for that cell. Here is an expanded example where we’ve added a link to Project Gutenberg as a third column.
[
["title", "authors", "gutenberg_href"],
["20,000 Leagues under the Sea", "Verne, Jules", "http://www.gutenberg.org/ebooks/6538"],
["All Around the Moon", "Verne, Jules", "http://www.gutenberg.org/ebooks/16457"],
["The Short Reign of Pippin IV", "Steinbeck, John", null]
]
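Those null cells matter when exporting a grid to a row-oriented format. A small sketch (plain Python, csv module only) that writes the expanded grid above as CSV, mapping null cells to empty strings so every row keeps the same width:

```python
import csv
import io

# The expanded grid from the example above; JSON null parses to None.
grid = [
    ["title", "authors", "gutenberg_href"],
    ["20,000 Leagues under the Sea", "Verne, Jules", "http://www.gutenberg.org/ebooks/6538"],
    ["All Around the Moon", "Verne, Jules", "http://www.gutenberg.org/ebooks/16457"],
    ["The Short Reign of Pippin IV", "Steinbeck, John", None],
]

buf = io.StringIO()
writer = csv.writer(buf)
for row in grid:
    # Replace null cells with empty strings so no column is dropped.
    writer.writerow(["" if cell is None else cell for cell in row])

print(buf.getvalue())
```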
This example creates a two column grid with DOIs and titles from a dataset collection called Pubs.ds using the dataset command. Step one generates a list of keys, saving them to a file; step two uses that file of keys to generate the grid, specifying the “.doi” and “.title” fields found in the JSON objects stored in the Pubs.ds collection. If either “.doi” or “.title” is missing in a JSON object then a “null” value will be used. This way the grid rows retain the same number of cells.
dataset Pubs.ds keys >pubs.keys
dataset Pubs.ds grid pubs.keys .doi .title
The 2D JSON array is easy to process in programming languages like Python. Below is an example of using a grid for sorting across an entire collection leveraging Python’s standard sort method for lists.
import sys
import dataset
from operator import itemgetter

keys = dataset.keys("Pubs.ds")
(g, err) = dataset.grid("Pubs.ds", keys, [".doi", ".title"])
# g holds the 2D array
if err != '':
    print(f'{err}')
    sys.exit(1)
# sort by title
g.sort(key=itemgetter(1))
for row in g:
    (doi, title) = row
    print(f'{doi} {title}')
Implementing the grid verb started me thinking about the similarity to data frames in Python, Julia and Octave. A frame could be defined as a grid plus metadata about the grid. In this context dataset could operate on *grid*s stored as one or more *frame*s. This in turn could lead to interesting processing pipelines, e.g. object(s) to collections, collections to grids, grids to frames which can then be stored back as objects in collections.
To make a frame from a grid we add the missing bits of useful metadata. At a glance, the dot paths that define the columns of the grid and the record keys that define the rows are useful metadata; both could be captured when the grid is created. It would also be useful to include labels for exporting to spreadsheets. A cell’s label could be automatically generated by running a dot path through a translation function. You could include the time the grid was generated as well as the collection name of its origin. It would be handy if you could name the frame too.
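The translation function for labels could be as simple as the hypothetical helper below (the function name and rules are illustrative, not part of dataset): drop the leading dot, split on underscores, and capitalize each word.

```python
# Hypothetical helper sketching a dot path to label translation.
def default_label(dot_path):
    # ".gutenberg_href" -> "Gutenberg Href"
    return " ".join(part.capitalize() for part in dot_path.lstrip(".").split("_"))

print(default_label(".title"))           # Title
print(default_label(".gutenberg_href"))  # Gutenberg Href
```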
The frame name is the only information missing from the grid command and is easy enough to add to its syntax.
Defining the frame’s metadata manually could get cumbersome. It feels like a similar problem to defining search indexes. If we start with a rich context at grid creation, fleshing out the frame definition only requires adding the frame’s name.
A system of *frame*s could be stored alongside dataset’s collection.json file. This allows frame definitions to travel with the collection for later reuse or to be automatically refreshed.
{
"frame_name": ...,
"collection_name": ...,
"updated": TIMESTAMP,
"created": TIMESTAMP,
"labels": [ ... ],
"dot_paths": [ ... ],
"column_types": [ ... ],
"grid": [[ ... ], ... ]
}
sketch of frame structure
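Assembling that structure in plain Python might look like the sketch below. The field names mirror the sketch above; the helper function itself, its defaults for labels and column types, and the ISO timestamp format are all assumptions for illustration.

```python
import datetime

# Hypothetical constructor for the frame structure sketched above.
def make_frame(frame_name, collection_name, dot_paths, grid):
    now = datetime.datetime.now().isoformat()
    return {
        "frame_name": frame_name,
        "collection_name": collection_name,
        "updated": now,
        "created": now,
        "labels": [p.lstrip(".") for p in dot_paths],  # naive default labels
        "dot_paths": dot_paths,
        "column_types": ["string"] * len(dot_paths),   # default until set explicitly
        "grid": grid,
    }

frame = make_frame("titles-and-dois", "Pubs.ds", [".doi", ".title"],
                   [["10.1000/xyz123", "An Example Title"]])
print(sorted(frame.keys()))
```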
Frame definitions plus column type detection may allow for automatic index definition generation to be used by Bleve, Solr, and Lunr based search engines.
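A rough, engine-agnostic sketch of that idea: map each column type to a generic field definition keyed by the dot path. The mapping table and field attributes below are invented for illustration and do not match any particular engine’s schema.

```python
# Illustrative mapping from frame column types to generic index field hints.
TYPE_TO_FIELD = {
    "string": {"type": "text", "analyze": True},        # full-text analyzed
    "keyword": {"type": "text", "analyze": False},      # stored verbatim
    "number": {"type": "numeric"},
    "datetime": {"type": "datetime"},
    "geolocation": {"type": "geopoint"},
}

def index_definition(dot_paths, column_types):
    # Pair each dot path with the field hint for its column type.
    return {path.lstrip("."): TYPE_TO_FIELD[ctype]
            for path, ctype in zip(dot_paths, column_types)}

defn = index_definition([".doi", ".title"], ["keyword", "string"])
print(defn)
```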
Example creating a frame named “titles-and-dois”
dataset Pubs.ds keys >pubs.keys
dataset Pubs.ds frame titles-and-dois pubs.keys .doi .title
Or in Python
keys = dataset.keys('Pubs.ds')
frame = dataset.frame('Pubs.ds', 'titles-and-dois', keys, ['.doi', '.title'])
Example of getting the contents of an existing frame.
dataset Pubs.ds frame titles-and-dois
Or in Python
frame = dataset.frame('Pubs.ds', 'titles-and-dois')
Regenerating “titles-and-dois”.
dataset Pubs.ds reframe titles-and-dois
Or in Python
frame = dataset.reframe('Pubs.ds', 'titles-and-dois')
To regenerate the frame with an updated list of keys.
dataset Pubs.ds keys >updated.keys
dataset Pubs.ds reframe titles-and-dois updated.keys
In Python
updated_keys = dataset.keys('Pubs.ds')
frame = dataset.reframe('Pubs.ds', 'titles-and-dois', updated_keys)
Labels are represented as a JSON array; when we set the labels explicitly we’re replacing the entire array at once. In this example the frame’s grid has two columns.
dataset Pubs.ds frame-labels titles-and-dois '["Column 1", "Column 2"]'
In Python
err = dataset.frame_labels('Pubs.ds', 'titles-and-dois', ["Column 1", "Column 2"])
Column types are represented as a JSON array. Column types provide hints to the indexer when indexing a collection or frame. The standard JSON types are supported (e.g. string, number, object, list) plus keyword, datetime, and geolocation. In this example we will change the column types from ["string", "string"] to ["string", "keyword"].
dataset Pubs.ds frame-types titles-and-dois '["string", "keyword"]'
In Python
err = dataset.frame_types('Pubs.ds', 'titles-and-dois', ["string","keyword"])
Removing the frame “titles-and-dois”.
dataset Pubs.ds remove-frame titles-and-dois
Or in Python
err = dataset.remove_frame('Pubs.ds', 'titles-and-dois')
Listing the frames defined in a collection.
dataset Pubs.ds frames
Or in Python
frame_names = dataset.frames('Pubs.ds')
One possible use of frames would be in rendering search indexes like those used by Bleve or Lunrjs. A frame provides all the information needed for transforming the frame’s value (i.e. its grid) into minimalist documents for indexing. One workflow for creating a searchable collection might be
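For example, the grid-to-documents step could be sketched by pairing the frame’s labels with each grid row. The frame dict below is hand-built for illustration; its values are invented.

```python
# Hand-built frame fragment standing in for a real frame's labels and grid.
frame = {
    "labels": ["DOI", "Title"],
    "grid": [
        ["10.1000/182", "An Example Article"],
        ["10.1000/183", "Another Example"],
    ],
}

# Zip labels with each row to form minimalist documents for an indexer.
docs = [dict(zip(frame["labels"], row)) for row in frame["grid"]]
print(docs[0])  # {'DOI': '10.1000/182', 'Title': 'An Example Article'}
```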
Often when processing data it is useful to pull data into a grid format. dataset provides a verb “grid” for doing just that. Below we’re going to create a small dataset collection called grid_test.ds, populate it with some simple asymmetric data (i.e. each record doesn’t have the same fields) then turn this into a 2D JSON array suitable for further processing in a language like Python, R or Julia. The grid verb is available in the Python module for dataset so we’ll show that too. In both examples the JSON representing our raw data can be seen in the file data-grids.json.
From an existing collection, grid_test.ds, create a list of keys.
dataset grid_test.ds keys > grid_test.keys
We have the following keys in our collection: “gutenberg:21489”, “gutenberg:2488”, “gutenberg:21839”, “gutenberg:3186”, and “hathi:uc1321060001561131”. Let’s pick the first one and see what fields we might want in our grid (notice we’re using the -p option to pretty print the JSON record).
dataset -p grid_test.ds read "gutenberg:21489"
The fields that we’re interested in are “._Key”, “.title”, and “.authors”. Now that we have a list of keys and know the dot paths to the fields we’re interested in, we can create our grid.
dataset -p grid_test.ds grid grid_test.keys "._Key" ".title" ".authors"
The results are a 2D array with rows for each key and cells matching the contents of the dot paths. Note that a cell may have a complex structure like that shown with “.authors”
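Downstream code has to be prepared for those complex cells. A small sketch that flattens any cell to a display string; the family/given keys on the author objects are assumed purely for illustration, since the actual shape depends on your records.

```python
# Flatten a grid cell that may be a string, null, a dict, or a list of
# dicts (e.g. ".authors") into a single display string.
def cell_to_text(cell):
    if isinstance(cell, list):
        return "; ".join(cell_to_text(item) for item in cell)
    if isinstance(cell, dict):
        # Assumed keys for illustration only.
        parts = [cell.get("family"), cell.get("given")]
        return ", ".join(str(p) for p in parts if p)
    return "" if cell is None else str(cell)

authors = [{"family": "Verne", "given": "Jules"}]
print(cell_to_text(authors))  # Verne, Jules
```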
In this example we use the dataset Python module to read in our raw JSON data (e.g. data-grids.json) and convert it into a dataset collection called “grid_test.ds”. Next we’ll generate our set of keys and finally generate our grid as a Python list of lists.
import sys
import json
import dataset

# Read in our test data and convert from JSON into an array of dicts
f_name = 'data-grids.json'
with open(f_name, mode='r', encoding='utf-8') as f:
    src = f.read()
data = json.loads(src)

# create our collection
c_name = 'grid_test.ds'
err = dataset.init(c_name)
if err != '':
    print(err)
    sys.exit(1)

# load our test data
for key in data:
    rec = data[key]
    err = dataset.create(c_name, key, rec)
    if err != '':
        print(err)
        sys.exit(1)

# Create a list of keys and list of dot paths
keys = dataset.keys(c_name)
dot_paths = ["._Key", ".title", ".authors"]

# now we can create our grid
(g, err) = dataset.grid(c_name, keys, dot_paths)
if err != '':
    print(err)
    sys.exit(1)

# Now pretty print our grid
print(json.dumps(g, indent=4))