You have a dataset collection of JSON documents but the fields you’re interested in are nested. Using dataset and datatools’s jsonmunge you can reshape your existing collection’s record to the shape you prefer.
In this how to we will look at mondify a single record then once we have the record looking the way we want apply that transformation to the whole collection.
In our collection we have record with an id of 12345. Running dataset read 12345
we can see our record
looks like–
{
"title": "The wonderful world of data collecting",
"authors": [
{"family": "Brown", "given": "Jules"},
{"family": "Brown", "given": "Verne"}
]
}
What we’d like is a flattened version of the author names.
{
"title": "The wonderful world of data collecting",
"author_display_names": "Jules Brown and Verne Brown",
...
}
We’re going to pull out each others name object and then format them the way we prefer. jsonmunge lets us apply a Go text template to our JSON data and then output something. In our case our formatted names.
Generating our .author_display_names
field can be broken down into simpler parts. First we are going to look at formatting a single name
and then look at how to format both names and finally format an number of names. Inside .authors
array we have a name object. It has
.family
and .given
attributes. A simple template would reach in to the .authors
array by index and then order the .given
and
.family
attributes as desired. Array indexes count from zero so the first author’s index is zero. The template function dotpath lets
us reach inside the array.
Try this
dataset read 12345 | jsonmunge -i - -E '{{ dotpath . ".authors[0].given" "" }} {{ dotpath . ".authors[0].family" "" }}'
Let’s take this command pipeline apart. We retrieved our dataset record 12345 with dataset read 12345
. We send
that record to jsonmunge (-i -
is idiomatic of datatool commands for saying read from standard input since the record
should be coming from dataset’s standard output) and the -E
to
evaluate a simple template ordering out first author name.
Jules Brown
It’s a bit ugly (and long) but we can adapt that to display both names.
dataset read 12345 | jsonmunge -i - -E '{{ dotpath . ".authors[0].given" "" }} {{ dotpath . ".authors[0].family" "" }} and {{ dotpath . ".authors[1].given" "" }} {{ dotpath . "authors[1].family" "" }}'
getting
Jules Brown and Verne Brown
That command line is getting pretty long. Let’s take that expression and put it in a template file called “flatten.tmpl”.
{{ dotpath . ".authors[0].given" "" }} {{ dotpath . ".authors[0].family" "" }} and {{ dotpath . ".authors[1].given" "" }} {{ dotpath . "authors[1].family" "" }}
Run the template and see the results with
dataset read 12345 | jsonmunge -i - flatten.tmpl
We should again see
Jules Brown and Verne Brown
What happens for the next record where the number of authors is different? Looking at our original data we see that .authors
is an
array of objects. Go’s text templates have a function called range which makes it easy to iterate over arrays.
range can return the index value as well as the object at that index. Applying the range function would look like this
version of “flatten.tmpl”.
{{ range $i,$author := .authors }}
{{ if (gt $i 0) }} and {{ end }}
{{ $author.given }} {{ $author.family }}
{{ end }}
Running
dataset read 12345 | jsonmunge -i - flatten.tmpl
we get
Jules Brown
and
Verne Brown
That sorta gives us what we wanted but the spacing is all wrong and we have some extra line breaks. We could put all the template parts in
one line but that would make it hard to read and debug. Fortunately Go templates elements and indicate if leading or trailing whitespace should
be trimmed. You do that by using {{-
and -}}
for trimming leading and trailing whitespace. The revised template will look like
{{- range $i,$author := .authors }}
{{- if (gt $i 0) }} and {{ end -}}
{{- $author.given }} {{ $author.family -}}
{{- end -}}
and a running that through dataset read 12345 | jsonmunge -i - flatten.tmpl
gives us
Jules Brown and Verne Brown
Ok, so how does this help us reshape our origin 12345 record? Well first we need to turn our string “Jules Brown and Verne Brown” into an object. Updating our template the curly brackets and attribute nations gives us
{
"author_display_names": "{{- range $i,$author := .authors }}
{{- if (gt $i 0) }} and {{ end -}}
{{- $author.given }} {{ $author.family -}}
{{- end -}}"
}
Now running dataset read 12345 | jsonmunge -i - flatten.tmpl
gives us our new object.
{
"author_display_names": "Jules Brown and Verne Brown"
}
Now we ready to “join” our new object with the 12345 record. We can do that by extending our pipe line.
dataset read 12345 | jsonmunge -i - flatten.tmpl | dataset -i - join update 12345
We can check to make sure it worked with dataset read 12345
. You should see something like (order of attributes may vary)
{
"author_display_names": "Jules Brown and Verne Brown",
"authors": [
{ "family": "Brown", "given": "Jules" },
{ "family": "Brown", "given": "Verne" }
],
"title": "The wonderful world of data collecting"
}
Notice we now have a new .author_display_names
attribute in our object. We still see our old .authors
. The join update function
will not overwriting fields nor trim others. It adds the attributes of one object to the other.
Now let’s say you decide you’d rather have the names in “family, given” order for the individual names.
Using join overwrite we can replace the value in .author_display_names
with a new one.
flatten.tmpl should now look like
{
"author_display_names": "{{- range $i,$author := .authors }}
{{- if (gt $i 0) }} and {{ end -}}
{{- $author.family -}}, {{ $author.given -}}
{{- end -}}"
}
Running dataset read 12345 | jsonmunge -i - flatten.tmpl | dataset -i - join overwrite 12345
yields our new results
{
"title": "The wonderful world of data collecting",
"authors": [
{"family": "Brown", "given": "Jules"},
{"family": "Brown", "given": "Verne"}
],
"author_display_names": "Brown, Jules and Brown, Verne"
}
We can transform a single record but how about transforming the entire collction? That turns out to be easy we just loop over each key in the collection applying our pipeline.
dataset keys | while read K; do
dataset read "$K" | jsonmunge -i - flatten.tmpl | dataset join overwrite "$K"
done
Where we had “12345” before we now have "$K"
. The rest is just waiting on the computer to finish.