You have a dataset collection of JSON documents but the fields you’re interested in are nested. Using dataset and datatools’s jsonmunge you can reshape your existing collection’s record to the shape you prefer.
In this how to we will look at mondify a single record then once we have the record looking the way we want apply that transformation to the whole collection.
In our collection we have record with an id of 12345. Running
dataset read 12345
we can see our record looks like–
{
"title": "The wonderful world of data collecting",
"authors": [
{"family": "Brown", "given": "Jules"},
{"family": "Brown", "given": "Verne"}
]
}
What we’d like is a flattened version of the author names.
{
"title": "The wonderful world of data collecting",
"author_display_names": "Jules Brown and Verne Brown",
...
}
We’re going to pull out each others name object and then format them the way we prefer. jsonmunge lets us apply a Go text template to our JSON data and then output something. In our case our formatted names.
Generating our .author_display_names
field can be broken down into
simpler parts. First we are going to look at formatting a single name
and then look at how to format both names and finally format an number
of names. Inside .authors
array we have a name object. It has
.family
and .given
attributes. A simple template would reach in
to the .authors
array by index and then order the .given
and
.family
attributes as desired. Array indexes count from zero so the
first author’s index is zero. The template function dotpath lets
us reach inside the array.
Try this
dataset read 12345 | \
jsonmunge -i - -E '{{ dotpath . ".authors[0].given" "" }} {{ dotpath . ".authors[0].family" "" }}'
Let’s take this command pipeline apart. We retrieved our dataset
record 12345 with dataset read 12345
. We send that record to
jsonmunge (-i -
is idiomatic of datatool commands for saying read
from standard input since the record should be coming from dataset’s
standard output) and the -E
to evaluate a simple template ordering
out first author name.
Jules Brown
It’s a bit ugly (and long) but we can adapt that to display both names.
dataset read 12345 | \
jsonmunge -i - -E '{{ dotpath . ".authors[0].given" "" }} {{ dotpath . ".authors[0].family" "" }} and {{ dotpath . ".authors[1].given" "" }} {{ dotpath . "authors[1].family" "" }}'
getting
Jules Brown and Verne Brown
That command line is getting pretty long. Let’s take that expression and put it in a template file called “flatten.tmpl”.
{{ dotpath . ".authors[0].given" "" }} {{ dotpath . ".authors[0].family" "" }} and {{ dotpath . ".authors[1].given" "" }} {{ dotpath . "authors[1].family" "" }}
Run the template and see the results with
dataset read 12345 | jsonmunge -i - flatten.tmpl
We should again see
Jules Brown and Verne Brown
What happens for the next record where the number of authors is
different? Looking at our original data we see that .authors
is an
array of objects. Go’s text templates have a function called range
which makes it easy to iterate over arrays. range can return the
index value as well as the object at that index. Applying the range
function would look like this version of “flatten.tmpl”.
{{ range $i,$author := .authors }}
{{ if (gt $i 0) }} and {{ end }}
{{ $author.given }} {{ $author.family }}
{{ end }}
Running
dataset read 12345 | jsonmunge -i - flatten.tmpl
we get
Jules Brown
and
Verne Brown
That sorta gives us what we wanted but the spacing is all wrong and we
have some extra line breaks. We could put all the template parts in
one line but that would make it hard to read and debug. Fortunately Go
templates elements and indicate if leading or trailing whitespace should
be trimmed. You do that by using {{-
and -}}
for trimming leading
and trailing whitespace. The revised template will look like
{{- range $i,$author := .authors }}
{{- if (gt $i 0) }} and {{ end -}}
{{- $author.given }} {{ $author.family -}}
{{- end -}}
and a running that through
dataset read 12345 | jsonmunge -i - flatten.tmpl
gives us
Jules Brown and Verne Brown
Ok, so how does this help us reshape our origin 12345 record? Well first we need to turn our string “Jules Brown and Verne Brown” into an object. Updating our template the curly brackets and attribute nations gives us
{
"author_display_names": "{{- range $i,$author := .authors }}
{{- if (gt $i 0) }} and {{ end -}}
{{- $author.given }} {{ $author.family -}}
{{- end -}}"
}
Now running dataset read 12345 | jsonmunge -i - flatten.tmpl
gives
us our new object.
{
"author_display_names": "Jules Brown and Verne Brown"
}
Now we ready to “join” our new object with the 12345 record. We can do that by extending our pipe line.
dataset read 12345 | jsonmunge -i - flatten.tmpl | dataset -i - join update 12345
We can check to make sure it worked with dataset read 12345
. You should
see something like (order of attributes may vary)
{
"author_display_names": "Jules Brown and Verne Brown",
"authors": [
{ "family": "Brown", "given": "Jules" },
{ "family": "Brown", "given": "Verne" }
],
"title": "The wonderful world of data collecting"
}
Notice we now have a new .author_display_names
attribute in our object.
We still see our old .authors
. The join function will not overwriting
fields nor trim others. It adds the attributes of one object to the other.
Now let’s say you decide you’d rather have the names in “family, given”
order for the individual names. Using join -overwrite we can replace
the value in .author_display_names
with a new one.
flatten.tmpl should now look like
{
"author_display_names": "{{- range $i,$author := .authors }}
{{- if (gt $i 0) }} and {{ end -}}
{{- $author.family -}}, {{ $author.given -}}
{{- end -}}"
}
Running
dataset read 12345 | \
jsonmunge -i - flatten.tmpl | \
dataset -i - join overwrite 12345`
yields our new results
{
"title": "The wonderful world of data collecting",
"authors": [
{"family": "Brown", "given": "Jules"},
{"family": "Brown", "given": "Verne"}
],
"author_display_names": "Brown, Jules and Brown, Verne"
}
We can transform a single record but how about transforming the entire collction? That turns out to be easy we just loop over each key in the collection applying our pipeline.
dataset keys | while read K; do
dataset read "$K" | \
jsonmunge -i - flatten.tmpl | \
dataset join overwrite "$K"
done
Where we had “12345” before we now have "$K"
. The rest is just
waiting on the computer to finish.