Reindexing a field in Elasticsearch with curl and jq

Now that I’ve been converted over to functional programming thanks to Scala, I’ve had the profound realization (profound to me, probably obvious to most) that Unix has been doing functional stuff with piping since long before it was cool.

I’ve been working a lot with Elasticsearch over the past few years, and curl is a fantastic tool for interacting with REST endpoints.  The problem is that most endpoints these days use JSON, so in the past, if I had to manipulate the JSON in any way, I’d run to a real programming language and use a client API.  It turns out there’s this great Unix tool called jq that has virtually everything you need to manipulate JSON and makes it easy to pipe the result forward to other endpoints or whatever you are doing.  It’s functional in nature and really concise.  It has map, reduce, a ton of other built-in functions, and a compact syntax for filtering JSON and selecting elements.  It’s easy to work with arrays and objects and transform them, just about as easy as it is in JavaScript, maybe even easier in some cases.
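To give a taste of the syntax, here’s a tiny made-up example (the field names mean nothing, it’s just to show map and select at work):

echo '[{"name" : "a", "count" : 1}, {"name" : "b", "count" : 5}]' | \
jq -c 'map(select(.count > 2) | .name)'
# => ["b"]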

The first thing I used it for was adding a timestamp to the JSON returned from one REST call, so I could feed it into Elasticsearch:


curl -s http://localhost:9999/admin/offsets/fps_event/search-centralconfig_prod | \
jq --arg timestamp `date -u --iso-8601=seconds` '.timestamp = $timestamp' | \
curl -s -X PUT -H "Content-Type: application/json" -d @- http://localhost:9200/kafka-offset-log/offset/`date +"%s"`


What’s going on here?

I wrote an endpoint that returns some information from Kafka, but I wanted this data in Elasticsearch so I could make some cool graphs in Kibana.  There wasn’t a timestamp in my original design, though, and I really need one in Elasticsearch to do my time-based reports.

You can see that you can feed jq name/value pairs via --arg, so I used the Unix date command to format the date I needed.  Then it’s just a matter of '.timestamp = $timestamp' to add a new field; the dot is essentially a reference to the root of the JSON object. Did that just reinvent half of what Logstash does, in one line?  Don’t tell them, they might get mad.
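To make that step concrete, here’s a minimal sketch with a made-up input document (the lag field and the output timestamp are just illustrative):

echo '{"topic" : "fps_event", "lag" : 42}' | \
jq -c --arg timestamp "$(date -u --iso-8601=seconds)" '.timestamp = $timestamp'
# => {"topic":"fps_event","lag":42,"timestamp":"2016-05-04T17:05:00+00:00"}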

The next day, feeling pretty good about jq’s coolness, I ran into another problem where I needed to reindex some data in certain documents in Elasticsearch.  Once you have gobs of data in Elasticsearch you need to be smart about updates and reindexing, as things can take a long time.  In this case we needed to modify an existing field and add a new multi-field to it in order to improve some searches we were doing.  So first I used curl to close the index, adjust the settings on the index to add my new analyzer, then reopen the index and add the new mapping.  That’s basic stuff, all easily done with curl or whatever client API you like.
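For context, that prep step looks roughly like this with curl (the index name, type name, and analyzer definition below are placeholders, not the real ones from our cluster):

curl -s -XPOST http://localhost:9200/my_index/_close
curl -s -XPUT http://localhost:9200/my_index/_settings -d '{
  "analysis" : {
    "analyzer" : {
      "no_pad_analyzer" : { "type" : "custom", "tokenizer" : "keyword", "filter" : ["lowercase"] }
    }
  }
}'
curl -s -XPOST http://localhost:9200/my_index/_open
curl -s -XPUT http://localhost:9200/my_index/_mapping/my_type -d '{
  "properties" : {
    "FbKey" : {
      "type" : "string",
      "fields" : {
        "no_pad" : { "type" : "string", "analyzer" : "no_pad_analyzer" }
      }
    }
  }
}'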

But now I want to update the docs in the indices that had the problem, and only touch this one field.  Our docs are pretty big, and we don’t want to waste time reading and writing the whole thing.

Here’s the whole thing before I break it down:


updates="1"
while [ $updates -gt 0 ]
do
  updates=`curl -s http://localhost:9200/fps-fbs_c94c36b8-e425-4bb8-b291-4b34f96d74ca*/_search -d '{
    "size" : 500,
    "_source" : ["FbKey"],
    "query" : {
      "filtered" : {
        "query" : {
          "multi_match" : {
            "fields" : ["FbKey"],
            "query" : "0",
            "type" : "phrase_prefix"
          }
        },
        "filter" : {
          "missing" : {
            "field" : "FbKey.no_pad"
          }
        }
      }
    }
  }' | \
  jq -c '.hits.hits[] | { "update" : {"_index" : ._index, "_id" : ._id, "_type" : ._type}}, { "doc" : {"FbKey" : ._source.FbKey}}' | \
  curl -s -XPOST --data-binary @- http://localhost:9200/_bulk | \
  jq '.items | length'`
  echo "updated $updates records"
done


So first we have the query to find all the affected docs, by looking for docs that are missing the new field I added and have the unique data issue (a bunch of padded zeros) that we want to index better.  If you don’t know what Elasticsearch results look like, this will give you an idea:


{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 354173,
    "max_score": 1,
    "hits": [
      {
        "_index": "kafka-log",
        "_type": "kafka-event",
        "_id": "d7d19ddc418415d5b59efeb857c4339c3980b171870435abec443772c85921ac",
        "_score": 1,
        "_source": {
        }
      }
    ]
  }
}


Then jq comes in.

jq -c '.hits.hits[] | { "update" : {"_index" : ._index, "_id" : ._id, "_type" : ._type}}, { "doc" : {"FbKey" : ._source.FbKey}}'

That one line builds up the bulk update request, in this case 500 updates at a time.  So let’s look at what is going on here.  We grab the results from the search request as an array:

.hits.hits[]

Then we pipe that (all in jq) to an expression that builds up the bulk request.  The expressions like "._index" are actually selecting bits of data out of each hits.hits[] object as we pipe through it.

{ "doc" : {"FbKey" : ._source.FbKey}}

That part is where we update the affected field, so that Elasticsearch will reindex with our new analyzer and mappings configured.
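For a single hit, the compact (-c) output from jq ends up as a pair of lines roughly like this (the values here are shortened and made up), which is exactly the newline-delimited body the _bulk endpoint expects:

{"update":{"_index":"fps-fbs_c94c36b8-e425-4bb8-b291-4b34f96d74ca","_id":"d7d19ddc4184","_type":"fps-event"}}
{"doc":{"FbKey":"000000001234"}}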

We then feed all this jq business to the next curl cmd to send it into Elasticsearch as a bulk request.  At the end, ".items | length" looks at the result from the bulk cmd and tells us how many records we modified.
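If you haven’t seen a bulk response before, it looks roughly like this (one entry in items per action; the values here are made up), so counting that array is a quick way to see how many docs we just touched:

{
  "took" : 30,
  "errors" : false,
  "items" : [
    { "update" : { "_index" : "fps-fbs_c94c36b8-e425-4bb8-b291-4b34f96d74ca", "_type" : "fps-event", "_id" : "d7d19ddc4184", "_version" : 2, "status" : 200 } }
  ]
}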

So we just keep running this until we run out of docs to update, meaning we’ve fixed everything.

Pretty sweet way to manage your data in Elasticsearch with just curl and jq!  Now go get all Unix-y with your next REST/JSON administrative task.
