Reindexing a field in Elasticsearch with curl and jq

So now that I’ve been converted over to functional programming thanks to Scala, I’ve had the profound realization (profound to me, probably obvious to most) that Unix has been doing functional stuff with piping since long before it was cool.

I’ve been working a lot with Elasticsearch over the past few years, and curl is a fantastic tool for interacting with Rest endpoints.  But the problem is most endpoints these days use JSON.  So in the past, if I had to manipulate the JSON in any way, I’d run to a real programming language and use a client API.  It turns out there’s this great Unix tool called jq that has virtually everything you need to manipulate JSON and make it easy to pipe it forward to other endpoints or whatever you are doing.  It’s functional in nature and really concise.  It has map, reduce, a ton of other built-in functions, and a compact syntax for filtering JSON and selecting elements.  It’s easy to work with arrays and objects and transform them, just about as easy as it is in Javascript, maybe even easier in some cases.
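
If you’ve never seen jq, here’s a trivial, made-up taste of that functional flavor:

echo '[1, 2, 3]' | jq -c 'map(. * 2) | map(select(. > 2))'
# => [4,6]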

The first thing I used it for was adding a timestamp to the JSON returned from one Rest call, so I could feed it into Elasticsearch:


curl -s http://localhost:9999/admin/offsets/fps_event/search-centralconfig_prod | \
jq --arg timestamp `date -u --iso-8601=seconds` '.timestamp = $timestamp' | \
curl -s -X PUT -H "Content-Type: application/json" -d @- http://localhost:9200/kafka-offset-log/offset/`date +"%s"`


What’s going on here?

I wrote an endpoint that returns back some information from Kafka, but I wanted this data in Elasticsearch so I can make some cool graphs in Kibana.  But there wasn’t a timestamp in my original design, and I really needed that in Elasticsearch to do my time based reports.

You can see that you can feed it name/value pairs via --arg, so I used the unix date cmd to format the date I needed.  Then it’s a matter of just ‘.timestamp = $timestamp’ to add a new field.  The dot is essentially a reference to the root of the JSON object. Did that just reinvent half of what logstash does, in one line?  Don’t tell them, they might get mad.
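
To see that trick in isolation, here it is against a made-up document (the field names are just for illustration):

echo '{"topic": "fps_event", "offset": 42}' | \
jq -c --arg timestamp "$(date -u --iso-8601=seconds)" '.timestamp = $timestamp'
# => {"topic":"fps_event","offset":42,"timestamp":"2015-08-29T16:00:00+00:00"}  (yours will differ)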

The next day, feeling pretty good about jq’s coolness, I ran into another problem where I needed to reindex some data in certain documents in Elasticsearch.  Once you have gobs of data in Elasticsearch you need to be smart about updates and reindexing, as things can take a long time.  In this case we needed to modify an existing field and add a new multifield to it in order to improve some searches we were doing.  So first I used curl to close the index, adjust the settings on the index to add my new analyzer, then reopened the index and added the new mapping.  That’s basic stuff, all easily done with curl or whatever client API you like.
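
For reference, that dance looks roughly like this. The index, type, and analyzer names here are stand-ins, not our real ones; the FbKey/no_pad multifield is the one discussed below (in this era of Elasticsearch, analyzers can only be added to a closed index, hence the close/open):

# close the index so its analysis settings can be changed
curl -s -XPOST http://localhost:9200/my-index/_close
# add the new analyzer to the index settings
curl -s -XPUT http://localhost:9200/my-index/_settings -d '{
  "analysis" : {
    "analyzer" : {
      "no_pad_analyzer" : {
        "type" : "custom",
        "tokenizer" : "keyword",
        "filter" : ["lowercase"]
      }
    }
  }
}'
# reopen the index
curl -s -XPOST http://localhost:9200/my-index/_open
# add the new multifield to the existing FbKey mapping
curl -s -XPUT http://localhost:9200/my-index/_mapping/my-type -d '{
  "properties" : {
    "FbKey" : {
      "type" : "string",
      "fields" : {
        "no_pad" : { "type" : "string", "analyzer" : "no_pad_analyzer" }
      }
    }
  }
}'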

But now I want to update the docs in the indices that had the problem, and only touch this one field.  Our docs are pretty big; we don’t want to waste time reading and writing the whole thing.

Here’s the whole thing before I break it down:


updates="1"
while [ $updates -gt 0 ]
do
updates=`curl -s http://localhost:9200/fps-fbs_c94c36b8-e425-4bb8-b291-4b34f96d74ca*/_search -d '{
"size" : 500,
"_source" : ["FbKey"],
"query" : {
"filtered" :{
"query" : {
"multi_match" : {
"fields" : ["FbKey"],
"query" : "0",
"type" : "phrase_prefix"
}
},
"filter" : {
"missing" : {
"field" : "FbKey.no_pad"
}
}
}
}
}' | \
jq -c '.hits.hits[] | { "update" : {"_index" : ._index, "_id" :._id, "_type" : ._type}}, { "doc" : {"FbKey": ._source.FbKey}}' | \
curl -s -XPOST –data-binary @- http://localhost:9200/_bulk | \
jq '.items | length'`
echo "updated $updates records"
done


So first we have the query to find all the affected docs: it looks for docs that were missing the new field I added and that have the unique data issue (a bunch of padded zeros) we want to index better.  If you don’t know what Elasticsearch results look like, this will give you an idea:


{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 354173,
    "max_score": 1,
    "hits": [
      {
        "_index": "kafka-log",
        "_type": "kafka-event",
        "_id": "d7d19ddc418415d5b59efeb857c4339c3980b171870435abec443772c85921ac",
        "_score": 1,
        "_source": {
        }
      }
    ]
  }
}


Then jq comes in.

jq -c '.hits.hits[] | { "update" : {"_index" : ._index, "_id" : ._id, "_type" : ._type}}, { "doc" : {"FbKey": ._source.FbKey}}'

That one line builds up the bulk update request, in this case 500 updates at a time.  So let’s look at what is going on here.  We grab the results from the search request, as an array:

.hits.hits[]

Then we pipe that (all in jq) to an expression that builds up the bulk request.  The expressions like "._index" are actually selecting bits of data out of each .hits.hits[] object as we pipe through it.

{ "doc" : {"FbKey": ._source.FbKey}}

That part is where we update the affected field, so that Elasticsearch will reindex with our new analyzer and mappings configured.
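
Put together, each hit comes out of jq as a pair of lines in exactly the action/document format the bulk API wants, something like this (the values here are made up for illustration):

{"update":{"_index":"fps-fbs_c94c...","_id":"abc123","_type":"some-type"}}
{"doc":{"FbKey":"000000012345"}}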

We then feed all this jq business to the next curl cmd to send it into Elasticsearch as a bulk request.  At the end, ".items | length" looks at the result from the bulk cmd and tells us how many records we modified.
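
The bulk response is itself JSON with one entry per action, roughly like this (again, values are illustrative), so counting .items is a cheap way to know how much work was done:

{
  "took" : 30,
  "errors" : false,
  "items" : [
    { "update" : { "_index" : "fps-fbs_c94c...", "_type" : "some-type", "_id" : "abc123", "status" : 200 } }
  ]
}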

So we just keep running this until we run out of docs to update, meaning we’ve fixed everything.

Pretty sweet way to manage your data in Elasticsearch with just curl and jq!  Now go get all Unix’y with your next Rest/JSON administrative task.

Maven Release Plugin Horrors with git

If you have spent any time trying to use the Maven release plugin, you probably hate it.  It wants to run your build twice, and typically you have to wait 20 minutes to hours, depending on the size of your project, to figure out that it barfed, then try something else, and it barfs again.  At some point you either dig in really hard or start looking for other solutions.

As is just about typical with every major blocker I hit, just when I’m about to kill myself, my dog, or anyone that comes near me, I figure things out.

So my story is this: we have the Maven release plugin working great locally, but doing releases from laptops is not ideal, and of course on the weekend I needed to do a release and the VPN was painful to run the test suite over.  I decided it was about time to get this thing working in jenkins.  How hard could it be?  Throw it up there, wire in the maven goals, bada bing, right?

Not so much.

Ran the release goals, everything is awesome, then it goes to push the thing back into git:


[ERROR] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:2.5.2:prepare (default-cli)
on project trax-platform: An error is occurred in the checkin process: Exception while executing SCM command.
Detecting the current branch failed: fatal: ref HEAD is not a symbolic ref -> [Help 1]


I do some googling and find out that jenkins checks out git code into a detached HEAD, so you need to deal with that.  After a few failed attempts doing things that seemed much more complicated than they needed to be, I landed on this:

http://stackoverflow.com/questions/11511390/jenkins-git-plugin-detached-head

Basically you just need to pick the “check out to specific local branch” option and set your branch.  Grab a beer, wait another 25 minutes, and then bam!:


[ERROR] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:2.5.2:prepare (default-cli) on project trax-platform: Unable to commit files
[ERROR] Provider message:
[ERROR] The git-push command failed.
[ERROR] Command output:
[ERROR] fatal: could not read Username for 'https://github.com': No such device or address
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.


Well, this one got me stuck forever.  I was convinced it was some credential problem in jenkins, trying all sorts of things, messing around with the git plugin, credentials, you name it, nothing helping.  Started pondering on “this is why people hate Java”…

Then finally I was sitting there staring at that url,

could not read Username for 'https://github.com'

Thinking about where I had seen that before somewhere else.  Oh yeah, it’s in the maven pom.xml:


<scm>
<developerConnection>scm:git:https://github.com/TraxTechnologies/PlatformComponents/</developerConnection>
<tag>HEAD</tag>
</scm>


I changed the git url to this:


<scm>
<developerConnection>scm:git:git@github.com:TraxTechnologies/PlatformComponents.git</developerConnection>
<tag>HEAD</tag>
</scm>


And hell yeah, it finally worked.  It makes sense if you think about how git handles authentication: over https, git wants to prompt for a username and password, and a headless jenkins build has no terminal to prompt on (that’s the “No such device or address”), while the ssh url authenticates silently with the jenkins user’s ssh key instead.  Anyway, it was much pain and suffering, and I couldn’t believe stackoverflow and google let me down; it took way too much time to figure this out.  That is why I’m blogging about it, so some other poor sorry soul won’t have to kill their dog over it.
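
By the way, a quick way to sanity-check the ssh side before burning another 25-minute build is GitHub’s documented test command, run as whatever user the build runs as (I’m assuming the jenkins user here):

sudo -u jenkins -H ssh -T git@github.com
# => Hi <your-account>! You've successfully authenticated, but GitHub does not provide shell access.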

As a side note, I ran into this while looking for a solution:

http://axelfontaine.com/blog/final-nail.html

This guy’s approach seems pretty interesting.  It eliminates the two builds and seems really simple and straightforward to implement.  I was just about to try it out when I got past this.  But for sure, when I find some time, or next time this poops out, I might give that a go.

For the record, I have other builds using the SBT release plugin, and I went really deep on that too.  I have battle scars from that one as well, a tale for another day…