November 2015 – scalalala

So if you are new to Elasticsearch you may naively think it’s schemaless. You’ll start jamming stuff in all willy nilly and then eventually, either at index time or at query time, you’ll start running into some nasties: NumberFormatException, MapperParsingException, or SearchParseException.

You’ll trace these to mapping issues, which sometimes you can fix, if you reindex all your gobs of data, yuck! If you don’t know the types of problems I’m talking about, read this; that guy goes into some good examples that are representative of the types of things we’ve run into as well. The skinny is that Elasticsearch isn’t schemaless at all, quite the opposite. It tries to infer your types for you implicitly, which is convenient, but when it gets it wrong, you’ll find that the only way to fix it is to be explicit, which doesn’t sound like schemaless at all, does it?

A great example of when this can go wrong is with a float or double. If the first value you send for a field without a mapping is a zero, Elasticsearch will type that field as a long. Well, then a value of 1.2 comes along, and guess what, the type for that field was already selected as a long, so you just lost precision: it’s truncated to 1 in the index for this next doc.
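To make that failure mode concrete, here is a toy Python model of it. This is not Elasticsearch code: `infer_type` and `index_value` are made-up helpers that only mimic the "first value wins" behaviour described above, and the sketch assumes the default coercion that drops the fraction for integer-typed fields.

```python
# Toy model of "first value wins" dynamic typing. Not Elasticsearch code;
# the helper names are invented for illustration.

def infer_type(value):
    """Guess a field type from the first JSON value seen for a field."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def index_value(mapping, field, value):
    """Return the value as it would be indexed, fixing the type on first sight."""
    field_type = mapping.setdefault(field, infer_type(value))
    if field_type == "long" and isinstance(value, float):
        return int(value)         # the fraction is dropped in the index
    return value


mapping = {}
first = index_value(mapping, "price", 0)     # the field becomes a long
second = index_value(mapping, "price", 1.2)  # too late, this is indexed as 1
print(mapping, first, second)
```

Sending `0.0` first instead of `0` would have typed the field as a double, which is exactly why the order of your documents ends up mattering.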

We tried something similar to the Orchestrate solution, but what we found was that by moving everything into arrays you had to compromise on the aggregations you are able to do. There are certain types of things that you can’t do with arrays in Elasticsearch. For our use case this was a bit of a show stopper.

So here is what we did. It’s a combination of multifields, dynamic mappings, templates, and ignore_malformed.

You can use multifields to control the ways that you index a single field differently. Commonly you’ll see this used to store the raw not_analyzed data alongside the analyzed data.

	{
	  "tweet": {
	    "type": "string",
	    "analyzer": "english",
	    "fields": {
	      "raw": {
	        "type": "string",
	        "index": "not_analyzed"
	      }
	    }
	  }
	}

So with a mapping like that you can refer to the field as “tweet” when you want to search using the analyzer, or use “tweet.raw” for aggregations, or sorting, or other times when you don’t want to analyze the field.
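As a rough sketch of what that looks like in a request, the Python below builds a search body that matches on the analyzed `tweet` field while aggregating and sorting on `tweet.raw`. The function name `tweet_search_body` and the aggregation name `top_tweets` are made up for illustration; only the request body is built here, nothing is sent to a cluster.

```python
import json


def tweet_search_body(text, top_n=10):
    """Full-text match on the analyzed field, terms agg and sort on the raw one."""
    return {
        "query": {"match": {"tweet": text}},
        "aggs": {
            "top_tweets": {
                "terms": {"field": "tweet.raw", "size": top_n}
            }
        },
        "sort": [{"tweet.raw": {"order": "asc"}}],
    }


body = tweet_search_body("elasticsearch mapping")
print(json.dumps(body, indent=2))
```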

Dynamic mappings are ways you can apply mapping rules to fields that haven’t been mapped, based on name or type or other rules. In a sense it allows you to configure how the implicit type determination happens.
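Here is a small sketch of one such rule, built as a Python dict so it can be dropped into a mapping body. The rule itself is hypothetical (any new string field whose name ends in `_id` is kept unanalyzed), and `dynamic_template` is just a helper name I made up; the keys `match`, `match_mapping_type` and `mapping` are the ones the dynamic template syntax uses.

```python
import json


def dynamic_template(name, mapping, match="*", match_mapping_type=None):
    """Build one entry for a mapping's `dynamic_templates` list."""
    rule = {"match": match, "mapping": mapping}
    if match_mapping_type is not None:
        rule["match_mapping_type"] = match_mapping_type
    return {name: rule}


# Hypothetical rule: new string fields named like "*_id" are not analyzed.
ids_rule = dynamic_template(
    "ids_not_analyzed",
    {"type": "string", "index": "not_analyzed"},
    match="*_id",
    match_mapping_type="string",
)
type_mapping = {"dynamic_templates": [ids_rule]}
print(json.dumps(type_mapping, indent=2))
```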

Templates are just a convenient place to store your mapping rules so that they auto-apply to new indices. If you are creating a lot of indices this is a huge help: it simplifies your client code bases, as they won’t have to create or alter the mappings for indices anymore.
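A minimal sketch of the idea in Python: build a template body with an index name pattern, then check locally which index names it would pick up. The helper names are invented, and `fnmatchcase` is only a stand-in for Elasticsearch's own wildcard matching, so treat the check as an approximation.

```python
from fnmatch import fnmatchcase


def index_template(pattern, mappings, order=0):
    """Build a template body that applies `mappings` to matching new indices."""
    return {
        "order": order,
        "template": pattern,
        "settings": {},
        "mappings": mappings,
    }


def applies_to(template, index_name):
    """Approximate local check of the wildcard pattern against an index name."""
    return fnmatchcase(index_name, template["template"])


template = index_template("my_indices_*", {"_default_": {"dynamic_templates": []}})
for name in ["my_indices_2015_11", "other_index"]:
    print(name, applies_to(template, name))
```

The client that creates `my_indices_2015_11` doesn't need to know anything about the mapping, which is the whole point.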

The ignore_malformed setting is a way to tell Elasticsearch to swallow any errors if a value doesn’t fit the mapping, instead of blowing up. This one was the real key to our solution, because it means you can try to cast everything into every type in a multifield at index time; if a cast doesn’t work that subfield won’t be there, but nothing blows up.
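The difference is easy to see in a toy Python model. Again, this is not Elasticsearch code: `MalformedValue` and `index_as_long` are invented stand-ins, and Python's `int()` is not the parser Elasticsearch uses, so only the blow-up versus skip behaviour carries over.

```python
# Toy model of ignore_malformed on a long field. Not Elasticsearch code.

class MalformedValue(Exception):
    """Stands in for the mapping error raised on a bad value."""


def index_as_long(value, ignore_malformed=False):
    """Return the indexed long, None if the value was skipped, or raise."""
    try:
        return int(value)
    except (TypeError, ValueError):
        if ignore_malformed:
            return None           # the field is simply absent for this doc
        raise MalformedValue("cannot parse %r as long" % (value,))


print(index_as_long("42"))
print(index_as_long("forty-two", ignore_malformed=True))
try:
    index_as_long("forty-two")
except MalformedValue as err:
    print("rejected:", err)
```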

So putting that all together you get something like this:

	{
	  "template1": {
	    "order": 0,
	    "template": "my_indices_*",
	    "settings": {},
	    "mappings": {
	      "_default_": {
	        "dynamic_templates": [
	          {
	            "all_strings": {
	              "mapping": {
	                "index": "analyzed",
	                "type": "string",
	                "fields": {
	                  "raw": {
	                    "index": "not_analyzed",
	                    "type": "string"
	                  },
	                  "date": {
	                    "index": "not_analyzed",
	                    "ignore_malformed": true,
	                    "format": "yyyy-MM-dd HH:mm:ss||yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||yyyy-MM-dd||yyyy-MM-dd'T'HH:mm:ssZ",
	                    "type": "date"
	                  },
	                  "double": {
	                    "index": "not_analyzed",
	                    "ignore_malformed": true,
	                    "type": "double"
	                  },
	                  "long": {
	                    "index": "not_analyzed",
	                    "ignore_malformed": true,
	                    "type": "long"
	                  }
	                }
	              },
	              "match_mapping_type": "string",
	              "match": "*"
	            }
	          }
	        ]
	      }
	    },
	    "aliases": {}
	  }
	}

So what this means is that you can essentially autocast any field that arrives as a string to a long, double, or date (add other types if you need them), or use its raw value, in any query, filter, or aggregation:

tweet
tweet.long
tweet.double
tweet.date
tweet.raw
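For instance, a search body along these lines filters on the `.double` view of one field and buckets on the `.date` view of another. This is a sketch under assumptions: the field names `price` and `created` are hypothetical, `typed_search_body` is a made-up helper, and the `filtered` query is the older query syntax that goes with `string`/`not_analyzed` mappings like the ones above.

```python
import json


def typed_search_body(number_field, date_field, min_value, interval="month"):
    """Range-filter on one field's .double view, bucket by another's .date view."""
    return {
        "size": 0,
        "query": {
            "filtered": {
                "filter": {
                    "range": {number_field + ".double": {"gte": min_value}}
                }
            }
        },
        "aggs": {
            "per_interval": {
                "date_histogram": {
                    "field": date_field + ".date",
                    "interval": interval,
                }
            }
        },
    }


body = typed_search_body("price", "created", 9.99)
print(json.dumps(body, indent=2))
```

Docs where `price` didn't parse as a double simply have no `price.double`, so they fall out of the range filter instead of breaking the query.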

Why is this useful? Well, aside from dealing with all the various types of mapping errors you will run into on gobs of data, it’s really helpful in analysis.

One of our use cases for Elasticsearch is doing early ad hoc data analysis, to figure out what we have before we know what we have. We don’t know what types we might be getting, and in fact we are often looking for patterns of reliability or data quality in the data, so being able to quickly determine when we can and when we can’t cast into specific types is very informative. Using aggregations on these typed multifields allows us to do that. This one mapping will do this for any deeply nested docs and still allows you to run complicated aggs and filters on your nested data.
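One way to sketch that kind of data quality check is to ask for a `value_count` on each typed subfield and compare it to the count on `.raw`. The helpers below are made up for illustration, and the counts fed to `cast_ratios` are invented sample numbers, not real results.

```python
def castability_aggs(field, subfields=("raw", "long", "double", "date")):
    """Build a search body that counts how many values exist per typed subfield."""
    return {
        "size": 0,
        "aggs": {
            sub: {"value_count": {"field": "%s.%s" % (field, sub)}}
            for sub in subfields
        },
    }


def cast_ratios(counts):
    """Turn per-subfield counts into fractions of the raw count."""
    total = counts.get("raw", 0)
    if total == 0:
        return {}
    return {sub: n / total for sub, n in counts.items() if sub != "raw"}


body = castability_aggs("tweet")
# Invented counts, standing in for the values an aggregation response would carry.
sample_counts = {"raw": 200, "long": 150, "double": 180, "date": 0}
print(cast_ratios(sample_counts))
```

A field where the `long` and `double` ratios sit near 1.0 is a number in disguise; one where everything but `raw` is near zero really is free text.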

Typing Utopia in Elasticsearch