Embedded data meets big data – SQLite in Spark, are you crazy?

Every new runtime or distributed system I go deep on takes me to the edge of sanity until I punch through. My recent experience with Spark was no different, but I finally punched through.

We have an existing ETL system that pulls views from our databases. Out of the top 10 list, we use 8 of them. Most of these views are 500,000 rows or less (95%), but the top 5%, which get used the most, may have as many as 10,000,000 rows. 100 or more columns is normal as well. These are wide, flattened tables. People use them for downloads and offline analysis, and for simple introspection in a web view, but they're also starting to seed our data lake.

When we explored building the first generation of this system, we looked at all the popular open source databases for speed at arbitrarily filtering, grouping, and sorting wide tables up to 10 million records and 200 columns wide. We looked at PostgreSQL, MySQL, and SQLite. We use Elasticsearch quite a bit, but ruled it out for this use case due to cost (lots of RAM) and write time; those are things we maybe could have overcome, but we decided not to. We wanted to avoid creating indices because we didn't know how users were going to look at the data. Our API is similar to Elastic's, where you can basically filter, sort, and group on any dimension. SQLite won the war: if you turn off transactions and journaling, you can get amazing performance for what we needed. For nearly everything, even on commodity boxes, you can see sub-second response times across the board.
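To make that concrete, here is a minimal sketch of the kind of durability settings we turn off for a bulk, write-once load. The file path is made up, and the exact pragmas you want will depend on your workload.

```scala
import java.sql.DriverManager

// Minimal sketch: open a local SQLite file over JDBC and relax durability
// for a bulk, write-once load. The path is illustrative.
val conn = DriverManager.getConnection("jdbc:sqlite:/tmp/wide_view.db")
val stmt = conn.createStatement()
stmt.execute("PRAGMA journal_mode = OFF")  // no rollback journal
stmt.execute("PRAGMA synchronous = OFF")   // don't fsync on every write
conn.setAutoCommit(false)                  // batch all inserts into one commit
```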

The first version used Storm to extract the data into Cassandra. The next version was going to use Spark to extract into Parquet and SQLite. We'd have Spark simply prepare our SQLite files for us, so that we could drop them in S3 and have the API layer pick them up and do its thing. This saved us the cost of running and managing Cassandra, and started to prepare our data lake with Parquet files, which, from Spark to Presto, is where you want to be.

Spark cut the lines of code in half because it has drivers for all our data sources, and we can avoid the giant pattern matching blocks that deal with typing issues; the dataframe just does this for you. In addition, with clever partitioning of our extraction we were able to get more parallelism out of the source. We saw extraction rates go from 1,500-3,000 rows per second to 8,000-30,000 rows per second depending on the width of the table. That's a huge improvement; we can get tens of millions of rows out in minutes. Less code, fast, strategic for the future, good stuff!
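Roughly, the partitioned read looks something like the sketch below. The connection URL, table, and partition bounds are placeholders; the point is that Spark opens numPartitions connections in parallel, each pulling its own slice of the source.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("view-extract").getOrCreate()

// Sketch of a partitioned JDBC read: numPartitions parallel connections,
// each pulling one slice of the id range. URL, table, and column names
// are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://source-db:5432/reporting")
  .option("dbtable", "wide_view")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "20")
  .load()
```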

Now the bad…

Everything was going swimmingly, until we started testing the larger datasets and saw the SQLite creation part stalling in Spark. No errors; the job would just hang forever and have to be killed. Sometimes a job would work, but usually not. We got stuck. Enter the tuning and runtime learning process…

When you submit a job to YARN, you start learning about executors and cores. I found the vernacular here confusing, because an executor is a process, and a core isn't a CPU, it's a task, or a thread. So when you tell spark-submit you want --executor-cores 5 and --num-executors 4, you are requesting 4 JVM processes that will each run basically 5 threads. From my familiar Storm world this translates to workers (executors) and tasks (cores). There is also a driver process that is separate from the executors; more on that later…
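As a point of reference, that request looks something like this (the class and jar names are made up):

```
# Requests 4 executor JVMs, each running up to 5 tasks (threads) at a time.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 5 \
  --class com.example.ViewExtract \
  view-extract.jar
```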

I read everything I could about that and tried experimenting with the number of each, but couldn't move the dial on my problem. The size of the cluster in terms of nodes, CPUs, executors, or cores didn't make any difference; the problem still occurred.

It felt like a memory problem; maybe the system was tied up spinning in GC hell. The Spark UI reports some of this kind of thing, but nothing glaring was showing up there or in various metrics. Regardless, I started experimenting with the --executor-memory setting. But again, that didn't move the ball.

So let’s talk about dataframes and storage for a minute…

When you pull data into a Spark dataframe or RDD, you are basically pulling it into memory across your cluster, with pieces of the dataframe spread out among your executors. So your data is not all on one machine. When you ask Spark to filter or group or otherwise transform your dataframe, Spark farms this work out to the various executors so they can all work on the piece of the dataset they own.
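For example, something like this (column names are illustrative) runs entirely out on the executors; nothing comes back to the driver until you ask for it:

```scala
import org.apache.spark.sql.functions.col

// Planned on the driver, executed in parallel on the executors that hold
// the dataframe's partitions. Column names are illustrative.
val summary = df
  .filter(col("status") === "active")
  .groupBy(col("region"))
  .count()

println(s"partitions spread across the cluster: ${df.rdd.getNumPartitions}")
```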

When you run Spark in cluster mode you are relying on HDFS storage; if you use EMR in AWS like we do, you can make this end up in S3. But regardless, in a cluster environment where you are writing files, you have to have some type of shared network storage solution. Now come back to SQLite. SQLite drivers don't understand HDFS; you can't talk to it directly, you need to write to local storage.
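The contrast looks roughly like this (the bucket and paths are made up):

```scala
// Parquet output can point at shared storage every executor can reach...
df.write.parquet("s3://my-bucket/lake/wide_view/")

// ...but a SQLite JDBC URL is always a file on one machine's local disk.
val sqliteUrl = "jdbc:sqlite:/mnt/local/wide_view.db"
```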

So the situation here is that you need to bring all your data back to one node in order to write this local file. No problem: Spark will let you do a collect() or a toLocalIterator() that brings all the data back to the driver. This is the piece that would stall; the work that happened after the collect(). Well, here is where my stupidity came into play. It turns out all my fudging with the executor memory was doing nothing about the driver memory. As it turns out, the default driver memory is only 500MB. That's when I realized, omg, I'm an idiot. There is another spark-submit option for the driver memory, which for whatever reason doesn't seem to end up in the examples or Stack Overflow answers on this subject (hence this blog…). Once I set this to something large enough to handle the SQLite data, my problem went away; see the full example that follows.
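Here's a hedged sketch of what the fix ends up looking like. The memory sizes, paths, table name, and class name are all placeholders; the key piece is --driver-memory, plus streaming the rows back with toLocalIterator() and batching them into the local SQLite file.

```
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 4 \
  --executor-cores 5 \
  --executor-memory 6g \
  --driver-memory 8g \
  --class com.example.ViewExtract \
  view-extract.jar
```

And on the driver side, the write to the local SQLite file is roughly:

```scala
import java.sql.DriverManager
import scala.collection.JavaConverters._

// Sketch: stream rows back to the driver and bulk-insert them into a local
// SQLite file. Assumes the target table already exists and matches the
// dataframe's schema; the path and table name are illustrative.
val conn = DriverManager.getConnection("jdbc:sqlite:/mnt/local/wide_view.db")
conn.createStatement().execute("PRAGMA journal_mode = OFF")
conn.createStatement().execute("PRAGMA synchronous = OFF")
conn.setAutoCommit(false)

val cols = df.columns.length
val placeholders = Seq.fill(cols)("?").mkString(",")
val insert = conn.prepareStatement(s"INSERT INTO wide_view VALUES ($placeholders)")

var n = 0
for (row <- df.toLocalIterator().asScala) {
  for (i <- 0 until cols) insert.setObject(i + 1, row.get(i).asInstanceOf[AnyRef])
  insert.addBatch()
  n += 1
  if (n % 10000 == 0) insert.executeBatch()  // flush periodically
}
insert.executeBatch()
conn.commit()
conn.close()
```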

So now I'm in the happy zone; my hundreds of views are happily extracting, and I can get onto the next step, which is incremental pulls from the source. Then I will go from 10 million to 100 million rows and beyond. I'll let you know how that part works out; it's only in my head right now…
