Before there was automatic back pressure in reactive/streaming systems, we spent a lot of time trying to tune our reads and our writes so that we wouldn't overrun our sinks (databases). At first it's a weird, unfamiliar place. A lot of us, like me, came from the web, where we were in this constant request/response cycle and the goal was always to go faster: support more load, quicker response times, etc. Then when you get to streaming patterns, you realize the goal ends up being to go slower. In many use cases you can read way faster than you can write, and you end up killing your stores, or spending more money. I think a few times I came pretty close to melting CPUs. Now yeah, certain stores handle fast writes better than others, but let's face it: if you have a file with a bazillion records and you are just reading it and sending it to a store with a whole bunch of threads, there is absolutely no doubt that the read is going to be faster than the write.
So what happens here is you have to tune your inputs to slow down to whatever rate your particular storage solution can handle. This was a painful, slow process of trial and error: turning knobs here and there and staring at dashboards and metrics. You would go from slow, logical, carefully scientific procedures to "F it, turn that shit up and just go fast." Eventually you would end up somewhere stable and fast enough. It took a long time; there had to be a better way.
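Just to make that "manually tuned input rate" world concrete, here is a minimal sketch of what that throttling tends to look like. Everything in it is hypothetical (the `TARGET_RECORDS_PER_SEC` constant, the `writeToSink` stand-in), but the constant is exactly the knob you end up turning by trial and error:

```java
// Hypothetical sketch of manual throttling: cap the read rate at whatever
// the sink can survive. TARGET_RECORDS_PER_SEC is the number you end up
// tuning by trial and error while staring at dashboards.
public class ThrottledCopy {
    static final int TARGET_RECORDS_PER_SEC = 5_000; // the magic number

    public static void main(String[] args) throws InterruptedException {
        long nanosPerRecord = 1_000_000_000L / TARGET_RECORDS_PER_SEC;
        long next = System.nanoTime();

        for (int i = 0; i < 100_000; i++) {   // pretend this is the bazillion-record file
            String record = "record-" + i;    // reading is cheap and fast
            writeToSink(record);              // writing is what falls behind

            next += nanosPerRecord;           // simple pacing: don't exceed the target rate
            long sleepMillis = (next - System.nanoTime()) / 1_000_000;
            if (sleepMillis > 0) {
                Thread.sleep(sleepMillis);
            }
        }
    }

    static void writeToSink(String record) {
        // stand-in for the real database write
    }
}
```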
At some point this idea of back pressure came around. From Akka to Flink, I think it's a natural idea in a lot of frameworks. The idea is that if we can detect that writes (or any downstream activity) are slow, we can slow down the reads. In effect, that should eliminate all this time spent tuning, and cursing, and staying up late. Well, I think I learned something today. Yes, it's true that frameworks like Flink have back pressure, and they will slow down inputs, but…
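The core mechanism is simple enough to sketch: put a bounded buffer between the reader and the writer, and let a full buffer block the reader. The toy below is just to illustrate the idea; it is not how Flink does it internally, and the names and sizes are made up:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy illustration of back pressure: a bounded buffer between reader and
// writer. When the writer falls behind, the buffer fills up and put(...)
// blocks, which slows the reader automatically -- no tuning knob in sight.
public class BackPressureDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1_000);

        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < 10_000; i++) {
                    buffer.put("record-" + i); // blocks when the buffer is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    buffer.take();
                    Thread.sleep(1); // simulate the slow sink (the database)
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.interrupt();
    }
}
```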
What if your source is keeping a connection open? Well, that means that as back pressure increases, the likelihood of that connection timing out increases. This is what happened to us. Our database was becoming the bottleneck because of deadlocks and a suboptimal write configuration, and that led to increased back pressure. One of our sources is S3. AWS S3 clients have a timeout: if you don't request any new data for long enough, the connection closes, and that exception can kill your pipeline. Yeah, there are retries and timeout settings, and yeah, you can tune those, and struggle with how to get your framework to set them (sometimes it's not as easy as you think). But aren't you right back to tuning? Did that back pressure actually solve your problem, or did it just move it somewhere else?
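For reference, this is roughly what that "tune the timeouts and retries" step looks like with the v1 AWS SDK for Java. The values here are invented, and actually getting your streaming framework to hand this configuration to its S3 source is the part that's not as easy as you'd think:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Rough sketch of tuning the S3 client itself. The numbers are made up;
// the point is that you're back to picking magic values, just for the
// source this time instead of the sink.
public class TunedS3Client {
    public static AmazonS3 build() {
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(10_000)  // ms to establish a connection
                .withSocketTimeout(120_000)     // ms of silence tolerated before the connection drops
                .withMaxErrorRetry(10);         // retries before the exception escapes and kills the job

        return AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```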
Now sure, we can find a way to engineer things to not keep connections open, or otherwise really solve these problems, but the fact remains you still need to tune the whole thing end to end. Every little setting or optimization matters. These things lead to real costs, especially in the cloud. It's far too easy, and the cloud providers are super happy, to just give you more RAM and more CPU. Engineers, if the people with the purse strings let them, are often happy to just spend more. In the end this back pressure thing, while I think a great idea, isn't a silver bullet; you still have to tune your inputs and outputs. It's not just about making it stable or making it fast, it's about cost. That's the bottom line: as a developer you are there to create value. You get paid to turn X resources (the amount they pay you) into X*Y, and you want your Y to be as high as possible. Don't just roll over and believe back pressure is gonna self-tune your stuff; there is still hard work to do. We haven't arrived at the place where streaming systems tune themselves, and we are still far away from that. But it's a fun journey, enjoy the ride!