Building reliable services is not an easy task, and when you set out to create a resilient microservices-based infrastructure, things don’t get any simpler.
As part of this blog post series I’d like to introduce common concepts, clarify misconceptions, and generally help you build better software. In the previous blog post I talked about the importance of understanding that you cannot build a failure-free system. Failures are going to happen; think about them early.
Zombies are Alive
Communication is hard, not only between individuals, but also between microservices. A lot can go wrong, from hardware or software failures to datacenter issues or plain human error. Nobody’s perfect. Nothing’s perfect.
For highly available microservices environments, it is common knowledge to run more than one instance of a service. In the best case, they are fronted by a load balancer (yet another topic worth a blog post all in itself). The load balancer then takes care of routing our requests to the independently running instances. Instances should be distributed across multiple data centers and/or regions, just to be on the safe side.
Still, it can happen that the service we’ve been routed to is marked as “alive” but is in a so-called zombie state. This happens when the process is still alive, meaning its PID (process id) is available, but for some reason (like an endless loop) it no longer responds to requests.
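A quick way to see why “the PID exists” is a poor liveness signal, sketched in Python (POSIX-only; the helper name is made up for illustration):

```python
import os

def process_alive(pid):
    """True if a process with this PID exists on a POSIX system.

    Note: this only checks process existence, not responsiveness --
    a zombie-state service passes this check while ignoring requests.
    """
    try:
        os.kill(pid, 0)  # signal 0 performs an existence/permission check only
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists, we just may not signal it
```

A real health check should therefore probe the service’s behavior, for example an HTTP health endpoint with a short timeout, rather than the process table.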
In that case, our request will eventually time out, or we’ll get an error message from some proxy or the load balancer itself.
The Cloudflare timeout pages, shown when an origin server doesn’t respond within a certain period, are a great example of this.
Cutting off Power
To mitigate issues like that, we can use a pattern called Circuit Breaker (CB), a concept familiar from the fuse boxes in our homes.
The idea is simple: if a service produces errors, you cut it off before even trying to contact it. This is normally implemented with a threshold, since a single, occasional error can always happen.
The flow of a CB is as follows:
Service A sends a request to service B. B, however, has some issue and cannot respond, and the request times out. Service A may retry, as it could be only a temporary problem. After a few more failed requests, the error threshold of the CB is reached and the “switch” is flipped, meaning no further requests will be routed to that service for the time being.
Over time, the CB will periodically try to forward requests to service B to see if it responds again. If there are still errors, it stays in the flipped state.
With that in mind, most CB implementations let you configure fallback logic, which kicks in when the switch flips and is used as long as the breaker remains in the flipped state.
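As an illustration, the flow above could look roughly like the following minimal sketch; all names, defaults, and thresholds are made up for the example, and a real implementation (or an existing library) would handle concurrency and more error types:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch; thresholds and names are illustrative."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, fallback=None):
        self.failure_threshold = failure_threshold  # errors before the switch flips
        self.recovery_timeout = recovery_timeout    # seconds before probing again
        self.fallback = fallback                    # optional function used while flipped
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                # Still flipped: don't even try to contact the service.
                if self.fallback is not None:
                    return self.fallback(*args, **kwargs)
                raise RuntimeError("circuit open and no fallback configured")
            self.opened_at = None  # timeout elapsed: let one probe request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # flip the switch
            raise
        self.failures = 0  # a success resets the error count
        return result
```

Once the threshold is reached, callers get the fallback result immediately instead of waiting for yet another timeout.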
There are multiple ways to build a meaningful fallback for such situations. First, I’d normally recommend chaining at least two Circuit Breakers, which means that if the first one flips, the second one becomes active.
This is a common pattern when using multiple data centers or regions. The first CB is configured to always use the closest endpoint, for latency reasons. If that endpoint is not reachable, the second CB tries another endpoint running the same service, just in another, more distant location.
The trade-off here is low latency in normal situations at the price of additional round trips when something goes wrong. A pretty meaningful trade-off from my perspective.
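Sketched in code, the chaining could look like this; the endpoint callables are hypothetical stand-ins, and in practice each region would sit behind its own breaker:

```python
def call_with_regional_fallback(request, nearest, remote):
    """Try the closest endpoint first, fall back to the remote region.

    `nearest` and `remote` are placeholder callables wrapping
    region-specific endpoints; ConnectionError stands in for
    "this region's breaker is open or the endpoint is failing".
    """
    for endpoint in (nearest, remote):  # ordered by expected latency
        try:
            return endpoint(request)
        except ConnectionError:
            continue  # pay the extra round trip and try the next region
    raise RuntimeError("service currently unavailable, please come back later")
```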
But what happens in the (hopefully) unlikely case that both endpoints are broken, say after a failed deployment or a broken build?
Well, you either keep chaining (not always a good idea), or you give the users some feedback and let them know that the service is currently not available.
<side story>
Btw – true story – don’t tell the user “to retry”, because they will. Have you ever had this message presented to you? Did you find yourself bashing F5 (CTRL+R)? Me too.
Let the users know you monitor the systems. Let them know you’re working on fixing the issue and tell them to come back later. The important bit is later.
If you want to be more concrete, say 15 minutes, say an hour, just not now.
</side story>
Timeout for the Masses
Anyway, Circuit Breakers are a great tool to keep our services from repeatedly trying to reach broken endpoints, but that’s just one side of the coin.
Many services have SLOs (Service Level Objectives), which normally include a maximum response time. That means if our service is slow or doesn’t react, we might not have the time to wait for the request to time out.
So it seems we need something more, and the second major tool introduced in this blog post is timeouts. Timeouts are the Swiss Army knife for preventing breaches of time-based SLOs.
Imagine we have a time budget of 250ms to get a result from a service. Splitting it up helps us get the most out of it.
We have two service endpoints in different regions and give each endpoint about 100ms to answer. If the first one fails to respond fast enough, we kill off the request and contact the second endpoint, giving it another 100ms.
If this one fails too, we still have about 50ms left to come up with a solution before breaching the SLO.
In many cases, serving a slightly out-of-date value from a cache may be a viable solution. In other cases, we must produce an error message for both the logs and the user.
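The 250ms budget from above could be managed roughly like this; the `fetch` callable and endpoint names are assumptions made for the sketch:

```python
import time

TOTAL_BUDGET = 0.250  # 250ms overall, as in the example above
PER_TRY = 0.100       # roughly 100ms per endpoint attempt

def fetch_within_budget(endpoints, fetch, cache=None):
    """Give each endpoint a slice of the budget, then fall back to a cache.

    `fetch(endpoint, timeout)` is a hypothetical call that raises
    TimeoutError when the endpoint is too slow; `cache` is an optional,
    possibly slightly out-of-date value used in the remaining ~50ms.
    """
    deadline = time.monotonic() + TOTAL_BUDGET
    for endpoint in endpoints:
        timeout = min(PER_TRY, deadline - time.monotonic())
        if timeout <= 0:
            break  # budget exhausted before we could try this endpoint
        try:
            return fetch(endpoint, timeout=timeout)
        except TimeoutError:
            continue  # kill this attempt and move on to the next endpoint
    if cache is not None:
        return cache
    raise TimeoutError("no endpoint answered within the overall budget")
```

The key point is that the deadline is computed once for the whole operation, so retries can never push us past the overall budget.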
Since all coins have three sides (top, bottom, edge), there’s also a third important tool of the same kind: Back-off Algorithms. That topic, however, is more complicated and will get its own blog post.
Just to give a quick insight into the idea: imagine a full subway and having to wait for the next one. Some people might still squeeze in, but the majority won’t. A deeper introduction into Back-off Algorithms will follow though.
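To make the subway analogy slightly more concrete, here’s a tiny sketch of exponential back-off with jitter; the parameters are illustrative, and the full treatment follows in the dedicated post:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential back-off with full jitter; values are illustrative.

    Each retry may wait up to twice as long as the previous one (capped),
    and the random jitter spreads clients out so they don't all rush the
    next "subway" at the same moment.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```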
While waiting for the next entry in the series, you could already start monitoring your services and infrastructure with Instant’s 14-day free trial and see how new releases, code changes, and the ideas above influence their behavior.