Writing software services and applications has never been easier. An unbelievable number of tools, frameworks and programming languages are available to us as developers, and new ones show up every other week.
Even complicated topics, like Distributed Computing, are supported by more than just one library. In the best case we get a whole system of components and just plug them together like Lego™.
Resilient? To What?
Building a system which is not only scalable but also resilient to failure situations is another challenge, though.
Making a system resilient to any kind of failure is a problem that no single framework solves. Plenty of tools that help us get there, however, are available.
To understand what resilient means in the context of this post, we need to define our problem space. What are the potential failures we want to prevent?
First of all we begin with the simple power outage. It can happen due to an unplugged power cord, a hardware failure or a misbehaving emergency power system.
And since we are already talking about hardware, what about issues with switches, routers, and similar network components?
Furthermore, we want to prevent user impact whenever an operating system, orchestration layer or resource manager has an issue.
And finally, not that anyone’s software would have them, software bugs and failures. I mean, hey, it worked on my machine, didn’t it?
The Stages of Resiliency
Resiliency is a cross-cutting concern. From development, through operating the infrastructure, all the way down to the service or cloud providers, everything needs to be considered.
Development effort, together with a wise choice of supporting tools and frameworks, is one of the most important parts of a resilient system. Operations teams can work around and hide problems here, but the system will never really be resilient.
DevOps, Operations and Infrastructure teams, however, are no less important. Operating the infrastructure, preparing the runtime environments, automatically deploying or restarting services: all of this is necessary.
Automation is the key to a resilient and scalable system. Automate as much as possible. Automate everything you must do at least twice. It prevents errors from manual interaction, and it saves time, a lot of time. Tasks that had to be done twice normally come back to you or a colleague at some point.
The third stage is the underlying service infrastructure, meaning the datacenter, the internet connectivity, the network hardware.
People often leave this out of the picture, but from my perspective it’s not as obvious as it seems. Don’t use a single datacenter. Don’t use a single machine. Don’t use a non-redundant internet or network system.
When deploying your system into the cloud, make sure you run in multiple Failure or Availability Zones. Even better, run on multiple cloud providers.
Also think about using more than one DNS provider. DDoS attacks, like the one against the Dyn service in 2016, showed us how “easy” it is to make half of the internet disappear for hours. Be in the other half. And when selecting a DNS provider, also make sure they use an Anycast network to route their DNS traffic.
On the flip side, with cost in mind, always weigh the user impact of situation X against the cost of preventing it. Make meaningful choices and spend money where the impact is biggest.
Embrace the Failure
Image: “Automation” from xkcd
Before we dig deeper into the necessary pieces, there is one major point which is as cross-cutting as resiliency itself, our motto: “Embrace the Failure”.
Working with large, scalable systems means something will go wrong, somewhere, at any given moment. Failure is a part of your life and your business. Embrace it, learn to deal with it!
Developers, Developers, Developers
Develop any service with resiliency in mind. Especially in modern, scalable architectures using microservices, we have a lot of failure potential. Every single transition between services is also a network transition.
Networks tend to be unreliable for many reasons. Protocol implementations, buffer sizes, too many open ports (or files, thanks Linux), broken cables; there are just too many options to name them all.
Breaking the Circuit
To overcome these kinds of problems, circuit breakers are the tool to use. Circuit breakers help to switch to fallback services or information when a service you depend on has an outage.
A common use-case for them is to cut the line if a certain percentage of connections run into errors: errors like HTTP Status 500, errors like connection timeouts, errors like, well, you name it.
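To make the idea more tangible, here is a minimal sketch of a circuit breaker in Python. It is not taken from any particular library; the class name, the parameters (`failure_ratio`, `window`, `reset_after`) and the half-open trial call after the reset period are assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: opens when too many recent calls failed."""

    def __init__(self, failure_ratio=0.5, window=10, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_ratio = failure_ratio  # open once this share of recent calls failed
        self.window = window                # number of recent calls to look at
        self.reset_after = reset_after      # seconds to stay open before one trial call
        self.clock = clock                  # injectable so the breaker can be tested
        self.results = []                   # True = success, False = failure
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()           # open: do not touch the dependency at all
            # Reset period elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self._record(False)
            return fallback()
        self._record(True)
        return result

    def _record(self, ok):
        if self.opened_at is not None:
            # This was the half-open trial call.
            if ok:
                self.opened_at = None       # dependency recovered: close the circuit
                self.results = []
            else:
                self.opened_at = self.clock()  # still broken: stay open for another period
            return
        self.results = (self.results + [ok])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) == self.window and failures / self.window >= self.failure_ratio:
            self.opened_at = self.clock()
```

A caller would wrap each remote call, for example `breaker.call(load_from_service, load_from_cache)`, where both function names are placeholders. While the circuit is open, only the fallback runs, which gives the failing service time to recover.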
Another important bit is setting and managing timeouts and time budgets. Say you have multiple options to get data. Start with the database (to get live data), and give yourself and the request a timeout of 150 ms.
After the timeout is reached, you have two options: let the user know that they must retry later (graceful degradation), or send the request to an alternative (maybe not fully up to date) service, like a cache. Important: give the alternative a timeout too, 100 ms for example. Afterwards, kill the request and let the user know that something is very wrong.
If your time budget was 250 ms, you’ve still made it.
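The 150 ms plus 100 ms budget could be sketched like this in Python. The function names `read_database` and `read_cache` are hypothetical placeholders for whatever your data sources are, and the thread pool is just one way to enforce a timeout. Note that Python cannot forcibly kill a running thread, so a timed-out call is abandoned rather than killed.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=8)


def call_with_timeout(func, timeout_s):
    """Run func in the pool and wait at most timeout_s seconds for its result."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    finally:
        future.cancel()  # only has an effect if the call has not started yet


def fetch(read_database, read_cache, db_timeout_s=0.150, cache_timeout_s=0.100):
    """Try live data first, then the cache, staying within a 250 ms budget."""
    try:
        return call_with_timeout(read_database, db_timeout_s), "live"
    except Exception:
        pass  # timeout or error: fall back to the alternative
    try:
        return call_with_timeout(read_cache, cache_timeout_s), "cached"
    except Exception:
        return None, "unavailable"  # tell the user something is very wrong
```

The second element of the returned tuple tells the caller how fresh the data is, so the user interface can degrade gracefully instead of failing outright.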
In general, graceful degradation is an important piece of the puzzle. If you really need a value from a service you depend on, but the request fails, it is a good idea to retry using a backoff algorithm.
This algorithm, commonly known as Exponential Backoff, is a way to spread processing load over time.
To visualize it a bit more: if the first request fails, let’s retry right away. If the request still fails on the second run, maybe the service is overloaded. So, let’s give it a bit of room to breathe. Send another retry only after 5 seconds. It still fails? Wait for 10 seconds before retrying one more time, and so on. You gradually increase the wait time between requests.
Obviously, it doesn’t work for all use-cases, but for machine-to-machine service communication this is an important bit to consider.
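The retry schedule described above (right away, then 5 seconds, then 10 seconds, and so on) might look like this in Python. The doubling factor, the 60 second cap and all function names are assumptions for this sketch, not a prescribed implementation.

```python
import time


def backoff_delays(retries, first_delay_s=5.0, factor=2.0, max_delay_s=60.0):
    """Yield the wait before each retry: 0, then first_delay_s, then growing up to a cap."""
    delay = 0.0
    for _ in range(retries):
        yield delay
        delay = first_delay_s if delay == 0.0 else min(delay * factor, max_delay_s)


def call_with_backoff(func, retries=5, sleep=time.sleep, **delay_options):
    """Call func once, then retry up to `retries` times with growing pauses."""
    try:
        return func()
    except Exception as error:
        last_error = error
    for delay in backoff_delays(retries, **delay_options):
        sleep(delay)  # give the overloaded service room to breathe
        try:
            return func()
        except Exception as error:
            last_error = error
    raise last_error  # all retries used up: surface the last failure
```

With the defaults, `list(backoff_delays(5))` yields `[0.0, 5.0, 10.0, 20.0, 40.0]`. Many real-world setups also add a bit of random jitter to each delay, so that many clients do not all retry at the same instant.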
Btw, APIs often have something akin to a backoff algorithm, called Rate Limiting. The idea is similar.
Immutability and Idempotency
Another major factor in making systems not only resilient but also scalable is to use immutable data, meaning: never change data, only evolve it into new datasets. This is the very idea behind most functional programming languages.
Not changing data means no concurrency issues or locks when accessing the information from multiple threads, cores or physical systems.
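As a small illustration of "never change, only evolve", here is a sketch using a frozen dataclass in Python. The `Account` type and the `deposit` function are made up for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Account:
    """An immutable snapshot; every change produces a new snapshot."""
    owner: str
    balance: int
    version: int = 1


def deposit(account, amount):
    # Instead of mutating the account, derive a new one from it.
    return replace(account, balance=account.balance + amount,
                   version=account.version + 1)
```

Any thread still holding the old `Account` keeps seeing a consistent snapshot, so readers need no locks. Trying to assign to a field of a frozen instance raises an error instead of silently changing shared state.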
Together with idempotent operations (retriable calls), especially across services, this offers the benefit of literally just retrying a failed call. When a service call returned an HTTP Status 500, when did it fail? What data was already mutated? Can I retry? All of those questions, and more like them, disappear with immutability and idempotent operations.
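One common way to make a call retriable is to let the caller send a unique request ID and have the service remember which IDs it already handled. The sketch below assumes exactly that; `PaymentService`, `charge` and the in-memory dictionary are hypothetical, and a real service would need durable storage and an atomic check-and-write.

```python
class PaymentService:
    """Illustrative service whose charge() is safe to retry."""

    def __init__(self):
        self.processed = {}  # request_id -> result of the first successful attempt
        self.charges = []    # the side effects that actually happened

    def charge(self, request_id, customer, amount):
        # A retry with a known request_id returns the stored result
        # and does not charge the customer again.
        if request_id in self.processed:
            return self.processed[request_id]
        self.charges.append((customer, amount))
        result = {"request_id": request_id, "status": "charged", "amount": amount}
        self.processed[request_id] = result
        return result
```

If the response to the first attempt gets lost on the network, the client simply sends the same request ID again and receives the same answer, without having to guess what state the service is in.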
QA, DevOps and Operations
From my experience, resiliency seems to be better understood in Operations and DevOps teams. It is uncommon, though not unseen, that a team comes around with a single system running hundreds of services with no backup, redundancy or failover.
Still, it is important to stress that certain things should be considered when operating a system meant to be ultra-reliable.
Multiple data centers and Availability or Failure Zones were already mentioned. Redundancy, however, doesn’t end here. Clusters are important too.
Clusters of databases, clusters of Kubernetes, clusters of deployment units (e.g. services), clusters of load balancers, clusters of just everything. Do not keep single instances; a SPOF (Single Point of Failure) will bite you eventually.
Using clusters means you may have to rethink which technologies can be used. Oh yeah, Master-Follower (previously called Master-Slave) systems do count as clusters to me, at least when automatic failover is provided.
If you really cannot use a cluster or cluster-like system, back up often! Back up whenever possible, and in the best case have some kind of PITR (Point-In-Time Recovery), as PostgreSQL offers.
Anyhow, be prepared for failures. Remember to embrace the failure.
Oh, and since I cannot stress this enough: automate everything! Use orchestration, use repeatable builds, DO NOT deploy manually. Infrastructure-as-Code is your friend.
Behind Door #3
Our last stage, service providers, often needs the most investigation and planning. Changing service or cloud providers after the system is running can be tedious, complicated, time-consuming or simply impossible. Therefore, it is important to analyze the partner, especially in the context of resiliency.
Thankfully, redundancy of network infrastructure and internet backbone shouldn’t be the major question when selecting a cloud provider. If it is, stay away from them, as far as possible.
It is more important to plan out how to distribute your system to prevent single failure points, like a single DNS service provider, a single Load Balancer endpoint and things like that.
Most of the requirements for selecting a service provider are the same as for a deployment to your own infrastructure. Start planning as if you’d deploy to your own data center(s). It’s a great way to figure out the necessary redundancy points to take into account.
Wow, that was a lot to cover already. Still, it was only a subset of what would be worth mentioning.
Other topics include, but are not limited to, Continuous Integration, Deployment and Monitoring, caching of data, reactive architecture design, automatic scaling and restarting, root cause analysis, and many more.
Since every element that makes a service reliable and resilient is worth its own blog post, we decided to start a series: All Things Resiliency.
Over the next months we will publish blog posts on the topics mentioned above, but also on things not yet named.
Furthermore, I created a conference talk as a quick introduction to the topic of building a system for “Oops-Less Operation”. If you’re around CodeOne in San Francisco, Accento.dev in Karlsruhe, JCConf in Taipei or the JJUG’s CCC Fall in Tokyo, I’d love to see you in the session and would be happy to discuss. I bet there are more conferences or meetups to follow.
One Last Thing
Before I let you go after this lengthy text, there is one more important note: KISS, Keep It Simple and Stupid. Don’t fall into the trap of Premature Optimization!
Measure, test, optimize and don’t guess!