Developing complex systems is easier than ever before. Plenty of frameworks and component systems not only help developers while writing the code; they often also bring tools that help during the deployment and operation stages.
Containerization using Docker or similar technologies gives us the amazing possibility of deploying software to our local developer machine, to on-premises systems, and to the cloud in a simple, always similar fashion. Developing an application moves closer to how you’d run it in production.
On the other hand, Continuous Integration and Continuous Delivery, together with resource managers and orchestrators, bring repeatable builds and automatic deployments to our canary or production environments.
And finally, the cloud(s) bring far more physical hardware resources and better reliability. Not to forget: faster, more reliable, and redundant internet uplinks.
The world couldn’t be better, could it?
Failure is Part of Life
It could be, if not for system failures, software bugs, and network issues. Everything would be as awesome as in the Lego™ movie.
The unfortunate truth, though, is that failure is part of our daily business. Failure is part of our (professional) life.
Every day, thousands of hard disks or SSDs fail. Uplinks lose connectivity and traffic must be re-routed, or power is lost in data centers. Software bugs hit us hard, or a simple human error deletes all of a table’s content. Sh…tuff just happens.
It’s important to understand that every system has issues: some are homemade, some are beyond our control. Some crash applications, some crash NASA rockets.
Embrace the Failure
Giving up is not an option, though. The correct response to the only constant in life (next to pi), failure, is to attack it head-on.
It is important to understand that there will always be problems, and to embrace that fact: embracing in the sense of not being afraid of issues, not feeling the fear of problems, but thinking one or two steps ahead.
Because if you embrace the failure, you can do everything. Let me repeat and prove that with this small video: YOU CAN DO EVERYTHING!
Building Resilient Systems
Alright, back to the beginning: building reliable and resilient systems is hard.
The biggest part of making a system resilient is to never forget about possible failures and to plan how you could mitigate them.
If that seems impossible, work out the best and fastest way to find a problem, analyze it, and fix it.
And believe me, there are plenty of situations where it is literally impossible for you to prevent failures from arising. Or do you have your own power plant?
Mitigating failures involves a lot of planning ahead. It starts with the framework used for development and perhaps one or more separate cluster middleware solutions.
After that, it heads straight over to the different database technologies being considered. And let me remind you, a relational database might not always be the best option. If something doesn’t need ACID-compliant consistency, eventually consistent services might be perfectly fine; they are most commonly optimized for higher availability and run in clusters.
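Eventual consistency means a caller must tolerate stale or missing reads for a short window after a write. One common way to cope is to retry reads with exponential backoff until replication catches up. Here is a minimal sketch of that idea; the `fetch` callable and the use of `LookupError` for a not-yet-replicated read are illustrative assumptions, not any specific database API:

```python
import time

def read_with_retry(fetch, retries=5, base_delay=0.1):
    """Retry a read against an eventually consistent store.

    `fetch` is any callable that raises LookupError while the
    data has not yet been replicated to the node we read from.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except LookupError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            # exponential backoff: 0.1s, 0.2s, 0.4s, ...
            time.sleep(base_delay * 2 ** attempt)
```

The backoff matters: hammering a lagging replica in a tight loop only adds load to a system that is already behind.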
Onwards to resource management environments like Kubernetes, Docker, and similar systems: deployment should be fast, simple, repeatable, and, if possible, follow the same process wherever you deploy to. Abstractions like Kubernetes, OpenShift, or Cloud Foundry are perfect for that kind of scenario.
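To make “fast, simple, repeatable” concrete, this is roughly what such a deployment looks like as a Kubernetes Deployment manifest. The service name, image, and port here are hypothetical; the replica count and health probes are the parts that buy you resilience:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service        # hypothetical service name
spec:
  replicas: 3                   # survive the loss of a single pod or node
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout-service
          image: example.com/checkout-service:1.0.0   # hypothetical image
          livenessProbe:         # Kubernetes restarts hung containers
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:        # traffic is only routed to ready pods
            httpGet:
              path: /healthz
              port: 8080
```

The same manifest works against a local cluster, an on-premises cluster, or a managed cloud offering, which is exactly the “same process wherever you deploy” point.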
Finally, also think about where you deploy: on-premises systems and/or cloud providers, and how issues with hardware or cloud services could be mitigated.
For that reason, resilient systems sometimes use multiple environments at once, or support quick migrations from one environment to another.
That said, any system I’d build starting today would support multiple cloud environments as part of its basic DNA. Remember, Kubernetes and other resource managers ease this process to a level of “almost don’t care”.
Find, Analyze, Fix
To quickly discover issues or bugs in the system, the infrastructure, or the network, a good monitoring and performance management solution should be in place.
Obviously, I’d always recommend Instana. I’m not recommending Instana because I work for them; it’s more the other way around: I work for them because the system is amazing. The Swiss Army knife of monitoring.
The reason is simple: Automation.
Instana automates monitoring of the infrastructure, your cluster environments, and your applications or services. It dynamically recognizes started or stopped services and adjusts itself to the new situation. It instruments services on the fly (no code changes necessary for many programming languages) and not only captures traffic between services but also traces requests all the way through the system.
Anyhow, any monitoring platform is better than none.
Everything can be Awesome
To make sure our systems are fully working (at least from a user’s perspective) and our customers are happy, a lot of thought needs to go into the upfront planning and selection of tools. Oh, and stick to the plan!
When deployed, keep a close eye on the systems, the infrastructure, the applications.
And if you’re lucky and everything is green at some point in time, have a beer. Be quick, though; it’s not going to stay green for long.
Oh, and if I could get you interested in Instana (and if not, you should be, seriously), get a free 14-day trial right here. Install it in a test environment and let Instana do its (black) magic. You’ll be blown away, I promise.
Last but not least, remember: failure is always an option, just make sure nobody notices. That’s why we run clusters of things across multiple availability zones and use techniques like circuit breakers. More on those topics and techniques will be the focus of upcoming blog posts. Stay tuned.