Design for Scale: Identifying Performance Bottlenecks + Refactoring
Before any work can be done on improving the performance of our applications, we must first decide which areas we want to improve. When analyzing data collected in test or production, focus on metrics that help you judge whether your efforts to scale have been effective.
Can you find the bottleneck?
A bottleneck is the point in a system where contention occurs. In any system, these points usually surface during periods of high usage or load. Once identified, the bottleneck may be remedied, bringing performance back into an acceptable range. Synthetic load testing lets you exercise specific scenarios and identify potential bottlenecks, although it only covers contrived situations. In most cases it is better to analyze production metrics and look for outliers that signal trouble on the horizon.
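The synthetic load test described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the `send_request` function is a placeholder that simulates a fixed-latency call, and you would replace it with a real HTTP request against your own service.

```python
import concurrent.futures
import statistics
import time

def send_request():
    """Placeholder for a real HTTP call; simulates ~10 ms of work."""
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for network + server processing time
    return time.perf_counter() - start

def run_load_test(workers=20, total_requests=200):
    """Fire requests concurrently and summarize per-request latencies."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(lambda _: send_request(), range(total_requests)))
    return {
        "p50": statistics.median(latencies),
        "p99": latencies[int(len(latencies) * 0.99) - 1],
        "max": max(latencies),
    }

if __name__ == "__main__":
    print(run_load_test())
```

Watching how p99 and max latency grow as you raise `workers` is one simple way to expose the contention point in a contrived scenario.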
Key performance indicators from your application include requests per second, latency, and request duration. Indicators from the runtime or infrastructure include CPU time, memory usage, heap usage, garbage collection, and so on. This list isn't exhaustive; business metrics or other external metrics may factor into your optimizations as well.
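As a rough sketch of how application-level indicators like request duration can be captured, here is a small rolling-window recorder with a timing decorator. The names (`RequestMetrics`, `timed`) are illustrative inventions; in practice you would use an established metrics library rather than rolling your own.

```python
import time
from collections import deque

class RequestMetrics:
    """Keeps a rolling window of request durations for simple KPI snapshots."""
    def __init__(self, window=1000):
        self.durations = deque(maxlen=window)

    def observe(self, seconds):
        self.durations.append(seconds)

    def snapshot(self):
        """Return count, average, and p95 duration in milliseconds."""
        if not self.durations:
            return {}
        ordered = sorted(self.durations)
        return {
            "count": len(ordered),
            "avg_ms": 1000 * sum(ordered) / len(ordered),
            "p95_ms": 1000 * ordered[int(len(ordered) * 0.95) - 1],
        }

def timed(metrics):
    """Decorator that records each call's duration into a RequestMetrics."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                metrics.observe(time.perf_counter() - start)
        return inner
    return wrap
```

Wrapping a request handler with `@timed(metrics)` is enough to start watching how request duration shifts under load.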
Remedies may include any number of optimizations but usually come down to refactoring, caching and data optimization, or threading and workload distribution. We'll cover these topics in more depth later in the series; for now, before we dive into refactoring, it's worth noting that it is first on this list for a good reason.
Once a bottleneck has been identified, engineers tend to either scale the service horizontally or increase the resources allocated to the component. This should only be done after carefully determining that the bottleneck cannot be resolved through refactoring. For instance, a functional change might fix how the service queries the database, introduce a batching mechanism for multiple API queries, or add a cache layer to localize frequent high-latency requests.
As you build these capabilities into your application, you're shaping a component that can safely be replicated to handle additional workload. The refactoring work builds a high degree of confidence that systems won't be negatively impacted by the extra load; performance issues left unresolved will be multiplied as you scale your services out.
Once you've optimized your services and are ready to scale out, a new set of challenges emerges, which we'll cover in future blog posts. At the center of all of this is measuring results and having a monitoring solution that gives you the feedback needed to make informed decisions about future growth and to strategize around technology partners and implementations. In the next blog post we'll cover how Instana can help you analyze performance, refactor systems, and measure the results with incredible accuracy.