AWS Lambda, the serverless functions (or FaaS) offering from Amazon, continues to grow in usage, both overall and in production applications. One of the biggest challenges is how to trace and monitor application code that runs on production Lambda instances. This article discusses best practices for achieving observability of AWS Lambda serverless functions and putting it to use.
What is AWS Lambda?
According to the official site for Amazon AWS Lambda:
“AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume.”
Currently, AWS Lambda supports development and deployment of functions in a variety of programming languages, including Node.js, Go, Java and Python. And in a stroke of genius, Amazon also allows you to provide your own custom runtime, so that code written in other languages can run as a Lambda function.
The Value of AWS Lambda
With AWS Lambda, users can define functions that are executed on-demand, without having to worry about computing resources or what infrastructure is allocated to them. Abstracting away the infrastructure that runs your functions has a twofold goal:
Automate the scaling up and down of computing resources: As the load your application needs to serve grows on Monday morning, with people going back to their desks, AWS Lambda will automatically, behind the scenes, increase the number of instances of your function. And when the load goes down, at the end of the working day, under-utilized instances are automatically decommissioned. The promise of AWS Lambda is that none of this should concern you as a developer.
Eliminate unnecessary costs: Users pay only while their functions are serving requests, and incur no costs when there is no workload.
Lambda Execution and Monitoring
AWS Lambda functions are executed synchronously to serve a specific request, or asynchronously, triggered by an event. One example of synchronous invocation would be when the AWS Application Load Balancer or API Gateway receives an HTTP request that is mapped to a specific Lambda function. Asynchronous invocation in reaction to events is growing more and more common, as the already extensive list of events that can trigger AWS Lambda functions keeps growing. Some types of events stand out as the most widely adopted:
CloudWatch events, which can be defined using simple rules that describe changes in AWS resources
S3 events, which are emitted when objects are created or deleted in S3 buckets
SQS events, which pass messages queued in SQS to Lambda functions for processing
The orientation toward functions and events, and the fact that the function itself is the only operational concern, have led to the adoption of the Function-as-a-Service (FaaS) moniker to describe AWS Lambda and similar serverless services.
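To make the different triggers a little more tangible, here is a minimal Python sketch of a single handler that tells a few event shapes apart. The field names (`httpMethod`, `Records`, `eventSource`, `detail-type`) follow the event formats AWS documents for API Gateway or ALB requests, S3 and SQS records, and CloudWatch events; the function names `classify_event` and `handler`, and the return values, are made up for illustration.

```python
def classify_event(event):
    """Return a coarse label for the kind of trigger that produced `event`."""
    records = event.get("Records")
    if records:
        # S3 and SQS both deliver a list of records tagged with their source.
        source = records[0].get("eventSource")
        if source == "aws:s3":
            return "s3"
        if source == "aws:sqs":
            return "sqs"
    if "detail-type" in event and "source" in event:
        # CloudWatch events carry a source and a detail-type.
        return "cloudwatch-event"
    if "httpMethod" in event:
        # HTTP request forwarded by API Gateway (REST) or an ALB.
        return "http"
    return "unknown"


def handler(event, context):
    kind = classify_event(event)
    if kind == "http":
        # The caller is waiting for this response.
        return {"statusCode": 200, "body": "handled " + event["httpMethod"]}
    if kind == "s3":
        keys = [record["s3"]["object"]["key"] for record in event["Records"]]
        return {"kind": kind, "processed": keys}
    if kind == "sqs":
        bodies = [record["body"] for record in event["Records"]]
        return {"kind": kind, "processed": bodies}
    return {"kind": kind, "processed": []}
```

In practice you would more likely deploy one function per trigger; the single handler here only serves to show the shapes side by side.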
Top Lambda Use Cases
While there are many reasons why Dev and/or Ops teams choose to execute workloads on AWS Lambda (it has, after all, proven to be a flexible service that caters to several scenarios), these are the top two use cases by far:
Prototyping and early-stage development: The lack of upfront infrastructure costs has driven the adoption of AWS Lambda for prototyping new products and capabilities, especially in start-ups and smaller outfits that do not want to devote personnel and resources to maintaining host-centric infrastructure like Virtual Machines. As products grow in maturity and adoption, we see our customers move from AWS Lambda to other computing platforms, like EC2 or Fargate. Two reasons appear, above all others, to drive the shift away from Lambda:
Cost: compared to other computing platforms, AWS Lambda has considerably higher costs for running large, continuous workloads where Lambda’s “scale to zero” benefit is not applicable
Growing complexity: functions should be relatively small and straightforward. As the product grows, the increasing complexity results either in a large number of functions, loosely coupled with one another and hard to keep track of from an operational and architectural perspective, or in fewer functions that each become complex, with a lot of code running in them. And AWS Lambda does not shine in terms of debuggability.
Business process integration: The fact that Lambda functions can be triggered by a variety of events makes them natural candidates for integrating (business) processes between systems. For example, at Instana we use Lambda functions for all types of automation, from Quality Assurance tasks like automatically provisioning infrastructure to test the newest builds, to integrating our support portal with our project management systems, automatically creating work items for our engineers in response to support tickets opened by customers. The scale-to-zero capability of Lambda is particularly interesting in integration scenarios where the throughput of events is low (a few per minute or fewer).
There is also a growing interest in using AWS Lambda for machine-learning use-cases, especially in combination with Amazon SageMaker, but the cost aspect (similar to the discussion above) seems to be a limiting factor to the applicability of this scenario at scale.
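The cost argument is easier to weigh with numbers in hand. The sketch below estimates a monthly Lambda bill from request count, average duration and configured memory. The two prices are assumptions for illustration only (they ignore the free tier and vary by region and architecture), so check the current AWS price list before drawing conclusions; the function name `lambda_monthly_cost` is made up.

```python
# Illustrative prices only; not authoritative. Check the AWS price list.
PRICE_PER_MILLION_REQUESTS = 0.20    # USD, assumed
PRICE_PER_GB_SECOND = 0.0000166667   # USD, assumed


def lambda_monthly_cost(requests_per_month, avg_duration_ms, memory_mb):
    """Estimate the monthly cost in USD, ignoring the free tier."""
    # Compute is billed on duration multiplied by configured memory.
    gb_seconds = requests_per_month * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = requests_per_month / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND


# A low-throughput integration versus a busy, continuous workload,
# both at 200 ms average duration and 512 MB of memory.
low_volume = lambda_monthly_cost(100_000, 200, 512)
high_volume = lambda_monthly_cost(100_000_000, 200, 512)
print(f"low volume:  ${low_volume:.2f} per month")
print(f"high volume: ${high_volume:.2f} per month")
```

Under these assumed prices the low-volume case comes to roughly $0.19 a month and the high-volume case to roughly $187 a month. The second figure is the one to compare against what an always-on alternative would cost you for the same load.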
The Dark Side of Lambda
No computing paradigm comes without trade-offs and rough edges, and AWS Lambda has shown, so far, the following:
Hard to debug: by design, the infrastructure of AWS Lambda is run exclusively by AWS, leaving you with little control when it comes to debugging. AWS provides ways of running your Lambda code locally, but when you have issues in production, the out-of-the-box debugging functionality, for example logging to CloudWatch (a.k.a., “Cloud Printf” 🙂 ), is exceedingly tedious and laborious, especially if you are scrambling to fix an outage. And since the cost of an AWS Lambda invocation is directly tied to how long it takes to complete, bugs that get your function stuck, like looping over large database result sets, are both hard to debug and costly.
Distributed complexity: complex scenarios often involve large numbers of Lambda functions loosely coupled to one another through events, and finding out which functions were involved in serving which request, and what went wrong, can really be like trying to solve a million-piece jigsaw puzzle.
“Stateless” means you pull state from somewhere else: Lambdas, by their nature, are stateless. You cannot rely on any one Lambda function to retain state from the processing of the previous request. Thus, most Lambda functions need to load some state from another service, which may result in unpredictable execution times (it is input-output, and you may unknowingly be loading a lot of data), and often need to store state modifications too (more unpredictable input-output). To be fair, input-output issues inside the AWS infrastructure seem exceedingly rare, but programming oversights like “pull half the RDS database” are not entirely uncommon.
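One defensive pattern against the "stuck in a loop over a huge result set" problem is to check how much time the invocation has left and stop early. The sketch below assumes the standard Python Lambda context object, whose `get_remaining_time_in_millis()` method reports the time left before the configured timeout; `process_rows`, `handle_row` and the one-second safety margin are hypothetical names and values chosen for illustration.

```python
def process_rows(rows, context, handle_row, safety_margin_ms=1000):
    """Process rows until done, or until the invocation is about to time out."""
    processed = 0
    for row in rows:
        # Stop before AWS kills the invocation, so progress can be reported.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return {"processed": processed, "complete": False}
        handle_row(row)
        processed += 1
    return {"processed": processed, "complete": True}
```

A caller can use the `complete` flag to decide whether to re-invoke the function for the remaining rows, rather than paying for a run that ends in a timeout with nothing to show for it.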
An honorable mention among Lambda issues goes to cold starts: as Amazon scales up the infrastructure that serves your Lambda functions, you may find that some requests have significantly higher latency (a.k.a., “response time”) than the rest because they were assigned to an instance that was not running yet and was still being initialized. Cold starts disproportionately affect use-cases where load comes in rapid bursts. To reduce the impact on latency-sensitive use-cases, frameworks like Serverless started providing means of keeping Lambda functions warm by invoking them regularly and, in so doing, preventing AWS from shutting down the running instances. In December 2019, Amazon introduced a capability called Provisioned Concurrency, which means you pay to keep functions warm, sacrificing in part or in whole the “scale-to-zero” aspect of AWS Lambda in exchange for more predictable latency.
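If you want to see how often cold starts hit your own function, a common trick is a module-level flag: code at module level runs once when a new instance is initialized, while the handler runs on every invocation. The sketch below is a minimal illustration of that idea; the names `_cold_start` and `_initialized_at` and the shape of the return value are made up.

```python
import time

# Module-level code runs once per freshly initialized instance.
_cold_start = True
_initialized_at = time.time()


def handler(event, context):
    global _cold_start
    was_cold = _cold_start
    _cold_start = False  # later invocations on this instance are warm
    return {
        "cold_start": was_cold,
        "instance_age_seconds": time.time() - _initialized_at,
    }
```

The same mechanism is why module-level variables can appear to keep state between requests on a warm instance. It is fine for caches and counters like this one, but nothing you can rely on, since any request may land on a brand new instance.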
Achieving Observability of AWS Lambda Functions
How Lambda makes traditional APM tools struggle
Traditional Application Performance Monitoring (APM) tools that leverage trace sampling or require manual instrumentation are ill-equipped to handle the unique challenges of monitoring Lambda, leaving too many monitoring gaps when applications include Lambda functions.
Sampling-based approaches record only a percentage of traces rather than every single one, because of the overhead of collecting the data, storing it, or very often both. While sampling is a legitimate strategy when applications consistently serve large workloads (say, thousands of requests a minute), the approximation and uncertainty it introduces are a no-go for most AWS Lambda monitoring use-cases. Sampling strategies that are not tailored to the specifics of AWS Lambda will fail to accurately represent the impact of cold starts on the overall end-user experience, especially for AWS Lambda functions that are not executed regularly or for workloads in which scale-ups are a rather common occurrence.
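A quick back-of-the-envelope calculation shows why rare events and sampling do not mix. Assuming a simple sampler that keeps each trace independently with a fixed probability (an assumption made here for illustration; real samplers can be smarter), the chance that none of a handful of cold-start requests ends up in the sample is the miss probability raised to the number of such requests. The function name `probability_all_missed` is made up.

```python
def probability_all_missed(sample_rate, rare_events):
    """Chance that uniform random sampling captures none of the rare events."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # Each rare event is missed independently with probability 1 - sample_rate.
    return (1.0 - sample_rate) ** rare_events


# 1% sampling and 10 cold starts in the period under investigation.
print(f"{probability_all_missed(0.01, 10):.3f}")
```

With 1% sampling and ten cold starts, the result is about 0.904: roughly nine times out of ten, not a single cold start would show up in the collected traces.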
Manual instrumentation makes the developers of the Lambda function responsible not only for the business logic, but also for the code that collects tracing data. In general, manual instrumentation requires significant toil and continuous work: as your code evolves, so must your instrumentation, continuously taking up precious developer resources. For the specific case of AWS Lambda functions, manual instrumentation can take up a significant amount of overall code. We’ve seen many instances where manual instrumentation was a significant portion (5% to 10%, or between one line in twenty and one line in ten!) of the overall code base, increasing overall development and maintenance costs.
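To give a feel for what manual instrumentation looks like in a function, here is a deliberately small, generic sketch using only the Python standard library. It is not the API of any particular tracing product: `traced`, `SPANS`, `load_order` and the span fields are all hypothetical, and the in-memory list stands in for a real tracing backend.

```python
import functools
import time

SPANS = []  # hypothetical stand-in for a tracing backend


def traced(name):
    """Record the duration and outcome of every call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise
            finally:
                duration_ms = (time.perf_counter() - start) * 1000
                SPANS.append({"name": name, "duration_ms": duration_ms, "error": error})
        return wrapper
    return decorator


@traced("load_order")
def load_order(order_id):
    # Business logic placeholder; imagine a database call here.
    return {"id": order_id, "total": 42}


@traced("handler")
def handler(event, context):
    order = load_order(event["order_id"])
    return {"statusCode": 200, "total": order["total"]}
```

Even in this toy version, the tracing code is several times the size of the business logic, and every new function needs its own `@traced` line that somebody has to remember to add and keep meaningful.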
AWS Lambda is not an island, and neither should its monitoring be
As AWS Lambda shines at integrating different systems, one key aspect is that Lambda functions seldom operate in isolation, or in concert only with other functions. Rather, virtually every AWS Lambda environment monitored by Instana sees the Lambda functions invoking or being invoked by more traditional software components. In other words, no Lambda function is an island; instead, Lambda functions need to be monitored end-to-end with the rest of the software ecosystem surrounding them.
Lambda monitoring done right
Successful Lambda performance monitoring requires a rather different approach:
Trace every call, because each call can be uniquely important
Trace everything automatically, keep Lambda code lean
Trace all types of systems, not only Lambda
Enable tracing via easy configuration, irrespective of when the Lambda function was written or last modified
Instana’s Lambda monitoring and tracing delivers on all these points:
Extending Instana’s unmatched tracing capabilities to Lambda: trace every call, automatically, without measurable overhead, end-to-end with all other systems traced by Instana.
Delivering the instrumentation in a way that is both native to Lambda functions, and incredibly lightweight to set up. In a nutshell, we ship AWS Lambda Layers for the runtimes we support, and all it takes to use them is a few easy settings that can be configured via the AWS Console, the AWS Command Line Interface or by other automated means of deploying AWS Lambda functions, like CI/CD pipelines.
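As an illustration of the "other automated means", the sketch below shows how a deployment script might attach a layer and a few environment variables to an existing function. It assumes a boto3 Lambda client and uses its `get_function_configuration` and `update_function_configuration` calls; the function name `attach_tracing_layer`, the layer ARN and the environment variable names are placeholders, not the actual values of Instana's layers, so take those from the product documentation.

```python
def attach_tracing_layer(lambda_client, function_name, layer_arn, env_vars, handler=None):
    """Add a layer and environment variables while keeping the existing ones."""
    current = lambda_client.get_function_configuration(FunctionName=function_name)

    # An update replaces the whole layer list, so start from what is there.
    layers = [layer["Arn"] for layer in current.get("Layers", [])]
    if layer_arn not in layers:
        layers.append(layer_arn)

    # Same for environment variables: merge rather than overwrite.
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables.update(env_vars)

    params = {
        "FunctionName": function_name,
        "Layers": layers,
        "Environment": {"Variables": variables},
    }
    if handler is not None:
        params["Handler"] = handler  # only if the layer requires a wrapper handler
    return lambda_client.update_function_configuration(**params)


# Example call (placeholder values, requires boto3 and AWS credentials):
# import boto3
# attach_tracing_layer(
#     boto3.client("lambda"),
#     "my-function",
#     "arn:aws:lambda:REGION:ACCOUNT_ID:layer:LAYER_NAME:VERSION",
#     {"TRACING_ENDPOINT_URL": "https://example.invalid", "TRACING_KEY": "changeme"},
# )
```

Because the client is passed in, the same function drops into a CI/CD pipeline step and can be exercised against a stub without touching AWS.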
With Instana tracing, distributed complexity is easily visualized: as you see every trace and every function involved in serving it, understanding what goes wrong and in which component becomes straightforward. Instana’s Service Dependency map also serves as a blueprint of your actual architecture, acting as live documentation of how your system (of systems) is structured. Several Instana customers even use the distributed tracing data to inform their architectural decisions, assessing the coupling between components to decide whether or not to keep them as separate services.
^Instana’s Service Dependency Map(s) are always accurate, adjusting to all changes in real-time
Eliminate unpredictable observability costs
From our customers we have heard that using X-Ray as the tracing service for Lambda results in unpredictable costs. The X-Ray cost model is based on consumption, both for collecting traces and for retrieving them for analysis, and costs can get out of hand as application workloads increase significantly.
Instana’s Lambda monitoring doesn’t require X-Ray, which means you get to eliminate that operational cost completely. Instana’s Lambda monitoring price is based on how many functions you monitor, no matter how many requests they serve.
Distributed tracing of Lambda functions with Instana
Instana’s AutoTrace™ technology collects a distributed trace for every request. AutoTrace automatically instruments your services and functions without any code modification on your part. Instana’s AutoTrace works with all your services, applications, and databases, no matter where or how they run.
^Instana’s view of a trace that includes Lambda and non-Lambda workloads
This means you can trace every request into your AWS Lambda functions and out into your services running on-prem, into your cloud providers or other SaaS offerings. Instana seamlessly captures all the necessary tracing information and delivers the correlated call information automatically.
To see how Instana’s automated Lambda tracing works in your environment, sign up for a free trial today.