The first edition was published in 2020, and with the pace of change being as brutal and unforgiving as it is, I started making notes for the second edition within a month of finishing the manuscript. The overall structure has remained the same, but I go far deeper into the different topics in the 2nd edition. There are also more visuals and a couple of new topics. This series of articles provides a summary of each of the chapters with some personal afterthoughts. Serverless Beyond the Buzzword 2nd edition can be purchased here: https://link.springer.com/book/10.1007/978-1-4842-8761-3
Logging is an important aspect of any application development and is certainly not unique to Serverless architecture. However, the distributed and asynchronous nature of Serverless makes troubleshooting considerably more challenging if there is no logging strategy in place.
Proactive logging
Traditionally, application logging has been used reactively: teams wait for an issue to be reported and then review the logs to identify the root cause. This means that remediation only happens after the damage has been done. Furthermore, not all users will report issues; some will simply leave and use another solution.
While regularly reviewing logs for insights is a good practice, doing so while trying to resolve issues is not ideal. This is even more the case with Serverless, where many components generate hundreds of logs in different places and formats. A better approach is to automate proactive log analysis and respond to problems before they are reported. Proactive logging can be achieved with existing cloud services or, for more advanced actions, with custom microservices.
Creating logs is easy, and logging is often turned on by default or through a simple configuration option. For example, the console output from Lambda microservices will automatically be written to CloudWatch Logs. One consideration here is to ensure that such logs are created and formatted consistently, which makes analysis and automated responses easier. API Gateway and many other services can also write logs to CloudWatch Logs if so configured; for these, the service itself determines the content and format of the logs.
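As an illustration, a minimal sketch of consistent, structured logging from a Python Lambda handler might look like the following. The field names and service name are assumptions for the example, not a prescribed standard:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Anything written to stdout/stderr or via the logging module ends up
    # in the function's CloudWatch Logs log group. Emitting JSON with a
    # consistent set of fields makes filtering and automated analysis easier.
    logger.info(json.dumps({
        "level": "INFO",
        "service": "order-service",          # hypothetical service name
        "requestId": context.aws_request_id,
        "message": "Order received",
        "orderId": event.get("orderId"),     # hypothetical event field
    }))
    return {"statusCode": 200}
```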
We use proactive logging to react more quickly to issues and, ideally, to remove the dependence on users reporting them: the system analyses and responds to logs immediately, without waiting for feedback. Most developers will be familiar with autoscaling, which follows a similar principle; we do not wait for a user to complain that an application is slow before scaling. The metrics are monitored, and additional resources are added automatically when certain thresholds are crossed.
Proactive logging can help reduce the time needed to respond to incidents. As soon as an error happens in a given microservice, it will trigger an alert to the application team. It is near-real-time and does not wait for any user to report the error. The notification can include context and other details to help identify the root cause and resolve the issue.
Logging services such as CloudWatch Logs will capture and store logs. They provide search and filtering functions, as well as some visualisation. However, they will not respond to log events automatically unless this is configured. There are a few approaches, but two common ones are CloudWatch Alarms and CloudWatch Subscriptions.
AWS CloudWatch
CloudWatch is the most common AWS service used for performance and activity monitoring of cloud resources. Out of the box, CloudWatch ingests application logs and metrics, stores logs encrypted at rest, and provides features such as filters and dashboards to help analyse them.
CloudWatch Logs stores the application logs typically used for debugging and tracking microservice activity. CloudWatch metrics are collected and managed by AWS once they have been enabled for a particular service. We use them to monitor utilisation, identify bottlenecks and improve performance. With Serverless architecture, we are most interested in metrics such as Lambda memory utilisation and concurrency; metrics such as CPU utilisation, uptime and disk storage are often less relevant for this architecture.
With CloudWatch Alarms, we can set thresholds and trigger notifications when the thresholds are crossed. These thresholds can be configured with a maximum limit, minimum limit, average and mathematical formulas such as percentiles, trimmed mean and windowed mean.
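As a rough sketch of how such an alarm can be created programmatically (the function name, threshold, and SNS topic below are purely illustrative), a boto3 call might look like this:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when a hypothetical function reports more than 5 errors within a
# single 5-minute period. For percentile-based thresholds, the
# ExtendedStatistic parameter (e.g. "p95") can be used instead of Statistic.
cloudwatch.put_metric_alarm(
    AlarmName="order-service-errors",  # illustrative alarm name
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "order-service"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[
        "arn:aws:sns:eu-west-1:123456789012:ops-alerts"  # illustrative SNS topic
    ],
)
```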
CloudWatch Subscriptions enable near-real-time analysis of log events. Once configured, CloudWatch streams incoming logs to a Lambda microservice. The microservice can analyse log contents for type, context, involved users, systems, and data. Compared to CloudWatch Alarms, microservices can do far more complex types of analysis, such as detecting personal data or potentially fraudulent behaviour. For example, the microservice could identify an error log event, track down the associated user or system, determine if any sensitive data is at risk, and detect any related anomalies. The microservice can then report everything to the application team or take more direct action via a cloud service API. Microservices with the right access permissions can block a user account, lock a data store, invalidate an encryption or API key, or roll an application build back to a previous version. In extreme cases, it could even take an application offline and replace it with a maintenance page.
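CloudWatch Logs delivers subscription data to the Lambda microservice as base64-encoded, gzip-compressed JSON. A minimal sketch of a handler that unpacks this payload and flags error messages could look like the following; the "ERROR" check stands in for whatever analysis logic an application actually needs:

```python
import base64
import gzip
import json

def handler(event, context):
    # Subscription data arrives under event["awslogs"]["data"] as
    # base64-encoded, gzip-compressed JSON.
    payload = json.loads(
        gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    )

    for log_event in payload.get("logEvents", []):
        message = log_event["message"]
        # Illustrative check only: real analysis could look for personal
        # data, fraud patterns, or specific error codes, and then notify
        # the team or call other AWS APIs to take direct action.
        if "ERROR" in message:
            print(json.dumps({
                "logGroup": payload.get("logGroup"),
                "logStream": payload.get("logStream"),
                "message": message,
            }))
```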
Optimisation
AWS tracks performance metrics for most cloud services, and we can use these metrics to optimise our application and architecture. For example, CloudWatch Alarms can be configured to send an alert when certain thresholds are crossed. Many dropped requests might mean that we need to adjust the throttling limits in API Gateway, while a high execution duration on Lambda microservices might mean we need to split them into smaller microservices or rightsize their memory configuration. We can also track latency, which can help identify bottlenecks in our architecture.
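For example, a quick way to review execution duration is to query the metric directly. The sketch below pulls the p95 duration for a hypothetical function over the last 24 hours; the function name and time window are assumptions for the example:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Consistently high p95 durations may point to a function that should be
# split into smaller microservices or given more memory.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": "order-service"}],
    StartTime=end - timedelta(hours=24),
    EndTime=end,
    Period=3600,
    ExtendedStatistics=["p95"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"]["p95"], "ms")
```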
Testing
Testing is a fundamental aspect of any form of development. It ensures that the designed system behaves as expected, ideally in both typical and edge use cases. In a serverless context where there are many moving parts, testing plays an even greater role in ensuring the various integrations work well together.
For new deployments, deciding what to test is straightforward. Usually, all the tests are run. However, when deploying changes to an existing application, only specific components may have been updated. While we could still run all the tests, this can take time, impacting productivity. Ideally, we would only test the changed components, but this could mean issues are missed in dependent but unchanged components.
A multi-repository approach to Serverless can mitigate these dependencies to some extent. In this approach, each repository has its own deployment pipeline, and its tests only cover that deployment. A separate repository contains the integration tests between the different components and the end-to-end tests for entire workflows. These run less frequently and only after the individual repositories have passed all of their own tests.
Another solution is personal development environments. This is relatively easy and cost-effective with a fully Serverless solution, as you are billed only for actual utilisation. A private environment enables developers to run their tests without interference from, or dependencies on, other developers. The CI/CD pipeline can be configured to deploy each developer's branch to its own unique environment.
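One way to achieve this, sketched below, is to derive the deployment stack name from the Git branch so that each branch lands in its own isolated stack. The environment variable, naming convention, and use of AWS SAM here are assumptions, not the only way to set this up:

```python
import os
import re
import subprocess

# Derive a stack name from the branch so each developer's branch deploys
# into its own isolated environment (e.g. "myapp-feature-login").
branch = os.environ.get("BRANCH_NAME", "dev")  # hypothetical pipeline variable
stack_name = "myapp-" + re.sub(r"[^a-zA-Z0-9-]", "-", branch)

subprocess.run(
    [
        "sam", "deploy",
        "--stack-name", stack_name,
        "--resolve-s3",
        "--capabilities", "CAPABILITY_IAM",
        "--no-confirm-changeset",
        "--no-fail-on-empty-changeset",
    ],
    check=True,
)
```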
Integration testing
Integration tests are critical to Serverless architecture, given the reliance on integrations with various cloud services. They cover a more significant portion of the application and are also more fine-grained, enabling us to quickly find the cause of an issue. However, fine-grained testing can be costly to develop and can take a long time to run. Their main characteristics are summarised in the list below, and a sketch of one such test follows it.
- They test for performance, latency, and bottlenecks, which can then be prioritised in optimisation efforts.
- They can be automated, although with higher complexity. Additional components (microservices) and services (X-Ray or Synthetics) must be configured and deployed to enable automation.
- Integration tests often need a lot more effort to develop, or at least a shared execution framework through which the individual tests can be executed.
- Generally, integration tests must be run on deployed infrastructure to ensure accurate and up-to-date service responses and error messages.
- They tend to need a lot of time to run. This includes the time required to deploy infrastructure and code changes, service run times, network latency, trigger latency, and various other factors.
- These tests incur additional costs due to running on resources deployed on the cloud.
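As a sketch of what such a test can look like against deployed infrastructure, the example below calls a deployed API endpoint and asserts on the response. The endpoint URL, path, payload, and response fields are all hypothetical:

```python
import json
import os
import urllib.request

# The endpoint is assumed to be exported by the deployment pipeline; the
# default value below is purely illustrative.
API_URL = os.environ.get(
    "API_URL", "https://example.execute-api.eu-west-1.amazonaws.com/dev"
)

def test_create_order_returns_201():
    request = urllib.request.Request(
        API_URL + "/orders",
        data=json.dumps({"productId": "abc-123", "quantity": 1}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises an HTTPError for 4xx/5xx responses, which fails the test.
    with urllib.request.urlopen(request) as response:
        assert response.status == 201
        body = json.loads(response.read())
        assert "orderId" in body
```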
AWS testing services
AWS Serverless Application Model (SAM) can help with testing microservices locally without needing to deploy anything. SAM creates a container and hosts your microservice code locally to emulate the Lambda service. SAM can also fetch logs from the cloud and display them locally to help with troubleshooting.
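As an illustration of that local workflow, a handler like the one below can be exercised with `sam local invoke` without deploying anything; the function name and event file referenced in the comments are hypothetical:

```python
# app.py - a minimal handler that SAM can run locally, for example:
#   sam local invoke OrderFunction --event events/order.json
#   sam local start-api
# (The logical function name and event file above are illustrative.)
import json

def handler(event, context):
    # Echo back the parsed request body so the local invocation has
    # something observable to assert on or inspect in the logs.
    body = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "body": json.dumps({"received": body}),
    }
```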
CodeBuild can test code to ensure everything works before deployment. Tests that can be performed locally can usually also be performed in CodeBuild, fully automated, as part of the deployment pipeline. Failed tests will see the deployment rejected and sent back to the developer, helping with quality control and reducing the chance of issues reaching production.
CloudWatch Synthetics is a fully managed service within CloudWatch that can create and automate end-to-end tests. These tests can be custom test scripts or pre-defined blueprints provided by AWS. Synthetics can run on demand, in response to an event in your environment, or on a fixed schedule.
Device Farm is a testing service that provides an extensive range of browsers and both virtual and physical mobile devices to test applications on. It can run tests fully automated and concurrently across multiple platforms.
Check out the book's mini site for more information and ordering here: https://serverlessbook.co