The internet plays such a big part in our lives, but most people don’t think about the infrastructure supporting it—except, of course, when it fails.
From a business owner’s point of view, you can own all the software and hardware you need to provide your service, but ownership also brings regular maintenance. On the software side, you need to apply security fixes, patches, and updates, make backups, and so on. On the hardware side, you must consider electricity, network cabling, your internet provider, building security, hardware failures, and more.
So, to offload part of the maintenance involved in running your application, you can use a cloud provider, which handles much of that work and leaves you free to focus on your business. One of the most used is Amazon Web Services (AWS). AWS has the infrastructure your application needs, the security measures you must comply with, and most of the software to support your business.
Still, nothing is perfect, and even AWS may fail. One way to be prepared for when it happens is to monitor the services you use. AWS already provides some monitoring features, but they can either be too expensive or not present the information in a way that’s easy to visualize and act on.
Here, I’ll show you how to create monitoring practices, what services to monitor, and what information to measure.
Monitoring Practices
There are a lot of elements involved in cloud services, and it’s easy to get lost. As recommended by AWS, keep in mind the following topics.
Monitoring Goals
What do you want to achieve by monitoring your services? Is it to make sure they’re up and running? That your services are performing correctly? Do you want to ensure resource consumption stays within your budget? Does your service present any failures, and if so, are they critical?
Answering the above questions will help you focus on the elements that provide you the best insights into your service’s health. These answers will also guide you on what resources to monitor, how often, what tools to use, and how to get notified if an unexpected issue arises.
Resources to Monitor
Monitoring every link in a cloud service chain can be cumbersome. It can also involve a lot of extra resources, which you have to pay for. So, you must focus on the resources that are vital to keeping your services running well.
For example, if your application saves customer data, first you must be sure your database is available. You also have to make sure the service that saves the data is executing correctly. Finally, you need to know your clients are able to connect to your service. Consequently, there are at least three resources to monitor: your database, your back-end server, and your front-end service (e.g., a mobile app or web application front end).
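To make the three-resource example concrete, here’s a minimal sketch of a check registry. The check functions are hypothetical placeholders I made up for illustration; real ones would query your database, call a back-end health endpoint, and load your front end.

```python
from typing import Callable, Dict, List


# Hypothetical placeholder checks. Each returns True when the resource
# looks healthy; replace the bodies with real probes.
def check_database() -> bool:
    return True


def check_backend() -> bool:
    return True


def check_frontend() -> bool:
    return False  # simulated failure, for illustration only


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check and record whether each resource looks healthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as an unhealthy resource.
            results[name] = False
    return results


def failing_resources(results: Dict[str, bool]) -> List[str]:
    """Return the names of the resources that failed, in sorted order."""
    return sorted(name for name, healthy in results.items() if not healthy)


CHECKS = {
    "database": check_database,
    "backend": check_backend,
    "frontend": check_frontend,
}
```

Calling `failing_resources(run_checks(CHECKS))` gives you the short list of resources that need attention.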
Frequency
After you’ve chosen what you need to monitor, you must define how often to monitor it. If your service is available only during working hours in America, there’s no need to have it available—and even less to monitor it—outside of business hours. But if you have an online store, it must be available 24/7, all year long. In this case, you’ll have to monitor each resource with a granularity of minutes, or even seconds.
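As a rough sketch of that decision, the function below says whether a resource should be monitored at a given moment. The Monday-to-Friday schedule and the 9-to-18 default hours are assumptions of mine, not a recommendation.

```python
from datetime import datetime


def should_monitor(
    now: datetime,
    always_on: bool = False,
    open_hour: int = 9,    # assumed opening hour (local time)
    close_hour: int = 18,  # assumed closing hour (local time)
) -> bool:
    """Decide whether monitoring should be active at the time `now`."""
    if always_on:
        # Something like an online store is monitored around the clock.
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and open_hour <= now.hour < close_hour
```

An online store would pass `always_on=True`, while an office-hours service would rely on the schedule.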
Tools
Since you’ve already defined what to monitor, and how often, now you need to choose the proper tools for the job. A ping will tell you the server where your database resides is working, but you won’t be able to infer your hard drive is full and your database can’t save any additional information. For every resource you’re monitoring, there must be a tool to report its health status.
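To illustrate the gap a ping leaves, here’s a small sketch that checks disk usage with Python’s standard library. The 90 percent threshold is an arbitrary assumption; pick a value that fits your own database’s growth.

```python
import shutil


def percent_used(total_bytes: int, free_bytes: int) -> float:
    """Return the percentage of a disk that is not free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def disk_nearly_full(path: str = "/", threshold_percent: float = 90.0) -> bool:
    """Report whether the disk holding `path` is at or past the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return percent_used(usage.total, usage.free) >= threshold_percent
```

A server can answer pings all day while `disk_nearly_full()` returns True, which is why each resource needs a tool that reports the health signal you actually care about.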
Notifications
Finally, whenever there’s an incident or an unexpected health status, you need to notify someone. That person has to be available to acknowledge the notification and must know how to react to the event. It’s not useful to find an email in the morning letting you know your online store has been down since 10 p.m. yesterday, only to discover you don’t have the credentials for the production server.
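One possible way to wire up a notification is a CloudWatch alarm that publishes to an Amazon SNS topic. The sketch below only builds the arguments for boto3’s `put_metric_alarm`. Using SNS, the `Sum` statistic, the five-minute evaluation window, and every name and ARN in the usage note are my assumptions for illustration.

```python
from typing import Any, Dict


def build_alarm_params(
    alarm_name: str,
    namespace: str,
    metric_name: str,
    dimensions: Dict[str, str],
    threshold: float,
    sns_topic_arn: str,
    period_seconds: int = 60,
    evaluation_periods: int = 5,
) -> Dict[str, Any]:
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: periods with no data points count as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Send the alarm definition to CloudWatch (needs boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the builder runs without it

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

For example, `build_alarm_params("store-api-5xx", "AWS/ApiGateway", "5XXError", {"ApiName": "store-api"}, 0, "arn:aws:sns:us-east-1:123456789012:on-call")` describes an alarm that fires when a hypothetical API returns any server-side error for five consecutive minutes. Whoever subscribes to that topic should be the person who can actually act on it.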
AWS Services to Watch
AWS is one of the main cloud providers worldwide, so the service list is long. For all of them, AWS already provides a monitoring tool: Amazon CloudWatch.
Amazon CloudWatch is a monitoring service that provides data and insights to monitor applications, services, and resource utilization, among other features. It collects data as logs, metrics, and events to provide service health status.
As developers, we use only a subset of services: the ones our projects require. In my experience, the services I’ve used most are virtual machines, storage, APIs, functions, and databases. All of them connect to CloudWatch, and we can get enough information to determine our app’s health status.
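As a sketch of how you might pull one of these metrics programmatically, the code below builds the arguments for boto3’s `get_metric_statistics`. The one-hour window, five-minute period, and `Average` statistic are assumed defaults, and the instance ID in the usage note is a placeholder.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def build_metric_query(
    namespace: str,
    metric_name: str,
    dimensions: Dict[str, str],
    end_time: datetime,
    minutes: int = 60,
    period_seconds: int = 300,
    statistic: str = "Average",
) -> Dict[str, Any]:
    """Build keyword arguments for the CloudWatch get_metric_statistics call."""
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": period_seconds,
        "Statistics": [statistic],
    }


def fetch_datapoints(query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Run the query against CloudWatch (needs boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the builder runs without it

    response = boto3.client("cloudwatch").get_metric_statistics(**query)
    # Data points are not guaranteed to arrive in time order, so sort them.
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

For instance, `build_metric_query("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0123456789abcdef0"}, datetime.now(timezone.utc))` asks for the last hour of average CPU usage for a placeholder instance.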
Metrics by Service
API Gateway
This is a service for creating Representational State Transfer (REST) APIs. The metrics available from this service allow us to evaluate client-side errors, server-side errors, and API performance by measuring the time between receiving a request from a client and sending a response.
4XXError
Number of client-side errors (server responses with 4XX status).
5XXError
Number of server-side errors (server responses with 5XX status).
Count
Total number of API requests.
IntegrationLatency
Time between relaying a request to the back end and receiving a response from it.
Latency
Total time between receiving a request from a client and sending a response to it. This includes IntegrationLatency time.
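These raw numbers are often easier to act on as derived values. Here’s a small sketch, assuming all inputs cover the same time period and the latencies are in milliseconds. The function and key names are my own.

```python
from typing import Dict


def error_rates(count: float, errors_4xx: float, errors_5xx: float) -> Dict[str, float]:
    """Turn raw error counts into fractions of total requests."""
    if count <= 0:
        # No traffic in the period, so there is nothing to divide by.
        return {"client_error_rate": 0.0, "server_error_rate": 0.0}
    return {
        "client_error_rate": errors_4xx / count,
        "server_error_rate": errors_5xx / count,
    }


def gateway_overhead_ms(latency_ms: float, integration_latency_ms: float) -> float:
    """Time spent in API Gateway itself rather than in the back end."""
    return latency_ms - integration_latency_ms
```

A rising server error rate points at your back end, while a growing gap between Latency and IntegrationLatency suggests the time is being spent in the gateway layer.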
Elastic Compute Cloud
This is a web service to provide compute capacity in the cloud—in other words, a virtual machine. The metrics in this service provide system usage and performance.
CPUUtilization
This is the percentage of allocated EC2 compute units currently in use by the instance. It’s the processing power your application is using.
DiskReadBytes
This is the number of bytes read from all instance store volumes available. This metric can help you gauge how fast your application reads from disk.
DiskWriteBytes
This is the number of bytes written to all instance store volumes available. This metric can help you gauge how fast your application writes to disk.
Elastic File System
This system provides storage for use with Elastic Compute in AWS. These metrics are useful to detect throughput and I/O threshold limits.
PercentIOLimit
This percentage shows how close a file system is to reaching the I/O limit of the General Purpose performance mode.
PermittedThroughput
This is the maximum amount of throughput a file system is allowed, given the file system size and BurstCreditBalance. Burst credits let a file system burst to throughput levels above its baseline level for periods of time.
Elasticsearch
This service helps to deploy and operate Elasticsearch clusters for log analytics, application monitoring, and clickstream analysis. These metrics help you identify a cluster’s index status, size, and documents available.
ClusterStatus.green, ClusterStatus.yellow, and ClusterStatus.red
These metrics indicate the status of index shards across the nodes in the cluster: green means all primary and replica shards are allocated, yellow means all primary shards are allocated but at least one replica shard isn’t, and red means at least one primary shard is unallocated.
Nodes
This is the number of nodes in an Elasticsearch cluster.
SearchableDocuments
This is the number of searchable documents across all indices in the cluster.
FreeStorageSpace
As the name implies, it’s the free space for all data nodes in the cluster.
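Here’s a minimal sketch of how these values could be folded into a single health summary. It assumes each ClusterStatus metric reports 1 when the cluster is in that state and 0 otherwise, and that you know how many nodes you expect. The function names are hypothetical.

```python
def cluster_status(green: float, yellow: float, red: float) -> str:
    """Collapse the three status metrics into one label, worst state first."""
    if red >= 1:
        return "red"
    if yellow >= 1:
        return "yellow"
    if green >= 1:
        return "green"
    return "unknown"  # no metric reported a state for this period


def missing_nodes(expected_nodes: int, reported_nodes: int) -> int:
    """Number of nodes that should be in the cluster but were not reported."""
    return max(expected_nodes - reported_nodes, 0)
```

Checking red before yellow and green means the most severe state wins if the data points ever disagree.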
Lambda
AWS Lambda lets you execute code without manually configuring or provisioning any servers. These metrics help you identify errors and concurrency limits.
Errors
This measures the number of invocations that failed due to errors in the function (response code 4XX).
Throttles
This measures the Lambda function invocation attempts that were throttled due to invocation rates exceeding the customer’s concurrent limits.
Relational Database Service
This service is for configuring relational databases in the cloud. The metrics I mention here help you review database performance.
CPUUtilization
This is the percentage of CPU utilization.
DatabaseConnections
This is the number of database connections in use.
ReadThroughput and WriteThroughput
These are the average number of bytes read from and written to disk per second, respectively.
Simple Storage Service
This service provides storage to store and retrieve data from anywhere. This means you can store and retrieve simple files, applications, and static websites. Metrics here allow you to review service usage.
NumberOfObjects
This is the total number of objects stored in a bucket.
AllRequests
This is the number of HTTP requests made to a bucket.
BytesDownloaded
This is how many bytes were downloaded by requests made to a bucket.
BytesUploaded
This is how many bytes were uploaded by requests made to a bucket.
Summary
As I’ve explained above, sometimes it’s better to delegate infrastructure management to a third party (in this case, AWS). Still, that doesn’t mean the service will never fail. You need to have the proper set of monitoring goals and tools to be prepared for such an eventuality. If you haven’t experienced the simple-to-install and easy-to-use SolarWinds Observability SaaS (formerly known as SolarWinds Observability), maybe now is the time. Sign up for a 30-day free trial.
This post was written by Juan Pablo Macias Gonzalez. Juan is a computer systems engineer with experience in back end, front end, databases, and systems administration.