Monitoring user-facing software has always been important, no doubt about it. But with the unrelenting migration of software to the cloud, and the adoption of microservices and serverless architectures such as Function as a Service (FaaS), monitoring is now business-critical. These new ways of building modern software involve many moving parts, and developers need to see through the complexity to quickly diagnose performance and functional issues. Doing that before your users notice requires the right tools.
Unfortunately, you can’t simply carry your legacy monitoring solution over to the cloud. Monitoring in the cloud poses special challenges because it differs fundamentally from monitoring physical servers: for one, you don’t always have access to the hardware your apps and services are running on.
Before we get into the types of tools you can use to monitor your software in the cloud, there’s a question we need to answer: how exactly is cloud monitoring different?
How Is Cloud Monitoring Different?
It may be easy to deploy software into the cloud, but you need to think strategically about how you’re going to monitor not just the services and apps, but also the infrastructure and platforms that host them. Remember, you won’t have physical access to the servers running the software, so it’s important to have the right tools in place to understand what your users are experiencing.
Legacy monitoring tools require configuration as new servers come and go, but that just won’t cut it in the cloud. In order to make sure your cloud-based app is still accessible, isn’t running slowly, and functions the way you expect, your monitoring tools must be able to handle one of the fundamental benefits of cloud computing: easy scaling. These tools need to automatically adjust and continuously monitor cloud resources, even as they come and go.
Considering all of these requirements, we can divide effective cloud monitoring tools into three distinct categories: those that track page speed and availability, those that monitor app performance, and those that analyze logs.
Page Speed And Availability
In a way, cloud apps are a victim of their own success. Because users can access the software anytime and anywhere, outages and availability issues cause customer service nightmares. Ensuring that your web site is always accessible can mean the difference between a good product and a great one. For this reason, whichever cloud monitoring tool you pick needs to frequently check the availability of your sites.
On the plus side, whether your sites are available has a binary answer: either a group of users can access them or they can’t. A more subtle problem is unpacking your visitors’ user experience, which is affected by every aspect of your site, from page load times to broken login portals. Catching problems here involves recording the steps your users take while going through your site, and inspecting page elements to find which ones are loading slowly or not at all.
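To make the idea of an availability check concrete, here is a minimal sketch using only the Python standard library. It is not how any particular monitoring product works; the names `CheckResult` and `check_availability` are made up for illustration, and the rule that any 2xx or 3xx status counts as "available" is an assumption you may want to tighten.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    """Outcome of one availability probe (hypothetical structure)."""
    url: str
    available: bool
    status: Optional[int]
    elapsed_seconds: float
    error: Optional[str] = None


def check_availability(url: str, timeout: float = 5.0) -> CheckResult:
    """Issue one GET request and report whether the site answered successfully."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # include the body download in the timing
            status = response.status
        # Assumption: any 2xx or 3xx status counts as "available".
        return CheckResult(url, 200 <= status < 400, status, time.monotonic() - start)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code (4xx or 5xx).
        return CheckResult(url, False, exc.code, time.monotonic() - start, str(exc))
    except OSError as exc:
        # DNS failures, refused connections, and timeouts all end up here
        # (urllib.error.URLError is a subclass of OSError).
        return CheckResult(url, False, None, time.monotonic() - start, str(exc))
```

Calling `check_availability` on a schedule from several locations, and alerting when `available` is false or `elapsed_seconds` crosses a threshold, gives a rough approximation of the binary availability answer described above. It does not cover the harder part, which is replaying multi-step user journeys.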
Fortunately, there are tools that make this process easier. Pingdom is a cloud-based tool that monitors the availability of your sites and notifies you when your site goes down, when page content changes, or when HTTP error status codes are triggered. Using its dashboard, you can see at a glance which of your apps are experiencing issues. Crucially for analyzing page speed, the dashboard also allows you to break down page requests into individual steps so you can identify the latency bottlenecks annoying your users.
App Performance
Monitoring page speed and availability gives you a high-level understanding of how your site is performing. But to drill down and view things at a finer grain you need data directly from your web app, and for many products, app performance monitoring is the only option for truly understanding your app’s health. Rather than the external view that page speed provides, app performance analysis shows how the app’s internals are working.
Traditionally, understanding your app’s performance has involved tracking things like CPU utilization, memory consumption, and the use of other hardware resources. These metrics are still a valuable source of insight into app behavior, and many cloud platforms record resource performance metrics while your software is running on them. For example, Amazon CloudWatch records the CPUUtilization metric for EC2 instances, which shows the percentage of compute units currently in use on an instance, and can be a critical piece of the puzzle when chasing performance bottlenecks.
But resource performance data can only tell you so much about your app. If your service or app is running on a FaaS platform such as AWS Lambda, you can’t monitor those hardware metrics because, effectively, you have no visibility into which compute resources are being used.
In that case, you need to emit performance data from directly inside your code. Integration with app performance tools is key so that your monitoring tool can easily digest the metrics you send it, whether they’re custom-generated metrics in your app or derived from the runtime, such as .NET, or database tools such as MongoDB.
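As a rough illustration of emitting performance data from inside your code, the sketch below times a block of work and records the result as a custom metric. Everything here is hypothetical: `record_metric`, `timed`, `flush`, and the metric name are invented for the example, and the in-memory buffer stands in for whatever client library your monitoring tool actually provides.

```python
import json
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional

# In-memory buffer standing in for a real metrics client (an assumption).
_buffer: List[dict] = []


def record_metric(name: str, value: float, tags: Optional[Dict[str, str]] = None) -> None:
    """Append one measurement to the buffer."""
    _buffer.append({"name": name, "value": value, "tags": tags or {}, "time": time.time()})


@contextmanager
def timed(name: str, **tags: str) -> Iterator[None]:
    """Measure how long the wrapped block takes and record it in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the wrapped block raised.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        record_metric(name, elapsed_ms, tags)


def flush() -> List[str]:
    """Serialize buffered measurements as JSON lines and clear the buffer."""
    lines = [json.dumps(metric, sort_keys=True) for metric in _buffer]
    _buffer.clear()
    return lines


def handle_request(order_id: str) -> str:
    """Hypothetical request handler instrumented with a custom metric."""
    with timed("orders.lookup.duration_ms", handler="handle_request"):
        time.sleep(0.01)  # stand-in for real work such as a database query
    return f"order {order_id} ok"
```

In a FaaS setting, the function would typically call something like `flush()` before returning so that no measurements are lost when the execution environment is frozen or recycled. How the serialized metrics are shipped depends entirely on the tool you integrate with.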
One example of an application performance management and server monitoring tool that supports all of the features mentioned above is AppOptics. This tool allows you to monitor both your cloud infrastructure and application, and collect metrics to quickly identify performance issues and bottlenecks. To handle a range of deployed software, AppOptics includes over 150 integrations and plugins for popular languages, frameworks, and platforms.
Log Analysis
While app, infrastructure, and performance monitoring deal with metrics — those values that describe efficiency or speed — log monitoring provides a richer way to understand your software’s behavior. Frequently, log messages are created in the app itself because that’s often the best place to detect anomalous conditions and generate helpful diagnostic messages. For example, contextual messages can be generated in error handlers to aid with troubleshooting.
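Here is a small sketch of a contextual log message generated in an error handler, using Python's standard `logging` module. The `checkout` logger name and the functions `charge_customer` and `process_order` are hypothetical and exist only to give the handler something to report on.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("checkout")


def charge_customer(customer_id: str, amount_cents: int) -> None:
    """Hypothetical payment call that rejects non-positive amounts."""
    if amount_cents <= 0:
        raise ValueError(f"invalid amount: {amount_cents}")


def process_order(order_id: str, customer_id: str, amount_cents: int) -> bool:
    """Charge the customer and log enough context to troubleshoot failures."""
    try:
        charge_customer(customer_id, amount_cents)
    except ValueError:
        # logger.exception logs at ERROR level and appends the traceback,
        # so the message only needs to carry the business context.
        logger.exception(
            "charge failed order_id=%s customer_id=%s amount_cents=%d",
            order_id, customer_id, amount_cents,
        )
        return False
    logger.info("charge succeeded order_id=%s", order_id)
    return True
```

Writing the identifiers as `key=value` pairs is a design choice rather than a requirement. It tends to make messages easier to search once logs from many hosts end up in one place.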
Since the cloud comes with the benefit of auto-scaling, you’re more than likely to be running your code on multiple hosts, or as a collection of services if you’re using microservices. That means you’ll have multiple log files to collect before you can analyze things holistically, which you can accomplish through log aggregation — the process of collecting multiple log files into a single location and merging the log records together.
Aggregating your logs makes analysis much easier since you can see the entire picture of your software’s behavior at once, instead of looking at each host’s log file individually and having to piece together the whole.
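The merging step of log aggregation can be sketched in a few lines. This example assumes each host's log is already in time order and that every line starts with an ISO 8601 timestamp in the same time zone and format, so that timestamps sort correctly as plain strings. The function names and the line layout are assumptions for illustration, not the format of any specific tool.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

LogRecord = Tuple[str, str, str]  # (timestamp, host, message)


def parse_lines(host: str, lines: Iterable[str]) -> Iterator[LogRecord]:
    """Split 'TIMESTAMP MESSAGE' lines and tag each record with its host."""
    for line in lines:
        timestamp, _, message = line.strip().partition(" ")
        yield (timestamp, host, message)


def aggregate(streams: Dict[str, Iterable[str]]) -> List[LogRecord]:
    """Merge per-host streams, each already in time order, into one timeline."""
    parsed = [parse_lines(host, lines) for host, lines in streams.items()]
    # heapq.merge lazily interleaves sorted inputs without loading them all
    # into memory first; the key compares records by timestamp only.
    return list(heapq.merge(*parsed, key=lambda record: record[0]))
```

Given one stream per host, `aggregate` returns a single list in which records from different hosts are interleaved by time, which is the "entire picture" view described above. Real deployments also have to deal with clock skew between hosts, which this sketch ignores.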
But once you have the logs aggregated, searching through that sea of data can be challenging. Cloud monitoring tools also offer searching and filtering capabilities to help you home in on the data you’re looking for, even when the amount of data might overwhelm traditional tools.
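To give a feel for what searching and filtering means at its simplest, the sketch below scans aggregated lines for a pattern and optionally restricts the results to one host. The `search` function and the assumed "TIMESTAMP HOST MESSAGE" layout are illustrative only; dedicated tools index the data rather than scanning it line by line.

```python
import re
from typing import Iterable, Iterator, Optional


def search(lines: Iterable[str], pattern: str, host: Optional[str] = None) -> Iterator[str]:
    """Yield lines whose message matches a regex, optionally for one host."""
    regex = re.compile(pattern, re.IGNORECASE)
    for line in lines:
        # Assumed layout: "TIMESTAMP HOST MESSAGE".
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that do not fit the assumed layout
        _, line_host, message = parts
        if host is not None and line_host != host:
            continue
        if regex.search(message):
            yield line
```

Because `search` is a generator, it can be chained onto a stream of merged records and stops doing work as soon as the caller stops reading.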
One such tool is Papertrail — a log file analysis tool that provides a central location for aggregating logs. Using its single view, you can diagnose issues no matter where they are in your infrastructure before they affect your users. Papertrail also provides advanced searching and filtering, and a live tail feature so that you can pause, search, and scroll through log messages in real time.
Adapt To The Cloud
Cloud-based software is now ubiquitous, and users are coming to expect the always-on nature of apps in the cloud. Choosing the right mixture of page speed, app performance, and log analysis tools is key. Each must be designed to handle the fundamentals of running your software in the cloud: automatic scaling, high availability, and microservice architectures.
Ultimately, continuously monitoring your services, apps, and infrastructure is necessary for providing great user experiences and giving your users what they want — access to your software whenever and wherever.