When Google partnered with the Linux Foundation in 2015 to form the Cloud Native Computing Foundation (CNCF) and release the code for Kubernetes v1.0, few people foresaw that it would become so successful. Nevertheless, according to the 2020 CNCF Survey, the use of Kubernetes in production has increased from 78% in 2019 to 83% in 2020, reaffirming Kubernetes as the leading open source, container-orchestration platform. The widespread adoption of Kubernetes has created a need for monitoring systems that provide engineering teams with the metrics used to observe Kubernetes clusters proactively.
Importance of a GKE Monitoring Stack | The Basics
Given the flexibility and scalability of Kubernetes, a monitoring stack is crucial. Monitoring provides you with vital information, allowing you to make informed decisions and alerting you to potential problems. A Kubernetes cluster is incredibly complex and, while you need visibility into application performance and underlying infrastructure metrics to improve its efficiency and prevent problems, it’s difficult to manage this without a monitoring stack.
What exactly do the terms monitoring and metrics refer to? The best way to illustrate the meaning and scope of these terms is through examples.
A classic example of monitoring is to collect metrics on the status of resources used by applications to ensure that everything is working as expected. In GKE, a simple way to do this is from the integrated command console, known as Cloud Shell. To check the status of the Pods, you can use the <terminal inline>kubectl get pods<terminal inline> command.
GKE also comes ready with the <terminal inline>kubectl top<terminal inline> command, which allows you to check the resources used by the Pods using <terminal inline>kubectl top pods<terminal inline>.
While these metrics are helpful for a quick check, they’re a far cry from a full-featured monitoring solution that collects and displays data in real time. Fortunately, GKE has a monitoring and alert system that shows metrics in real time via a graphic interface.
Thanks to GKE Metric Explorer, you can monitor node health, resources used, disk performance, node load, ingress packets per VM, and several other Kubernetes-specific metrics.
These metrics have different use cases—for example, monitoring cluster performance and resource availability are crucial to know if the cluster needs to be scaled up, or if there’s a traffic bottleneck that requires revising the load balancer. In the following section, we’ll take a longer look at the features offered by the monitoring system included in GKE.
GKE Built-In Monitoring Tools
One of the main reasons that GKE is so popular is the number of metrics that can be monitored simply and conveniently. This section will help you understand the monitoring options built into GKE.
Cloud Operations for GKE Features
Google Cloud's operations suite consists of a group of built-in tools and services to enhance functionality.
Fully Managed Cloud Logging
The cornerstone of Google Cloud’s operations suite is its Cloud Logging, which automatically collects logs from all infrastructure and applications running on Google Cloud, including GKE. All the logs obtained by this highly scalable service can be conveniently accessed at any time in the Cloud Console.
From the Logs Explorer, you can see an overview of the logs, filter the results according to various criteria such as severity, resources, or log name, and even convert results to metrics and log-based alert policies. Moreover, Cloud Logging's flexibility allows you to use the Ops Agent, fluentd, or an API to collect data from custom sources such as applications, on-prem sources, or other clouds. Another advantage of this service is that it allows you to choose where to save the captured logs. This means you’re not limited to storing the logs in GCP Logs Storage—you can also export them to Google Cloud Storage, or stream them via Google Cloud Pub/Sub to any third-party provider.
Another convenient feature of Cloud Logging is the Logs Dashboard, where you can view information such as GKE container errors or GKE container logs as graphs. By hovering the mouse over an area, you can get information from the log, and even analyze it in detail.
The ease of viewing data and the ability to access logs from all sources in a single place makes Cloud Logging a powerful troubleshooting and analysis tool.
Similar to Cloud Logging, Cloud operations offers a complete solution Cloud Monitoring which provides visibility for both infrastructure and applications, whether they are hosted on premises, in Google Cloud, or in another cloud. Out of the box, Cloud Monitoring gives you an overview of the events of your Kubernetes cluster, as well as dashboards that put a wealth of information at your fingertips. If necessary, you can also define your own custom metrics and export them to third-party services for further analysis.
Cloud Monitoring also includes a handy GKE Dashboard that offers you a bird's-eye view of your Kubernetes clusters.
The true power of Cloud Monitoring, though, is how easy it makes it to navigate through the different metrics using the Metric Explorer, which allows you to analyze metrics in real time, identify correlations, problems, or points of interest, and add graphs with these metrics to any dashboard. Best of all, you can configure the charts visually or do it programmatically by using the Monitoring Query Language (MQL) built into the Metrics Explorer.
A monitoring system wouldn’t be complete without a way to define alerts. The Alerting menu offers you a summary of recent incidents, as well as the ability to create alert policies based on metrics of interest. You can also configure notification channels such as Cloud Mobile App, PagerDuty Services, PagerDuty Sync, Slack, Webhooks, Email, SMS, and Cloud Pub/Sub to receive these alerts.
Application Performance Management
Other tools included in the Cloud operations suite are Cloud Trace, which allows you to understand the flow of information and detect latency issues; Cloud Debugger, which enables you to inspect applications in real time without having to stop them; and Cloud Profiler, which constantly analyzes the performance of your code on each service to help you improve speed and keep costs under control. All of these tools leverage the power of Cloud Logging and Cloud Monitoring to offer a complete solution for managing the performance of your applications.
There’s also a section for error reporting. In it, you can view the most recent errors and their status, as well as configure notifications that offer the same options as the alerts.
Cloud Operations Configuration and Pricing
For new GKE clusters, Cloud Logging and Cloud Monitoring are enabled by default, though you can disable them from the Cloud Console or Google Cloud Console. Because these services are enabled by default, you can monitor your GKE clusters, manage your system, debug logs, and analyze cluster performance right out of the box.
Like most providers, Google Cloud operations suite offers a free monthly allotment for each of their services. Beyond that allotment, each service is billed separately, and pricing depends on how much data is processed. For more information about Cloud operations suite pricing, you can check the pricing documentation.
Google Cloud Managed Service for Prometheus
Google Cloud Managed Service for Prometheus is an optional Cloud Monitoring service that allows you to monitor and receive alerts for your workloads using Prometheus without having to worry about manually managing or scaling it. Google Cloud offers this fully managed storage and query service for Prometheus metrics in two modes: managed data collection and self-deployed data collection. For more information on the differences between these modalities, you can read the Prometheus documentation, but for the purposes of this article, you’ll focus on the managed service, because it’s the recommended option for GKE environments.
To get started with the managed collection, you’ll need to make sure that Google Cloud APIs and Services are enabled. You’ll also need to configure your environment using the <terminal inline>gcloud config set project PROJECT_ID<terminal inline> and <terminal inline>kubectl config set-cluster CLUSTER_NAME<terminal inline> commands. Once this is done, you can create a namespace and set up the managed collection using the following commands:
After these changes, the service will be ready to use, but you’ll still need to configure the<terminal inline>PodMonitoring<terminal inline> custom resource for ingesting metric data to Prometheus. The detailed procedure to configure this resource is found in the documentation. Once you’ve configured the target scraping and metrics ingestion using the <terminal inline>PodMonitoring<terminal inline> resource, you’ll have access to the Managed Service for Prometheus in the Cloud Monitoring menu.
Data Visualization in Grafana
Grafana is one of the most widely used, open source data visualization and analysis tools available today. Among other factors, its popularity is due to the fact that it can pull and process metrics from almost any source, which allows it to integrate smoothly with Prometheus to process massive amounts of data from your infrastructure.
Grafana can be found in the Google Cloud Marketplace. One of the easiest ways to integrate Grafana into your GKE cluster is by using Google Click to Deploy.
With the Grafana application installed, just follow the directions to redirect Grafana to your local machine using <terminal inline>kubectl port-forward<terminal inline>.
Once that’s done, you’ll have access to the Grafana UI and will be able to import or create dashboards using Prometheus metrics.
Enhancing GKE Monitoring with ContainIQ
While Google Cloud offers a robust monitoring and logging system, you might need more complex or specific information than what it offers. This is where solutions like ContainIQ come in. ContainIQ gives you Kubernetes monitoring and observability natively and instantly due in large part to the use of eBPF, which allows securely running sandboxed programs in the Linux kernel, eliminating latency problems.
ContainIQ offers you pre-built dashboards with metrics, Kubernetes events, cluster and application logs, as well as latency. The metrics dashboards include a Pod metrics dashboard and node metrics dashboard, which clearly display the status of each resource, the primary metrics, the limits, and the averages over time.
Although these dashboards offer information similar to that of Cloud Monitoring for GKE, they also offer a color-coded, easy-to-understand visual representation, the ability to add Pods and nodes from multiple clusters to the same view, and a frictionless way to create filters and alerts. In addition, users are able to quickly correlate metrics to events to logs.
Something similar could be said of the Kubernetes event dashboard, which offers you an overview of the cluster's historical events, as well as alerts, warnings, and the ability to filter events by severity.
ContainIQ’s logging feature set is helpful when debugging. ContainIQ stores and saves application logs from the cluster itself and from the applications. Users are able to view, sort, and track logs over time. As well as correlate events to logs at that given time. For example, a user could easily view the logs at a point in time leading up to a pod eviction or CrashLoopBackoff.
Finally, you have the latency dashboard, which can measure and monitor latency by microservice and URL path, and can filter this data by date range. A unique feature of this dashboard is that it works without the need to install application packages or middleware. A common functionality of all the ContainIQ dashboards is the ability to use data points or events to create alerts that can then be integrated into a Slack channel, with support for more integrations coming soon.
You can get ContainIQ up and running quickly once you’ve created an account, you’ll receive detailed implementation instructions in the form of a Helm chart or YAML file. Install the chart and you’re good to go.
ContainIQ costs $20 per node per month, plus $1 per GB of log ingest. You can sign-up for an account here.
In this article, we’ve explored how to monitor Kubernetes clusters with GKE using Google Cloud’s operations suite, Managed Prometheus, and Grafana. You’ve also seen how ContainIQ takes the monitoring and analysis of your Kubernetes cluster to a new level, thanks to unique visualizations that show you the metrics of your infrastructure in a clear, straightforward way, while also offering unique insights into the performance of your cluster.