GKE Monitoring | Best Practices & Tools to Use

July 3, 2022

Monitoring is essential, but can be challenging, especially in distributed environments. This article will walk you through the nuances of monitoring a Kubernetes cluster deployed on GKE.

Damaso Sanoja
Engineer

When Google partnered with the Linux Foundation in 2015 to form the Cloud Native Computing Foundation (CNCF) and release the code for Kubernetes v1.0, few people foresaw that it would become so successful. Nevertheless, according to the 2020 CNCF Survey, the use of Kubernetes in production increased from 78% in 2019 to 83% in 2020, reaffirming Kubernetes as the leading open source container orchestration platform. The widespread adoption of Kubernetes has created a need for monitoring systems that provide engineering teams with the metrics they need to observe Kubernetes clusters proactively.

This article will guide users on how to monitor a Google Kubernetes Engine (GKE) cluster, a managed environment designed by Google to run Kubernetes clusters.

Importance of a GKE Monitoring Stack | The Basics

Given the flexibility and scalability of Kubernetes, a monitoring stack is crucial. Monitoring provides you with vital information, allowing you to make informed decisions and alerting you to potential problems. A Kubernetes cluster is incredibly complex and, while you need visibility into application performance and underlying infrastructure metrics to improve its efficiency and prevent problems, it’s difficult to manage this without a monitoring stack.

What exactly do the terms monitoring and metrics refer to? The best way to illustrate the meaning and scope of these terms is through examples. 

A classic example of monitoring is to collect metrics on the status of resources used by applications to ensure that everything is working as expected. In GKE, a simple way to do this is from the integrated command console, known as Cloud Shell. To check the status of the Pods, you can use the <terminal inline>kubectl get pods<terminal inline> command.

Get pods

GKE also comes ready with the <terminal inline>kubectl top<terminal inline> command, which allows you to check the resources used by the Pods using <terminal inline>kubectl top pods<terminal inline>.

kubectl top pods
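Beyond the basic form, <terminal inline>kubectl top<terminal inline> accepts a few useful flags for quick triage. The following is a sketch; the <terminal inline>--sort-by<terminal inline> flag assumes a reasonably recent kubectl version.

```shell
# List the heaviest pods first, across all namespaces
kubectl top pods --all-namespaces --sort-by=memory

# Check per-node CPU and memory usage
kubectl top nodes
```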

While these metrics are helpful for a quick check, they’re a far cry from a full-featured monitoring solution that collects and displays data in real time. Fortunately, GKE has a monitoring and alert system that shows metrics in real time via a graphic interface.

VM instances in GKE

Thanks to Cloud Monitoring's Metrics Explorer, you can monitor node health, resource usage, disk performance, node load, ingress packets per VM, and several other Kubernetes-specific metrics.

Metrics explorer

These metrics have different use cases. For example, monitoring cluster performance and resource availability is crucial for knowing whether the cluster needs to be scaled up, or whether a traffic bottleneck requires revising the load balancer configuration. In the following section, we'll take a closer look at the features offered by the monitoring system included in GKE.

GKE Built-In Monitoring Tools

One of the main reasons that GKE is so popular is the number of metrics that can be monitored simply and conveniently. This section will help you understand the monitoring options built into GKE.

Cloud Operations for GKE Features

Google Cloud's operations suite consists of a group of built-in tools and services for logging, monitoring, and application performance management.

Fully Managed Cloud Logging

The cornerstone of Google Cloud’s operations suite is its Cloud Logging, which automatically collects logs from all infrastructure and applications running on Google Cloud, including GKE. All the logs obtained by this highly scalable service can be conveniently accessed at any time in the Cloud Console.

Log explorer

From the Logs Explorer, you can see an overview of the logs, filter the results according to various criteria such as severity, resources, or log name, and even turn query results into log-based metrics and alert policies. Moreover, Cloud Logging's flexibility allows you to use the Ops Agent, Fluentd, or an API to collect data from custom sources such as applications, on-prem sources, or other clouds. Another advantage of this service is that it allows you to choose where to save the captured logs. This means you're not limited to storing the logs in Cloud Logging storage; you can also export them to Google Cloud Storage, or stream them via Cloud Pub/Sub to any third-party provider.

Logs storage
Logs router
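As a sketch of how routing works, the following creates a log sink that sends GKE container errors to a Cloud Storage bucket. The sink and bucket names here are hypothetical; substitute your own.

```shell
# Route GKE container error logs to a Cloud Storage bucket via the Log Router
# (hypothetical sink and bucket names)
gcloud logging sinks create gke-error-sink \
  storage.googleapis.com/my-gke-logs-bucket \
  --log-filter='resource.type="k8s_container" AND severity>=ERROR'
```

The same filter syntax works interactively in the Logs Explorer query box.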

Another convenient feature of Cloud Logging is the Logs Dashboard, where you can view information such as GKE container errors or GKE container logs as graphs. By hovering the mouse over an area, you can get information from the log, and even analyze it in detail.

Logs dashboard
Logs detail

The ease of viewing data and the ability to access logs from all sources in a single place makes Cloud Logging a powerful troubleshooting and analysis tool.

Cloud Monitoring

Similar to Cloud Logging, the Cloud operations suite offers a complete monitoring solution, Cloud Monitoring, which provides visibility into both infrastructure and applications, whether they're hosted on-premises, in Google Cloud, or in another cloud. Out of the box, Cloud Monitoring gives you an overview of the events of your Kubernetes cluster, as well as dashboards that put a wealth of information at your fingertips. If necessary, you can also define your own custom metrics and export them to third-party services for further analysis.

Cloud monitoring overview dashboard

Cloud Monitoring also includes a handy GKE Dashboard that offers you a bird's-eye view of your Kubernetes clusters.

GKE dashboard

The true power of Cloud Monitoring, though, is how easy it makes it to navigate through the different metrics using the Metric Explorer, which allows you to analyze metrics in real time, identify correlations, problems, or points of interest, and add graphs with these metrics to any dashboard. Best of all, you can configure the charts visually or do it programmatically by using the Monitoring Query Language (MQL) built into the Metrics Explorer.

Cloud monitoring metrics explorer
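To give a feel for MQL, here is a sketch of a query that charts the per-pod CPU usage rate for your GKE containers. The exact metric name and grouping are assumptions based on the standard Kubernetes metrics; adjust them to your cluster.

```
fetch k8s_container
| metric 'kubernetes.io/container/cpu/core_usage_time'
| align rate(1m)
| every 1m
| group_by [resource.pod_name], mean(val())
```

You can paste a query like this into the MQL editor in the Metrics Explorer and then save the resulting chart to any dashboard.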

A monitoring system wouldn’t be complete without a way to define alerts. The Alerting menu offers you a summary of recent incidents, as well as the ability to create alert policies based on metrics of interest. You can also configure notification channels such as Cloud Mobile App, PagerDuty Services, PagerDuty Sync, Slack, Webhooks, Email, SMS, and Cloud Pub/Sub to receive these alerts.

Cloud monitoring alerting
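Alert policies can also be managed from the command line. The following is a sketch; the policy file name is hypothetical, and the commands sit under the gcloud alpha surface, so availability may depend on your gcloud version.

```shell
# Create an alert policy from a JSON definition (hypothetical file name)
gcloud alpha monitoring policies create --policy-from-file=high-cpu-policy.json

# List the notification channels you can reference from a policy
gcloud alpha monitoring channels list
```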

Application Performance Management

Other tools included in the Cloud operations suite are Cloud Trace, which allows you to understand the flow of information and detect latency issues; Cloud Debugger, which enables you to inspect applications in real time without having to stop them; and Cloud Profiler, which constantly analyzes the performance of your code on each service to help you improve speed and keep costs under control. All of these tools leverage the power of Cloud Logging and Cloud Monitoring to offer a complete solution for managing the performance of your applications.

APM

There’s also a section for error reporting. In it, you can view the most recent errors and their status, as well as configure notifications that offer the same options as the alerts.

Error reporting

Cloud Operations Configuration and Pricing

For new GKE clusters, Cloud Logging and Cloud Monitoring are enabled by default, though you can disable them from the Google Cloud Console. Because these services are enabled by default, you can monitor your GKE clusters, manage your system, debug logs, and analyze cluster performance right out of the box.

Like most providers, the Google Cloud operations suite offers a free monthly allotment for each of its services. Beyond that allotment, each service is billed separately, and pricing depends on how much data is ingested. For more information about operations suite pricing, you can check the pricing documentation.

Google Cloud Managed Service for Prometheus

Google Cloud Managed Service for Prometheus is an optional Cloud Monitoring service that allows you to monitor and receive alerts for your workloads using Prometheus without having to worry about manually managing or scaling it. Google Cloud offers this fully managed storage and query service for Prometheus metrics in two modes: managed data collection and self-deployed data collection. For more information on the differences between these modes, you can read the Prometheus documentation; for the purposes of this article, we'll focus on managed collection, because it's the recommended option for GKE environments.

To get started with the managed collection, you’ll need to make sure that Google Cloud APIs and Services are enabled. You’ll also need to configure your environment using the <terminal inline>gcloud config set project PROJECT_ID<terminal inline> and <terminal inline>kubectl config set-cluster CLUSTER_NAME<terminal inline> commands. Once this is done, you can create a namespace and set up the managed collection using the following commands:


kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.1.1/examples/setup.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.1.1/examples/operator.yaml

After these changes, the service will be ready to use, but you’ll still need to configure the <terminal inline>PodMonitoring<terminal inline> custom resource for ingesting metric data to Prometheus. The detailed procedure to configure this resource is found in the documentation. Once you’ve configured the target scraping and metrics ingestion using the <terminal inline>PodMonitoring<terminal inline> resource, you’ll have access to the Managed Service for Prometheus in the Cloud Monitoring menu.

Managed Prometheus
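For reference, a minimal <terminal inline>PodMonitoring<terminal inline> resource might look like the sketch below. The app label and port name are hypothetical, and the apiVersion should match the operator version you deployed, so check the documentation for your release.

```yaml
# Minimal PodMonitoring sketch; labels, port name, and apiVersion are assumptions
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app-monitoring
spec:
  selector:
    matchLabels:
      app: example-app      # hypothetical label on the Pods to scrape
  endpoints:
  - port: metrics           # named container port exposing /metrics
    interval: 30s           # scrape interval
```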

In the graphical interface, you can use PromQL to run queries in real time. Although having Prometheus is great, it’s even better when combined with Grafana.
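As a concrete example, a PromQL query like the following returns the per-pod CPU usage rate. The metric name assumes the standard cAdvisor/kubelet metrics are being scraped, so adjust it to whatever your targets actually expose.

```
# Per-pod CPU usage rate over the last five minutes
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
```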

Data Visualization in Grafana

Grafana is one of the most widely used, open source data visualization and analysis tools available today. Among other factors, its popularity is due to the fact that it can pull and process metrics from almost any source, which allows it to integrate smoothly with Prometheus to process massive amounts of data from your infrastructure.

Grafana can be found in the Google Cloud Marketplace. One of the easiest ways to integrate Grafana into your GKE cluster is by using Google Click to Deploy.

Grafana setup

With the Grafana application installed, just follow the directions to forward Grafana’s UI port to your local machine using <terminal inline>kubectl port-forward<terminal inline>.

Grafana
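The port-forward step can be sketched as follows. The namespace and service names are hypothetical; run <terminal inline>kubectl get svc --all-namespaces<terminal inline> to find the actual values created by your deployment.

```shell
# Forward the Grafana service to your local machine (hypothetical names)
kubectl port-forward --namespace grafana svc/grafana 3000:3000

# Grafana is now reachable at http://localhost:3000
```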

Once that’s done, you’ll have access to the Grafana UI and will be able to import or create dashboards using Prometheus metrics.

Enhancing GKE Monitoring with ContainIQ

While Google Cloud offers a robust monitoring and logging system, you might need more complex or specific information than what it offers. This is where solutions like ContainIQ come in. ContainIQ provides Kubernetes monitoring and observability natively and instantly, due in large part to its use of eBPF, which allows sandboxed programs to run securely in the Linux kernel with minimal overhead.

ContainIQ offers you pre-built dashboards with metrics, Kubernetes events, cluster and application logs, as well as latency. The metrics dashboards include a Pod metrics dashboard and node metrics dashboard, which clearly display the status of each resource, the primary metrics, the limits, and the averages over time.

ContainIQ Pod Metrics Dashboard

Although these dashboards offer information similar to that of Cloud Monitoring for GKE, they also offer a color-coded, easy-to-understand visual representation, the ability to add Pods and nodes from multiple clusters to the same view, and a frictionless way to create filters and alerts. In addition, users are able to quickly correlate metrics to events to logs.

ContainIQ Event Dashboard

Something similar could be said of the Kubernetes event dashboard, which offers you an overview of the cluster's historical events, as well as alerts, warnings, and the ability to filter events by severity.

ContainIQ’s logging feature set is helpful when debugging. ContainIQ stores and saves logs from the cluster itself and from the applications running on it. Users are able to view, sort, and track logs over time, as well as correlate events to the logs at a given point in time. For example, a user could easily view the logs leading up to a Pod eviction or CrashLoopBackOff.

ContainIQ Latency Dashboard

Finally, you have the latency dashboard, which can measure and monitor latency by microservice and URL path, and can filter this data by date range. A unique feature of this dashboard is that it works without the need to install application packages or middleware. A common functionality of all the ContainIQ dashboards is the ability to use data points or events to create alerts that can then be integrated into a Slack channel, with support for more integrations coming soon. 

You can get ContainIQ up and running quickly. Once you’ve created an account, you’ll receive detailed implementation instructions in the form of a Helm chart or YAML file. Install the chart and you’re good to go.
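The Helm install typically boils down to a couple of commands like the sketch below. The repo URL and chart name here are placeholders, not ContainIQ’s real values; use the ones from your onboarding instructions.

```shell
# Hypothetical repo URL and chart name; substitute the values
# from your ContainIQ onboarding instructions
helm repo add containiq https://example.com/containiq-charts
helm install containiq containiq/containiq \
  --namespace containiq --create-namespace
```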

ContainIQ costs $20 per node per month, plus $1 per GB of log ingest. You can sign up for an account here.

Final Thoughts

In this article, we’ve explored how to monitor Kubernetes clusters with GKE using Google Cloud’s operations suite, Managed Prometheus, and Grafana. You’ve also seen how ContainIQ takes the monitoring and analysis of your Kubernetes cluster to a new level, thanks to unique visualizations that show you the metrics of your infrastructure in a clear, straightforward way, while also offering unique insights into the performance of your cluster. 

Damaso Sanoja
Engineer

Damaso has been in the automotive/IT world since the age of 14, when his father decided to buy him a Commodore computer. Years later, his passion for electronics, computer science, and automotive mechanics motivated him to graduate in Mechanical Engineering from Universidad Metropolitana. For years, Damaso specialized in software engineering and networks. Today, Damaso is doing what he loves the most: writing engaging content for engineers and DevOps professionals looking to advance their careers
