Multi-cluster monitoring and alerting
How to integrate metrics into a Go application and how to act on those metrics.
Metrics and monitoring are essential for ensuring the health and performance of Go applications. Metrics are data points that represent the state of an application, such as CPU usage, memory usage, and requests per minute. Monitoring is the process of collecting, storing, and analyzing metrics to identify problems and trends. By monitoring their applications, developers can proactively identify and fix problems before they impact users.
The D2iQ Kubernetes Platform (DKP) ships with a monitoring stack consisting of Prometheus, which collects metrics; Grafana, which visualizes and presents them; Alertmanager, which acts on metrics and sends alerts to third-party services; and Thanos, which provides multi-cluster functionality for Prometheus. These components are configured to provide monitoring and alerting for applications launched on DKP.
How to Emit Metrics From Go Applications
To emit metrics from a Go application, use the Prometheus client library, which provides a number of helpful functions. Applications can emit several types of metrics, and the metric type defines how Prometheus interprets the data.
Prometheus has four types of metrics (a short sketch declaring each one follows the list):
- Counters: Counters are metrics that can only increase or reset to zero. They are typically used to track things like the number of requests, the number of errors, and the number of bytes transferred.
- Gauges: Gauges are metrics that can go up and down. They are typically used to track things like the current memory usage, the current CPU usage, and the number of concurrent connections.
- Histograms: Histograms are used to track the distribution of values. They are typically used to track things like the response time of requests and the size of requests or responses.
- Summaries: Summaries are similar to histograms, but they also track the quantiles of the distribution. They are typically used to identify outliers and understand the tail of the distribution.
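As a quick illustration, here is a hypothetical snippet declaring one metric of each type with the Go client library (all metric names, buckets, and objectives are made up):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Counter: only increases (or resets to zero on restart).
	requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "example_requests_total",
		Help: "Total number of handled requests.",
	})

	// Gauge: can go up and down.
	activeConnections = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "example_active_connections",
		Help: "Number of currently open connections.",
	})

	// Histogram: counts observations into configurable buckets.
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_request_duration_seconds",
		Help:    "Request latency distribution.",
		Buckets: prometheus.DefBuckets,
	})

	// Summary: tracks configurable quantiles over a sliding time window.
	responseSize = promauto.NewSummary(prometheus.SummaryOpts{
		Name:       "example_response_size_bytes",
		Help:       "Response size distribution.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
)
```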
An example Go application that exposes an /increment HTTP handler and increases a counter metric looks like this:
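Below is a minimal sketch using the official Go client library; the listen port and response text are assumptions, and the counter is named my_app_requests_total to match the query used later in the article.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is registered with the default registry and shows up
// in Prometheus as my_app_requests_total.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "my_app_requests_total",
	Help: "Total number of /increment requests received.",
})

func main() {
	// Every call to /increment increases the counter by one.
	http.HandleFunc("/increment", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		fmt.Fprintln(w, "counter incremented")
	})

	// promhttp.Handler serves all registered metrics on /metrics in the
	// text exposition format that Prometheus scrapes.
	http.Handle("/metrics", promhttp.Handler())

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```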
For more details, see the Prometheus tutorial on integrating the client library and emitting metrics.
How to Integrate Metrics With DKP
Prometheus uses a pull model to collect metrics: it actively fetches metrics from targets rather than having the targets push metrics to Prometheus. In the example above, the Prometheus handler only registers an HTTP handler on the /metrics path; it does not actively push the collected metrics anywhere.
When an application runs in a Pod on a Kubernetes cluster, it is usually exposed via a Service resource that makes the Pod's network ports available to the rest of the cluster. The Prometheus operator that comes with DKP provides the custom resource definitions (CRDs) ServiceMonitor and PodMonitor, which are used to configure a Prometheus instance to include a particular Kubernetes service in the scrape targets from which Prometheus reads metrics data.
When creating a ServiceMonitor, it is necessary to include the prometheus.kommander.d2iq.io/select: "true" label on the resource. Based on this label, the default Prometheus instance installed on DKP will include the ServiceMonitor configuration. The Prometheus operator can run multiple Prometheus instances, and the label selector is used to associate service monitors with a particular Prometheus instance.
In the example below, the ServiceMonitor instructs Prometheus to scrape data from the Service that matches the label app: my-go-app, on the http port and the /metrics path.
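A sketch of such a ServiceMonitor (the resource name and namespace are placeholders; the selector, port, path, and select label follow the description above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-go-app               # placeholder name
  namespace: default            # placeholder namespace
  labels:
    # Required so that the default DKP Prometheus selects this monitor.
    prometheus.kommander.d2iq.io/select: "true"
spec:
  selector:
    matchLabels:
      app: my-go-app
  endpoints:
    - port: http
      path: /metrics
```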
To confirm that the metrics are being scraped by Prometheus, visit https://<CLUSTER_DOMAIN>/dkp/prometheus/graph and enter my_app_requests_total in the query console to see the metric.
Alerting
Alerting is the process of notifying users when a metric or set of metrics exceeds a predefined threshold, which makes it possible to identify problems proactively before they impact users. DKP comes with an Alertmanager installation that receives alerts from Prometheus and sends them to a variety of notification channels, such as email, Slack, or PagerDuty. Alertmanager itself does not create alerts; it only propagates and routes notifications based on routing rules. Alerts are created by Prometheus, which continuously evaluates metric values and fires alerts by calling the Alertmanager API.
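As an illustration of routing, a hypothetical Alertmanager configuration that sends critical alerts to Slack and everything else to a default receiver might look like this (the receiver names, matcher, channel, and webhook URL are all placeholders):

```yaml
# alertmanager.yaml (sketch)
route:
  receiver: default
  group_by: ["alertname"]
  routes:
    # Alerts labeled severity=critical are routed to Slack; everything
    # else falls through to the default receiver.
    - receiver: slack-critical
      matchers:
        - severity="critical"

receivers:
  # The default receiver has no notifier configured in this sketch.
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: "#alerts"
```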
To create an alert definition, use the PrometheusRule resource:
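A sketch of such a rule, firing once the counter from the earlier example exceeds 5 (the alert name, namespace, duration, and severity are illustrative; any additional labels required by the cluster's rule selector would need to be added as well):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-go-app-alerts        # placeholder name
  namespace: default            # placeholder namespace
spec:
  groups:
    - name: my-go-app.rules
      rules:
        - alert: MyAppTooManyRequests
          # Fires once the total request counter from the example app
          # exceeds 5.
          expr: my_app_requests_total > 5
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: my-go-app has received more than 5 requests
```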
The Prometheus operator will apply this configuration to the default Prometheus instance, which is configured to push alerts to Alertmanager. When the number of requests goes over 5, Prometheus will create a new alert by calling the Alertmanager HTTP API. PrometheusRule resources can be created on the DKP management cluster or on attached clusters.
Multi-Cluster Monitoring and Alerting
The DKP metrics stack is configured by default to collect metrics across all managed clusters and make them available on the central DKP management cluster. For that purpose, DKP comes with Thanos, a tool that extends Prometheus's capabilities for collecting and storing metrics across multiple clusters. The Thanos Query component can query metrics from multiple Prometheus servers, which makes it possible to view and analyze metrics from all of the clusters in a single place.
To make alerting possible on metrics from all clusters, it is necessary to enable the Thanos Ruler component. Thanos Ruler evaluates Prometheus recording and alerting rules against a chosen query API and can send the results directly to remote storage. This makes alerting across multiple clusters possible: rules are evaluated against a central query API that aggregates data from all of the clusters.
To enable the Ruler on the DKP management cluster, add the following ConfigMap:
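A sketch of such an override, assuming the kube-prometheus-stack Helm values layout (the ConfigMap name and namespace are placeholders; the Ruler's query endpoints, Alertmanager URL, and ingress wiring are environment-specific and omitted):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-prometheus-stack-overrides   # placeholder name
  namespace: kommander                     # placeholder namespace
data:
  values.yaml: |
    thanosRuler:
      enabled: true
      thanosRulerSpec:
        # Serve the Ruler UI under /dkp/ruler.
        routePrefix: /dkp/ruler
        externalPrefix: /dkp/ruler
        # Only PrometheusRule resources labeled role: thanos-alerts
        # are evaluated by the Ruler.
        ruleSelector:
          matchLabels:
            role: thanos-alerts
        # Query endpoints, Alertmanager URL, and ingress settings are
        # environment-specific and left out of this sketch.
```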
And apply the override to the kube-prometheus-stack AppDeployment.
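For reference, the override is typically attached through the AppDeployment's configOverrides field; the following is a sketch assuming the Kommander AppDeployment API (the apiVersion, app version, and namespace may differ between DKP releases):

```yaml
apiVersion: apps.kommander.d2iq.io/v1alpha3   # may differ between DKP releases
kind: AppDeployment
metadata:
  name: kube-prometheus-stack
  namespace: kommander
spec:
  appRef:
    name: kube-prometheus-stack-<VERSION>     # placeholder app version
    kind: ClusterApp
  configOverrides:
    # References the ConfigMap created above.
    name: kube-prometheus-stack-overrides
```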
Once applied, this override deploys the Thanos Ruler on the DKP management cluster, exposes its UI at the https://<CLUSTER_DOMAIN>/dkp/ruler URL, and limits the rules used by the Ruler to those labeled role: thanos-alerts. The Ruler is configured the same way as Prometheus, using the PrometheusRule resource. Limiting the configuration to a specific label makes it possible to select which configuration is applied to Prometheus and which is applied to the Thanos Ruler. The exact same PrometheusRule resources can now create alerts for data coming from multiple clusters, as in the example below.
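For example, a rule along these lines would alert on the request counter aggregated across every cluster visible to Thanos Query (the names, namespace, duration, and severity are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: multi-cluster-alerts    # placeholder name
  namespace: kommander          # placeholder namespace
  labels:
    # Matches the ruleSelector above, so this rule is evaluated by the
    # Thanos Ruler rather than by a single-cluster Prometheus.
    role: thanos-alerts
spec:
  groups:
    - name: multi-cluster.rules
      rules:
        - alert: TooManyRequestsAcrossClusters
          # Sums the counter over every cluster visible to Thanos Query.
          expr: sum(my_app_requests_total) > 5
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: More than 5 requests received across all clusters
```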
The final result looks like this.
Conclusion
Multi-cluster monitoring is an important feature of the DKP platform because it allows you to monitor multiple Kubernetes clusters from a single pane of glass. This can help you to identify and troubleshoot problems that affect multiple clusters, and to plan for the capacity needs of your clusters. DKP gives administrators flexibility to define and deploy various monitoring and alerting configurations per cluster or in a single centralized location.