Etcd Performance Benchmarking
Ensure your etcd server is running on reliable storage
Is Your Etcd Fast Enough?
When you install the Kubernetes management platform of your choice, there are certain minimum hardware requirements you need to meet. At D2iQ, for example, we have these requirements.
These hardware recommendations provide a solid starting point, but they don’t necessarily reflect the actual performance of the hardware. We rarely question the hardware’s performance, perhaps because we trust modern technologies to perform as expected, or because we trust the cloud provider and the published specifications (for an example, see the AWS recommendations).
However, the actual performance might differ depending on the CPU’s brand, frequency, and other specifications. For example, if a platform provider suggests setting up two CPU cores and two GB of RAM, the exact same resources could perform differently depending on the CPU brand and type.
Most likely, as you are reading this article, you already have your cluster up and running. So you are not really exploring other hardware options; rather, you want to ensure that your existing hardware is performing at the expected level. In this article, our focus is storage in the context of Kubernetes itself. You may be running other stateful applications such as MySQL, but those are beyond the scope of this article.
Etcd
etcd (pronounced et-see-dee) is the primary datastore of Kubernetes. It is a critical component that stores all Kubernetes resources in a cluster, so it is very important that etcd operations are performed at an ideal speed. An etcd instance with poor performance is a clear indicator that your customers’ experience is being significantly impacted.
If you see the following or similar messages in your etcd server logs, it is important that you do not ignore them:
etcdserver: read-only range request … took too long (xxxx) to execute
Such messages are indicators that etcd is not performing well and, according to etcd’s official documentation, this is usually caused by:
- Contention between etcd and other apps
- Slow disk
- CPU starvation
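One quick way to check for these messages is to grep the etcd pod logs. The sketch below assumes a kubeadm-style cluster where etcd runs as static pods labeled component=etcd in the kube-system namespace; the label and namespace may differ on your distribution.

    # Search recent etcd logs for slow-request warnings
    kubectl -n kube-system logs -l component=etcd --tail=10000 | grep "took too long"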
Benchmarking via Etcd Metrics
For real-time monitoring and debugging, you can use etcd metrics. etcd exposes metrics in Prometheus format that can help you distinguish between the cases above:
- wal_fsync_duration_seconds
- backend_commit_duration_seconds
The first metric measures how long etcd takes to fsync its write-ahead log entries to disk before applying them; the second measures how long the backend takes to commit an incremental snapshot of its most recent changes to disk. Whichever one you watch, keep in mind that high values for these metrics mean high disk operation latencies and point to disk issues.
As the etcd documentation suggests, the 99th percentile duration should be less than 25 ms for storage to be considered fast enough.
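If these metrics are already scraped by Prometheus, you can check that threshold directly. Note that the full names exported by etcd are etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds. The query below is a minimal sketch; the Prometheus address is a placeholder you would replace with your own endpoint.

    # 99th percentile of etcd backend commit latency over the last 5 minutes
    # (should stay below 0.025 seconds, i.e. 25 ms)
    curl -sG 'http://prometheus.example.com:9090/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le))'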
Benchmarking via Fio
If you are running etcd on Linux machines, another way to benchmark your storage performance is to use Fio, a popular tool for simulating I/O workloads.
Installation
Ubuntu
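On Ubuntu, fio is available in the standard repositories; a typical install (assuming sudo privileges) looks like this:

    sudo apt-get update
    sudo apt-get install -y fio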
CentOS 7
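On CentOS 7, fio can be installed with yum; depending on your setup it may come from the base repository or require EPEL:

    sudo yum install -y epel-release   # only needed if fio is not found in your enabled repos
    sudo yum install -y fio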
openSUSE and SLE
You can install Fio from the official page here.
Other distributions
To install Fio on other operating systems, visit the Fio GitHub page and select your binary from the list.
Benchmarking
Let’s create a new directory named test-dir under the storage device you want to test, and then run fio against it.
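The invocation below is a sketch based on the fio benchmark commonly recommended for etcd: --fdatasync=1 forces an fdatasync after every write, mimicking how etcd persists its write-ahead log, and the size and block-size values are illustrative, so adjust them for your environment. You need fio 3.5 or newer to get the fsync latency percentiles shown in the next step.

    mkdir test-dir
    # sequential writes with one fdatasync per write, against the directory under test
    fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-dir --size=22m --bs=2300 --name=mytest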
The following output is an example from an etcd node of a D2iQ cluster running on an AWS EC2 instance of type m5.xlarge. Check the 99th percentile of fdatasync!
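The relevant section of the output looks roughly like the excerpt below; the exact formatting and units depend on your fio version, and all values other than the 99th percentile discussed next are elided.

    fsync/fdatasync/sync_file_range:
      sync (usec): min=..., max=..., avg=..., stdev=...
      sync percentiles (usec):
       | ..., 99.00th=[ 9138], ...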
You can see that the 99th percentile is 9138 microseconds, or roughly 9.1 ms of latency, which is an acceptable latency.
To understand how the different flags work or to explore other use cases, see the Fio command line options guide.