What is a StatefulSet?

We can define it as the workload API object used to manage stateful applications. It manages the deployment and scaling of a set of pods, and offers guarantees about the order and uniqueness of the pods it manages.

Like Deployments, a StatefulSet manages pods based on an identical container specification, but unlike Deployments, a StatefulSet maintains a fixed identity for each of its pods. That is, although these pods are created from the same specification, they are not interchangeable: each one has a persistent identifier that it keeps across any (re)scheduling.

If we want to use storage volumes to provide persistence for our workload, StatefulSets can be part of a good solution. Although the pods that belong to a StatefulSet can still fail, their persistent identifiers make it easier to match the existing volumes to the new pods that replace the failed ones.

When to use a StatefulSet?

StatefulSets are especially useful for applications that require one or more of the following:

  • Unique and stable network identifiers.
  • Stable and persistent storage.
  • Ordered, graceful deployment and scaling.
  • Ordered, automated rolling updates.

If an application does not require persistent identifiers or ordered deployment, deletion, or scaling, we should implement it using a workload object that provides a set of stateless replicas. For example, a ReplicaSet or a Deployment would be more efficient in such a case.

Limitations to be taken into account

  • The storage for a given pod must either be provisioned by a “PersistentVolume” provisioner based on the requested storage class, or pre-provisioned by an administrator.
  • Deleting and/or scaling down a StatefulSet will not delete the volumes associated with it. This is intentional: the safety and persistence of the data is usually more valuable than an automatic deletion of all the resources associated with the StatefulSet.
  • Currently, StatefulSets require a “headless” Service that is responsible for the network identity of the deployed pods. We are responsible for creating this Service ourselves.
  • StatefulSets do not provide any guarantees about pod termination when the StatefulSet itself is deleted. To achieve an ordered, graceful termination of the pods, we can scale the StatefulSet down to 0 (for example, with kubectl scale statefulset <name> --replicas=0) before deleting it.
  • When using rolling updates with the default pod management policy (OrderedReady), it is possible to end up in a broken state that requires manual intervention to repair.

Components of a Kubernetes StatefulSet

Let’s analyze the following example of a StatefulSet.

The following example has been extracted from the official Kubernetes documentation and therefore does not reveal any type of configuration or sensitive data.
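
The manifest below reproduces that example from the Kubernetes documentation: a headless Service named “nginx” plus a StatefulSet named “web”. The line numbers mentioned in the remarks refer to this listing, and the container image is simply the one used in the documentation’s example.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  minReadySeconds: 10 # by default is 0
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi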

Remarks

  • Lines 1-13: The “headless” Service, named “nginx”, used to control the network domain. Here we expose a specific port and give it a name.
  • Lines 15-49: The StatefulSet, named “web”, whose spec indicates on line 24 the number of replicas of the “nginx” container that will be launched, each in its own pod.
  • Lines 41-49: The “volumeClaimTemplates” section provides stable (persistent) storage using PersistentVolumes provisioned by a PersistentVolume provisioner.

The name of the StatefulSet object must be a valid DNS subdomain name.

  • Lines 20-22: We must specify a pod selector whose labels match the “app” label set on line 29. Failing to specify a matching pod selector will result in a validation error during the StatefulSet creation process.
  • Line 25: We can set the optional “minReadySeconds” field in the spec, which defines the minimum number of seconds a newly created pod must be running and ready, without any of its containers crashing, before it is considered available. This option is especially useful when applying rolling updates, as it lets us check the progress of the update.

Pod identity

Pods that belong to a StatefulSet have a unique identity made up of an ordinal index, a stable network identity, and stable storage. This identity stays associated with the pod regardless of which node it is (re)scheduled on.

Ordinal index

For a StatefulSet with N replicas, each pod is assigned an integer ordinal from 0 to N-1 that is unique within the set.

Persistent Network Identifier

Each pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal).

The StatefulSet can use a “headless” Service to control the domain of its pods. This domain takes the form $(service name).$(namespace).svc.cluster.local, where “cluster.local” is the cluster domain.

As each pod is created, it gets a matching DNS subdomain of the form $(podname).$(service domain), where the service domain is defined by the “serviceName” field of the StatefulSet.

Depending on the DNS configuration of your cluster, you may not be able to look up the DNS name of a newly launched pod right away. This happens when other clients in the cluster sent queries for that pod’s hostname before the pod was created: because of negative caching, the results of those failed lookups are remembered and reused even after the pod is up and running.

To avoid this problem, and the few seconds of errors that would otherwise follow the creation of a pod, we can use one of the solutions described below.

  • Query the Kubernetes API directly (for example, using a watch) instead of relying on DNS lookups.
  • Decrease the caching time in your Kubernetes DNS provider (typically by editing the ConfigMap for CoreDNS, which currently caches lookups for 30 seconds); see the sketch after this list.
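
As an illustration only, and assuming CoreDNS deployed with the usual “coredns” ConfigMap in the kube-system namespace (the actual Corefile contents vary from cluster to cluster), lowering the cache time could look roughly like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        # cache lowered from the default 30 seconds to shrink the negative-caching window
        cache 5
        loop
        reload
        loadbalance
    }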

Remember that, as mentioned in the limitations section, we ourselves are responsible for creating the headless Service that provides the network identity of the pods.

Here are some examples of how the Cluster Domain, Service Name, and StatefulSet Name affect pod DNS.

Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname
cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1}
cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1}
kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1}

Persistent storage

For each “volumeClaimTemplates” entry defined in a StatefulSet, each pod receives a “PersistentVolumeClaim”. In the “nginx” example above, each pod receives a single PersistentVolume with the StorageClass “my-storage-class” and 1 GiB of provisioned storage.

When a pod is (re)scheduled onto a node, its “volumeMounts” mount the PersistentVolumes associated with its PersistentVolumeClaims.
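
As a rough illustration of what the controller creates, and assuming the claim names follow the usual <template name>-<pod name> pattern, the claim bound to pod “web-0” from the “www” template would look approximately like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # name derived from the volumeClaimTemplate ("www") and the pod ("web-0")
  name: www-web-0
  labels:
    app: nginx
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: my-storage-class
  resources:
    requests:
      storage: 1Gi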

Pod Name label

When the StatefulSet controller creates a pod, it adds a label, “statefulset.kubernetes.io/pod-name”, set to the name of the pod. This label allows us to attach a Service to a specific pod in the StatefulSet.
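
As a minimal sketch (the Service name and port are illustrative), a Service that targets only pod “web-0” of the “web” StatefulSet could use that label as its selector:

apiVersion: v1
kind: Service
metadata:
  name: web-0-direct   # illustrative name
spec:
  selector:
    # matches only the pod named web-0
    statefulset.kubernetes.io/pod-name: web-0
  ports:
  - port: 80
    targetPort: 80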

Deployment and Scaling Guarantees

If we have a StatefulSet with N replicas, the pods are deployed sequentially, in order from {0..N-1}. For the “web” example above with 3 replicas, that means web-0 first, then web-1, then web-2.

When the pods are being deleted, they are terminated in reverse order, from {N-1..0}.

Before a scaling operation is applied to a pod, all of the pod’s predecessors must be in a “Running” and “Ready” state.

Before a pod is terminated, all of its successors must be shut down completely.

We should never specify a value of 0 for “pod.Spec.TerminationGracePeriodSeconds”. This practice is unsafe and strongly discouraged.

Update Strategies

The .spec.updateStrategy field allows us to configure or disable automated rolling updates for the containers, labels, resource requests and limits, and annotations of the pods in a StatefulSet. It has two possible values.

OnDelete: When we set the “updateStrategy” type to “OnDelete”, the StatefulSet controller will not automatically update the pods belonging to the StatefulSet. Users must manually delete the pods to force the controller to create new pods that reflect the modifications made to the StatefulSet template.

RollingUpdate: This value enables automated rolling updates and is the default update strategy.
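
As a small sketch of where this field lives in the manifest (only the relevant fragment of the spec is shown, here opting into the non-default OnDelete value):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  # ...rest of the spec as in the example above...
  updateStrategy:
    type: OnDelete   # or RollingUpdate, which is the default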

Rolling Updates

When the update strategy (spec.updateStrategy.type) of a StatefulSet is set to RollingUpdate, the StatefulSet controller will delete and recreate each pod in the StatefulSet. It proceeds in the same order in which pods are terminated (from the largest ordinal to the smallest), updating each pod one at a time.

The Kubernetes control plane waits until an updated pod is running and ready before updating its predecessor. It also respects the value we have assigned to “spec.minReadySeconds”: after a pod transitions to the “ready” state, the control plane waits that long before moving on to update the rest of the pods.

Partitions

The RollingUpdate strategy allows us to perform the update partially. If we specify a partition (spec.updateStrategy.rollingUpdate.partition), all pods with an ordinal greater than or equal to the value of the partition will be updated when the StatefulSet template (spec.template) is modified.

Pods with an ordinal lower than the value of the partition will not be updated, and even if they are deleted and recreated, they will come back with the previous version of the StatefulSet template. If the partition value is greater than the number of replicas (spec.replicas), no pod will be updated.
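
For example, a sketch assuming the “web” StatefulSet with 5 replicas and a partition of 3: when the template changes, only web-3 and web-4 are updated, while web-0 through web-2 keep the previous revision.

spec:
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # only pods with ordinal >= 3 (web-3, web-4) receive the new template
      partition: 3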

Forced rollback

When using rolling updates with the default pod management policy (OrderedReady), it is possible to get into a broken state that requires manual intervention to fix. If we update the pod template to a configuration that never becomes “running” and “ready”, the StatefulSet will stop the rollout and wait.

In this state, it is not enough to revert the pod template to a good configuration. Due to a known issue, the StatefulSet will keep waiting for the broken pod to become “ready”, which will never happen, before it attempts to revert to the working configuration.

After reverting the template, we must also delete any pods that the StatefulSet has already tried to run with the bad configuration. The StatefulSet will then begin to recreate the pods using the reverted, working template.

Replicas

The optional field “spec.replicas” specifies the desired number of pods; the default value is 1.

If we need to scale a StatefulSet manually, we can either use the kubectl command directly or modify the “.yaml” manifest and re-apply the StatefulSet. With the kubectl command:

kubectl scale statefulset <statefulset-name> --replicas=X

If we have a “HorizontalPodAutoscaler” or a similar API to scale automatically, we should not specify this field, since the Kubernetes control plane will manage it for us.
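
As a hedged sketch of that scenario, assuming the autoscaling/v2 API and CPU-based scaling (the name, bounds, and threshold are purely illustrative), an autoscaler targeting the “web” StatefulSet could look like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80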