Demystifying Kubernetes Scheduling: A Deep Dive into Predicates and Priorities

This article looks at the stages of the scheduling process where the two main scheduling strategies, Predicates and Priorities, come into play.

Predicates

Let's first take a look at Predicates. Their role in the scheduling process can be understood as a filter: according to the scheduling policy, they "filter" out, from all the nodes in the current cluster, the set of nodes that meet the conditions. Each of these nodes is a potential host for the Pod waiting to be scheduled. In Kubernetes, the default filtering policies fall into four groups.

The first type is called GeneralPredicates.

As the name suggests, this group of filtering rules covers the most basic scheduling policies. For example, PodFitsResources checks whether the host's CPU and memory resources are sufficient. Of course, as mentioned earlier, PodFitsResources only looks at the Pod's requests field. It is worth noting that the Kubernetes scheduler does not define specific resource types for hardware such as GPUs; instead, it describes them with a Key-Value format called Extended Resource. For example:

apiVersion: v1
kind: Pod
metadata:
  name: extended-resource-demo
spec:
  containers:
  - name: extended-resource-demo-ctr
    image: nginx
    resources:
      requests:
        alpha.kubernetes.io/nvidia-gpu: "2"
      limits:
        alpha.kubernetes.io/nvidia-gpu: "2"

It can be seen that this Pod declares that it needs two NVIDIA GPUs using the definition alpha.kubernetes.io/nvidia-gpu=2. In PodFitsResources, the scheduler does not actually know that this Key means "GPU"; it simply uses the Value in its calculation. Of course, the Node's Capacity field must also advertise the total number of GPUs on that host, for example alpha.kubernetes.io/nvidia-gpu=4. I will cover these mechanisms in detail later when explaining the Device Plugin.

PodFitsHost checks whether the node's name matches the Pod's spec.nodeName. PodFitsHostPorts checks whether the host ports requested by the Pod (spec.containers[].ports[].hostPort) conflict with ports already in use on the node. PodMatchNodeSelector checks whether the node under examination matches the Pod's nodeSelector or nodeAffinity, and so on.

As you can see, GeneralPredicates form the most basic set of conditions Kubernetes uses to decide whether a Pod can run on a Node at all. For this reason, GeneralPredicates are also called directly by other components, such as the kubelet.
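To make the last two checks concrete, here is a minimal sketch of a Pod that would exercise PodFitsHostPorts and PodMatchNodeSelector (the Pod name, label, and port values are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: general-predicates-demo    # hypothetical name
spec:
  nodeSelector:
    disktype: ssd                  # PodMatchNodeSelector: the node must carry this label
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080               # PodFitsHostPorts: port 8080 must be free on the node

A node whose port 8080 is already taken, or that lacks the disktype=ssd label, would be filtered out for this Pod during the Predicates phase.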

The second type consists of Volume-related filtering rules.

This group of rules handles scheduling policies related to a container's persistent Volumes. Among them, NoDiskConflict checks whether the persistent Volumes declared by multiple Pods conflict. For example, an AWS EBS Volume may not be used by two Pods at the same time, so once an EBS Volume named A has been mounted on a node, another Pod that also declares Volume A cannot be scheduled to that node. MaxPDVolumeCountPredicate checks whether the number of persistent Volumes of a certain type on a node has exceeded a set limit; if so, Pods declaring that type of persistent Volume can no longer be scheduled there. VolumeZonePredicate checks whether the Zone (availability zone) label of a persistent Volume matches the Zone label of the node under examination.

In addition, there is a rule called VolumeBindingPredicate. It checks whether the nodeAffinity field of the PV used by the Pod matches the labels of the node under examination. A Local Persistent Volume must use nodeAffinity to bind to a specific node, which means that during the Predicates phase Kubernetes must be able to schedule based on the Pod's Volume attributes. Furthermore, if the Pod's PVC has not yet been bound to a specific PV, the scheduler is also responsible for checking all candidate PVs: when an available PV exists whose nodeAffinity is consistent with the node under examination, this rule returns "success". For example:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/vol1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - my-node

It can be seen that the persistent directory corresponding to this PV will only appear on the host named my-node. Therefore, any Pod that uses this PV through a PVC must be scheduled to my-node to function properly. The VolumeBindingPredicate is exactly where the scheduler makes this decision.
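For reference, a hedged sketch of a PVC that could bind to this PV (the claim name is hypothetical; the storageClassName must match the PV's local-storage class, which typically uses volumeBindingMode: WaitForFirstConsumer so that binding is delayed until scheduling time):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-local-claim        # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce                  # compatible with the PV's access mode
  storageClassName: local-storage  # must match the PV above
  resources:
    requests:
      storage: 100Gi               # must not exceed the PV's 500Gi capacity

A Pod that uses this claim can only be placed on my-node, and VolumeBindingPredicate is what enforces that during scheduling.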

The third type consists of rules related to the host (Node) itself.

This group of rules mainly checks whether the Pod to be scheduled satisfies certain conditions of the Node itself. For example, PodToleratesNodeTaints checks the Node "taint" mechanism we have used before: a Pod can be scheduled onto a Node only if the Pod's Toleration field matches the Node's Taint field. NodeMemoryPressurePredicate checks whether the Node is already under memory pressure; if so, the Pod waiting to be scheduled cannot be placed on that Node.
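As a concrete illustration, here is a minimal sketch assuming a node has been tainted with kubectl taint nodes my-node key=value:NoSchedule; the Pod below carries a matching toleration and would therefore pass PodToleratesNodeTaints for that node:

apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo            # hypothetical name
spec:
  tolerations:
  - key: "key"                     # matches the taint's key
    operator: "Equal"
    value: "value"                 # matches the taint's value
    effect: "NoSchedule"           # matches the taint's effect
  containers:
  - name: app
    image: nginx

A Pod without this toleration would be filtered away from my-node at this step.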

The fourth type consists of Pod-related rules.

This group largely overlaps with the GeneralPredicates. The special one is PodAffinityPredicate, which checks the affinity and anti-affinity relationships between the Pod to be scheduled and the Pods already running on the Node. For example:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S2
        topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod
In this example, the podAntiAffinity rule specifies that this Pod must not run on the same Node as any Pod carrying the security=S2 label. Note that PodAffinityPredicate has a scope: the rule above only takes effect within topology domains defined by the kubernetes.io/hostname Node label, i.e. it is evaluated per host. That is the role of the topologyKey field.

These four groups of Predicates make up the basic strategy the scheduler uses to decide whether a Node can run the Pod to be scheduled.

When a Pod is scheduled, the Kubernetes scheduler starts 16 Goroutines to evaluate the Predicates for all Nodes in the cluster concurrently, and finally returns the list of hosts that can run this Pod. Note that for each Node, the scheduler checks the Predicates in a fixed order, determined by the meaning of the Predicates themselves. For example, host-related Predicates are checked earlier; otherwise, computing PodAffinityPredicate on a host with severely insufficient resources would be pointless.
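For comparison, inter-Pod anti-affinity also has a soft form, preferredDuringSchedulingIgnoredDuringExecution, where each term carries a weight that influences scoring (via InterPodAffinityPriority, discussed below) rather than acting as a hard filter. A minimal sketch of the same rule in its soft form (the Pod name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: with-preferred-pod-antiaffinity   # hypothetical name
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod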

Priorities

After the "filtering" of nodes in the Predicates phase, the work of the Priorities phase is to score these nodes. The scoring range here is 0-10 points, and the node with the highest score is the best node to which the Pod is finally bound.The most commonly used scoring rule in Priorities is LeastRequestedPriority. This algorithm actually selects the host with the most idle resources (CPU and Memory).In addition, there are three other Priorities: NodeAffinityPriority, TaintTolerationPriority, and InterPodAffinityPriority. As the names suggest, they are similar in meaning and calculation method to the previous three Predicates: PodMatchNodeSelector, PodToleratesNodeTaints, and PodAffinityPredicate. However, as a Priority, the more fields a Node meets the above rules, the higher its score will be.In the default Priorities, there is also a strategy called ImageLocalityPriority. It is a new scheduling rule enabled in Kubernetes v1.12, that is: if the image needed by the Pod to be scheduled is large and already exists on some Nodes, then these Nodes will have a higher score.Of course, to avoid the algorithm causing scheduling stacking, the scheduler will also optimize according to the distribution of the image when calculating the score, that is: if the number of nodes where the large image is distributed is very few, then the weight of these nodes will be reduced, thereby "offsetting" the risk of causing scheduling stacking.In summary, this is the main working principle of the default scheduling rules in Kubernetes' scheduler.In the actual execution process, the information about the cluster and Pods in the scheduler has been cached, so the execution process of these algorithms is relatively fast.In addition, for more complex scheduling algorithms, such as PodAffinityPredicate, they not only pay attention to the Pod to be scheduled and the Node under examination when calculating, but also need to pay attention to the information of the entire cluster, such as traversing all nodes and reading their Labels. At this time, the Kubernetes scheduler will first calculate the cluster information needed by the algorithm in advance before executing the scheduling algorithm for each Pod to be scheduled, and then cache it. In this way, when actually executing the algorithm, the scheduler only needs to read the cached information for calculation, thus avoiding the need to repeatedly retrieve and calculate the entire cluster's information for each Node when calculating Predicates.

Summary

In summary, this article has covered the main scheduling algorithms in Kubernetes' default scheduler. Note that beyond the rules discussed here, the scheduler also contains some strategies that are not enabled by default. You can point kube-scheduler at a configuration file, or create a ConfigMap, to choose which rules to enable or disable, and you can further shape the scheduler's behavior by assigning weights to the Priorities. As discussed above, the scheduler stays fast by caching cluster and Pod information and, for the more complex algorithms that need cluster-wide state, by pre-computing and caching that state before the scheduling algorithm runs for each Pod.
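As an illustration of that configuration knob, here is a hedged sketch of the legacy scheduler Policy file format; the predicate and priority names and weights below are examples, and the exact set available depends on your Kubernetes version:

apiVersion: v1
kind: Policy
predicates:                        # which filtering rules to enable
- name: PodFitsHostPorts
- name: PodFitsResources
- name: NoDiskConflict
- name: MatchNodeSelector
- name: PodToleratesNodeTaints
priorities:                        # which scoring rules to enable, and their weights
- name: LeastRequestedPriority
  weight: 1
- name: NodeAffinityPriority
  weight: 1
- name: ImageLocalityPriority
  weight: 2

Raising a priority's weight makes its score count for more when the per-node scores are summed.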

Novita AI is the all-in-one cloud platform that empowers your AI ambitions. With seamlessly integrated APIs, serverless computing, and GPU acceleration, we provide the cost-effective tools you need to rapidly build and scale your AI-driven business. Eliminate infrastructure headaches and get started for free - Novita AI makes your AI dreams a reality.
Recommended Reading:
  1. What You Need to Know About Docker
  2. Why We Need Pods