Improve Cluster Balance with the CPD Scheduler — Part 1

Junfeng Liu
IBM Data Science in Practice
7 min read · Aug 23, 2023


The default Kubernetes (“k8s”) scheduler can be thought of as a sort of “greedy” scheduler: it makes a locally optimal choice for each pod without considering overall cluster balance. This can leave one or more nodes overloaded while others sit underutilized, causing performance problems and even outages. The problem frequently goes unnoticed until services hit a wall due to limited resource availability on the overworked worker node.

Previous fixes had limitations. A guide was published in the IBM Knowledge Center to tune default scheduler behavior. While it provided some improvements, it did not fundamentally resolve the issue.

Identifying the Root Cause

To identify the root cause of the issue, we analyzed the default scheduler code. It became apparent that the default Kubernetes scheduler's scoring algorithms were the culprit. Some typical examples of scoring plugins that produce suboptimal outcomes:

NodeResourcesBalancedAllocation
Nodes are scored by the variance of their cpu/memory/volume usage fractions, where usage is sum(requested)/capacity. Instead of prioritizing the node with the most free capacity, the node whose usage fractions are closest to one another is preferred. This frequently exacerbates cluster imbalance.
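For illustration, the sketch below scores two hypothetical nodes by how close their CPU and memory request fractions are to each other. It is a simplified stand-in, not the upstream NodeResourcesBalancedAllocation code, but it shows why the busier node can win:

```go
// balanced_allocation_sketch.go
//
// Simplified illustration of a variance-style "balanced allocation" score.
// The smaller the spread between the CPU and memory request fractions, the
// higher the score. Not the upstream NodeResourcesBalancedAllocation code.
package main

import (
	"fmt"
	"math"
)

// balancedScore maps the spread between the two usage fractions to a 0-100 score.
func balancedScore(cpuFrac, memFrac float64) float64 {
	spread := math.Abs(cpuFrac - memFrac)
	return (1 - spread) * 100
}

func main() {
	// Node A: lightly loaded, but its CPU and memory fractions differ a lot.
	fmt.Printf("node A (10%% cpu, 40%% mem requested): %.0f\n", balancedScore(0.10, 0.40)) // 70
	// Node B: heavily loaded, but its fractions are nearly equal.
	fmt.Printf("node B (80%% cpu, 85%% mem requested): %.0f\n", balancedScore(0.80, 0.85)) // 95
	// Node B scores higher even though it has far less free capacity.
}
```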

NodeResourcesLeastAllocated/NodeResourceFit
Nodes are sorted by their free cpu/memory/volume; the node with more free resources is preferred.
The algorithm is roughly (cpu_score + memory_score) / weightSum, where each per-resource score is (capacity - sum(requested)) * MaxNodeScore / capacity. Because the per-resource scores are averaged, this hides the case where the workload is dominated by one resource: a node that is nearly out of CPU can still score well if its memory is mostly free.
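As a rough sketch of that averaging (assuming equal weights for CPU and memory, and again not the actual plugin code), the snippet below shows how a node that is almost out of CPU can look nearly as good as a half-empty one:

```go
// least_allocated_sketch.go
//
// Rough sketch of the (cpu_score + memory_score) / weightSum averaging
// described above, assuming equal weights for CPU and memory. A node that is
// almost out of CPU can still score well when its memory is mostly free,
// which masks the dominant resource of a CPU-heavy workload.
package main

import "fmt"

const maxNodeScore = 100

// freeScore = (capacity - sum(requested)) * MaxNodeScore / capacity
func freeScore(requested, capacity float64) float64 {
	return (capacity - requested) * maxNodeScore / capacity
}

func leastAllocated(cpuReq, cpuCap, memReq, memCap float64) float64 {
	return (freeScore(cpuReq, cpuCap) + freeScore(memReq, memCap)) / 2
}

func main() {
	// Node A: ~90% of its 44 CPU cores already requested, memory mostly free.
	fmt.Printf("node A: %.0f\n", leastAllocated(39.6, 44, 16, 128)) // ~49
	// Node B: both CPU and memory about half requested.
	fmt.Printf("node B: %.0f\n", leastAllocated(22, 44, 64, 128)) // 50
	// For a CPU-heavy workload node A is a poor choice, yet the averaged
	// scores make the two nodes look almost interchangeable.
}
```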

PodTopologySpread
The pod topology spread policy has a default policy for deployments and stateful sets. The k8s documentation on the built-in constraints is here. The goal of this policy is to put multiple replicas of the same deployment or stateful set on different nodes and in different zones. Unlike manually configured pod topology constraints, nodes don't need to have the zone label for the built-in constraints to take effect. These constraints have higher precedence than the CPD scheduler's balancing policy, so they can sometimes conflict with LessCPURequest. Take, for example, the case where 2 nodes have different available cpu requests but one pod of a deployment is already on the node with more available cpu requests. Since HA takes precedence over balanced cpu requests, the CPD scheduler will still put the second pod on the other node. However, the user has the option to disable the built-in constraints. Details on disabling the default constraints can be found in the k8s documentation.

Introducing CPD Scheduler

To address these issues, we developed the CPD scheduler with a more “balanced” approach. It takes into account the amount of free resources on each node, as well as the resource requirements of the pods that are being scheduled. This helps to ensure that pods are evenly distributed across the cluster, which minimizes the risk of imbalance.

One way to think about the difference between the two schedulers is to imagine a group of people trying to get on a bus. The default scheduler would be like a person who always tries to get on the bus that is already the most crowded. The CPD scheduler, on the other hand, would be like the person who heads for the bus that is the least crowded.

How it Works

The CPD scheduler uses a new scoring algorithm that weighs each node's free resources against the resource requests of the pods being scheduled, so that pods are spread evenly across the cluster rather than piling onto a few nodes.

In addition to the new scoring algorithm, the CPD scheduler includes a number of other features that help improve cluster balancing. For example, it can be configured to prioritize pods that run CPU-intensive workloads, or pods that belong to a specific service, which helps ensure those pods are not scheduled on overloaded nodes.

The CPD scheduler has been shown to be effective at improving cluster balance. In the tests described below, it significantly reduced the imbalance in a cluster, which improved performance and reduced the risk of outages.

New Score Function

The k8s scheduler plugin framework supports a Score function for each plugin. A Score function returns a value in the range 0 to 100, and each plugin has a weight (a positive integer). The value returned by the Score function is dynamic, while the weight stays static for the lifetime of the scheduler.

Pods are scheduled one by one. When selecting a node for a pod, each plugin's Score function returns a value for every candidate node; multiplying that value by the plugin's weight gives the plugin's score for the node. Adding up the scores from all plugins gives the node's final score, and the node with the highest final score is selected to run the pod.
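As a minimal sketch of that selection loop, with hypothetical plugin names, weights, and scores:

```go
// framework_scoring_sketch.go
//
// Minimal sketch of the node-selection step described above: every plugin
// returns a 0-100 score per node, each score is multiplied by the plugin's
// static weight, the weighted scores are summed per node, and the node with
// the highest total wins. Plugin names, weights, and scores are hypothetical.
package main

import "fmt"

type plugin struct {
	name   string
	weight int64
	scores map[string]int64 // node name -> 0-100 score
}

func selectNode(nodes []string, plugins []plugin) string {
	best, bestTotal := "", int64(-1)
	for _, node := range nodes {
		var total int64
		for _, p := range plugins {
			total += p.scores[node] * p.weight
		}
		fmt.Printf("%s: total %d\n", node, total)
		if total > bestTotal {
			best, bestTotal = node, total
		}
	}
	return best
}

func main() {
	nodes := []string{"worker1", "worker2"}
	plugins := []plugin{
		{name: "PluginA", weight: 2, scores: map[string]int64{"worker1": 60, "worker2": 50}},
		{name: "PluginB", weight: 1, scores: map[string]int64{"worker1": 10, "worker2": 90}},
	}
	// worker1: 2*60 + 1*10 = 130; worker2: 2*50 + 1*90 = 190 -> worker2 is selected.
	fmt.Println("selected:", selectNode(nodes, plugins))
}
```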

The CPD scheduler introduces a set of new Score functions, and a parameter named nodePreference is added to the k8s configmap to control their return values. Valid settings for this parameter are:

LessGPURequest
LessCPURequest
LessMemRequest
LessGPULimit
LessCPULimit
LessMemLimit
MoreGPURequest
MoreCPURequest
MoreMemRequest
MoreGPULimit
MoreCPULimit
MoreMemLimit

The out-of-the-box setting is nodePreference: LessGPURequest.

Users can tune this behaviour if they want to balance the cluster along a different resource dimension. For example, if memory requests matter more for their workload, they can use the LessMemRequest policy.

With LessCPURequest configured, the scheduler takes the node's total CPU (AllCPUs), which consists of free CPU plus allocated CPU (AllocatedCPURequests, the sum of CPU requests from all pods on the node), and scores the node by its free CPU capacity:

Score = (AllCPUs - AllocatedCPURequests) / AllCPUs

LessCPURequest lets the scheduler find the node with free capacity for the pod, which results in a balanced cluster in terms of CPU request allocation.
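To make the arithmetic concrete, here is a small sketch of that score scaled to the framework's 0-100 range, using hypothetical 44-core workers like those in the test environment below; the More* preferences would simply invert it. This follows the published formula, not the CPD scheduler's actual source:

```go
// less_cpu_request_sketch.go
//
// Illustration of the LessCPURequest score described above, scaled to the
// framework's 0-100 score range. The More* preferences would invert it.
// Follows the published formula, not the CPD scheduler's source code.
package main

import "fmt"

// lessCPURequestScore implements Score = (AllCPUs - AllocatedCPURequests) / AllCPUs,
// scaled to 0-100.
func lessCPURequestScore(allCPUs, allocatedCPURequests float64) float64 {
	return (allCPUs - allocatedCPURequests) / allCPUs * 100
}

func main() {
	// Two hypothetical 44-core workers with different amounts of CPU already requested.
	fmt.Printf("worker1 (30 cores requested): %.0f\n", lessCPURequestScore(44, 30)) // ~32
	fmt.Printf("worker2 (10 cores requested): %.0f\n", lessCPURequestScore(44, 10)) // ~77
	// worker2 has more free CPU request capacity, so it scores higher and
	// receives the next pod, pulling the cluster back toward balance.
}
```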

Performance and Test Results

Test 1: Dynamic Workload

  • Watson Service Workload
    — Create a job from this notebook, using the spark3.2 py3.9 version
    — The job flow is as follows:
    > Step 1: Load Data as Spark Dataframe
    > Step 2: Merge Files (csv files)
    > Step 3: Simple Data Preparation — Rename some columns and ensure correct data types
    > Step 4: Data Exploration
    > Step 5: Build the Spark pipeline and the Random Forest model
    > Step 6: Score the test data set
    > Step 7: Model Evaluation
  • 20 concurrent users, so there will potentially be 20 notebook jobs, 20 jkg-deployments, and 20 spark-workers

Environment

  • 3 master nodes: 8 CPU cores, 32GB memory (each)
  • 5 worker nodes: 44 CPU cores, 128GB memory (each)
  • Total worker capacity: 220 CPU cores, 640GB memory
  • CP4D 4.7.0 DEV build from Feb 17, 2023

Results

  • All jobs completed successfully.
  • Detailed Prometheus metrics

Prometheus shows the effect of the custom scheduler's “least CPU requests” policy and shows that the load consumed the full cluster capacity. The free CPU request capacity per worker node was computed with this query:

(sum by (node) (kube_node_status_allocatable{node=~'worker.*', resource='cpu'})) - (sum by (node) (kube_pod_container_resource_requests{node=~'worker.*', resource='cpu'} * on (pod, namespace) group_left() (kube_pod_status_phase{phase=~'(Running)'} == 1)))

Default k8s scheduler

  • Cluster starts with nodes imbalanced from static pods
  • Dynamic pods started by jobs repeatedly run one node out of CPU Requests capacity

CPU allocation Per Node with Default K8s Scheduler

CPD Scheduler

Cluster starts with nodes imbalanced from static pods. Dynamic pods started by jobs are placed on the nodes with the most free capacity, never running a node out of CPU request capacity.

CPU allocation Per Node with CPD Scheduler

Side-by-side comparison

Same two tests as above. On the left is the run with the default k8s scheduler, and on the right is the run with the CPD scheduler's improved balancing.

CPU allocation Per Node comparison

Test Conclusion

The k8s default scheduler tends to make an unbalanced cluster worse. The node in purple is a typical example: the default scheduler keeps trying to use up all of that node's resources. The CPD scheduler, by contrast, prioritizes diverting workloads away from over-used nodes; each new pod is placed on a node that has more free cpu capacity.

Test 2: CPD scheduler for initial CPD installation

Key timepoints

  • 13:50 cluster provisioned and Prometheus metrics available, CPFS/CPD install starts
  • 14:30 CPD platform and Zen install done
  • 14:45 CPD Scheduler install done
  • 15:10–16:00 CCS, DR, and WSL installs run
  • 16:15 one worker cordoned and drained
  • 16:25–15:00 all Zen namespace pods manually restarted

CPU allocation Per Node during installation

Co-existing with other scheduler plugins

The out-of-the-box configuration of the CPD scheduler co-exists with other scheduler plugins to ensure those policies are not ignored after the user enables LessCPURequest:

score:
  enabled:
  - name: TaintToleration
    weight: 300
  - name: NodeAffinity
    weight: 200
  - name: PodTopologySpread
    weight: 200
  - name: InterPodAffinity
    weight: 200
  - name: ImageLocality
    weight: 100
  - name: Hpac
    weight: 1
  disabled:
  - name: TaintToleration
  - name: NodeAffinity
  - name: NodeResourcesFit
  - name: PodTopologySpread
  - name: InterPodAffinity
  - name: NodeResourcesBalancedAllocation
  - name: ImageLocality

This configuration disables the NodeResourcesBalancedAllocation and NodeResourcesFit plugins, whose default algorithms do not fit the product cluster and conflict with the balancing algorithm. It also raises the weights of TaintToleration, NodeAffinity, PodTopologySpread, InterPodAffinity, and ImageLocality, so that a balancing score such as LessCPURequest is only factored in after those affinity settings. This approach lets us honour the affinity settings while still taking cluster balance into account.
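As a rough worked example with made-up scores: if two nodes both fully satisfy a pod's tolerations and node affinity, TaintToleration and NodeAffinity give each node 100 points, or 300*100 + 200*100 = 50,000 after weighting, and the Hpac score (weight 1) decides between them, sending the pod to the node with more free CPU requests. If instead NodeAffinity prefers the busier node by even a few points, its weight of 200 outweighs anything Hpac can contribute (at most 100*1 = 100 points), which is exactly how the affinity settings keep precedence over balancing.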

Conclusion

Overall, the CPD scheduler is a valuable tool for improving cluster balancing. It is easy to use and can be configured to meet the specific needs of a cluster. If you are experiencing problems with cluster imbalance, the CPD scheduler is a great option to consider.

Part 2 covers more comprehensive test cases.

Acknowledgments

The author would like to thank colleagues Jun Zhu and Yongli An, who helped deliver the test results in this report, and Michael Closson for supporting us in this effort.
