gpu-cluster-auto configurations

The cluster has two node pools.

basic pool containing cpu nodes

The node pool size is set to 1 with autoscaling disabled. Only one node runs in this node pool because neither additional nodes nor autoscaling is required for it.

The New Relic system also runs on the single CPU node in this node pool.

A Deployment of the k8-event-streamer service runs on this node pool with max replicas set to 1, meaning Kubernetes ensures that one pod with Ready status is running for this Deployment at all times.
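As a rough illustration of the single-replica setup described above, the following sketch builds a Deployment manifest as a Python dictionary. The image reference and the node selector label are hypothetical placeholders, not values taken from the actual cluster; only the name k8-event-streamer and the replica count of 1 come from this page.

```python
import json


def event_streamer_deployment(image: str, node_selector: dict) -> dict:
    """Build a Deployment manifest that keeps one k8-event-streamer pod running."""
    labels = {"app": "k8-event-streamer"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "k8-event-streamer"},
        "spec": {
            # One replica: Kubernetes replaces the pod if it stops being available.
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Pins the pod to the basic pool (the label is an assumption).
                    "nodeSelector": node_selector,
                    "containers": [
                        {"name": "k8-event-streamer", "image": image}
                    ],
                },
            },
        },
    }


deployment = event_streamer_deployment(
    image="example.invalid/k8-event-streamer:latest",  # hypothetical image
    node_selector={"pool": "basic"},  # hypothetical node label
)
print(json.dumps(deployment, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be applied with kubectl apply -f once the placeholders are replaced with real values.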

A DaemonSet of the cpu-image-puller pod runs on this node pool. This DaemonSet caches the training image on the CPU node so that it can be used for image streaming when a custom job for the GPU nodes arrives. This allows the GPU nodes to scale down to 0 without increasing pod creation times for training-image pods.

Keep in mind that the container of this DaemonSet pod goes to sleep for 26 minutes once it has been created and the training image has been pulled onto the node.

When the sleep finishes, the pod restarts, and it keeps restarting indefinitely because it is managed by a DaemonSet.

The training image is downloaded by the first run; subsequent runs of the cpu-image-puller pod use the cached image.

Since the pod goes to sleep after starting, it only consumes node resources the first time it runs, when it downloads and caches the training image. On each subsequent run it consumes practically no node resources.
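The image-puller behaviour described above can be sketched as a DaemonSet whose only container uses the training image and then sleeps. This is a minimal sketch, not the actual manifest: the image reference, labels, and node selector are hypothetical, and it assumes the training image ships a sleep binary. Only the name cpu-image-puller and the 26-minute sleep come from this page.

```python
import json

SLEEP_MINUTES = 26  # sleep duration stated on this page


def image_puller_daemonset(training_image: str, node_selector: dict) -> dict:
    """Build a DaemonSet manifest that pulls the training image and then sleeps."""
    labels = {"app": "cpu-image-puller"}  # hypothetical label
    sleep_seconds = SLEEP_MINUTES * 60
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "cpu-image-puller"},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Restricts the pod to the CPU node (the label is an assumption).
                    "nodeSelector": node_selector,
                    "containers": [
                        {
                            "name": "puller",
                            # Starting a container from this image makes the
                            # node pull and cache it.
                            "image": training_image,
                            # Assumes the image contains a sleep binary.
                            "command": ["sleep", str(sleep_seconds)],
                        }
                    ],
                },
            },
        },
    }


puller = image_puller_daemonset(
    training_image="example.invalid/training-image:latest",  # hypothetical image
    node_selector={"pool": "basic"},  # hypothetical node label
)
print(json.dumps(puller, indent=2))
```

The design choice here is that the container does no work of its own: the pull happens as a side effect of starting the container, and the sleep only keeps the pod from restarting more often than needed.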

default pool containing gpu nodes

Autoscaling is enabled, and the node pool size can range from 0 to 110 nodes.

All nodes in this node pool carry the taint temporary=true:NoSchedule (key: temporary, value: true, effect: NoSchedule). This taint ensures that no system pods such as kube-dns are scheduled on these nodes, since such pods could keep a GPU node up even when it is not running any training or inference workloads.

Only the pods of our custom jobs have a toleration for the above taint, so only they can run on these nodes.
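To make the interaction between the taint and the toleration concrete, the sketch below reimplements, in simplified form, the matching rule Kubernetes applies for NoSchedule taints: a pod can be scheduled on a node only if every NoSchedule taint on that node is matched by one of the pod's tolerations. The helper names tolerates and schedulable are made up for this illustration, and the real scheduler considers many other factors.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if not key:
        # An empty key is only meaningful with Exists and matches every key.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True  # the value is ignored
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """Return True if every NoSchedule taint is tolerated by the pod."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


# The taint described on this page.
NODE_TAINT = {"key": "temporary", "value": "true", "effect": "NoSchedule"}

# Toleration a custom job pod would need in order to match that taint.
JOB_TOLERATION = {
    "key": "temporary",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}

print(schedulable([JOB_TOLERATION], [NODE_TAINT]))  # custom job pod
print(schedulable([], [NODE_TAINT]))  # pod without any toleration
```

A pod without a matching toleration, such as an ordinary system Deployment pod, fails the check and is kept off the node.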

Note: the above taint does not stop system DaemonSets from running. These are Kubernetes system DaemonSets that run pods necessary for each node in the cluster. No custom DaemonSets or Deployments are configured for this node pool.
