
Kubernetes cluster customizations

/Pulling the training image on a CPU node to be used for image streaming
In the kube-client repo, run kubectl apply -f image-prepull-job.yaml (with the appropriate node selector configured in the YAML file) to run an image pre-pull job on a CPU node. This caches the image on the CPU node so that image streaming can quickly deploy it to a new GPU node; a sketch of what such a manifest might look like appears below.
This removes the need to keep a GPU node scaled up constantly, reducing cluster costs.
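A minimal sketch of image-prepull-job.yaml, assuming the training image lives at a registry path like gcr.io/<project>/training:latest and the CPU pool carries the label cloud.google.com/gke-nodepool: basic-pool (both hypothetical; substitute your actual image and node-pool label):

apiVersion: batch/v1
kind: Job
metadata:
  name: image-prepull
spec:
  template:
    spec:
      # Hypothetical node selector; match your CPU (basic) pool's labels.
      nodeSelector:
        cloud.google.com/gke-nodepool: basic-pool
      containers:
        - name: prepull
          # Hypothetical image path; use your training image.
          image: gcr.io/<project>/training:latest
          # Exit immediately; the only goal is to pull the image onto the node.
          command: ["sh", "-c", "true"]
      restartPolicy: Never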
/Creating a DaemonSet that pre-pulls the training image when a CPU node is created
In the kube-client repo:
kubectl apply -f image-prepull-daemonset.yml
The above command creates a DaemonSet on the cluster that pulls the training image onto every new node created in the basic pool (the CPU node pools), automating image pulling on CPU nodes.
To check existing DaemonSets: kubectl get daemonsets
The pre-pull pod pulls the training image and then sleeps for 26 minutes.
Because the pod is run through a DaemonSet, it restarts after finishing. This is safe and does not consume unnecessary resources: after the first run, every restart uses the cached image and goes back to sleep right after container initialization, consuming essentially nothing while sleeping. A sketch of such a manifest follows.
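A minimal sketch of image-prepull-daemonset.yml under the same assumptions (hypothetical image path and node-pool label; the 26-minute sleep is 1560 seconds, and the DaemonSet's default restartPolicy of Always gives the restart-after-finishing behavior described above):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      # Hypothetical label; match your basic (CPU) pool.
      nodeSelector:
        cloud.google.com/gke-nodepool: basic-pool
      containers:
        - name: prepull
          # Hypothetical image path; use your training image.
          image: gcr.io/<project>/training:latest
          # Pulling the image is the real work; then sleep for 26 minutes.
          command: ["sh", "-c", "sleep 1560"]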
/Adding anti-affinity rules to pods to enforce a one-to-one correspondence of pods to nodes:
We can create affinity/anti-affinity rules to control how pods select the nodes they run on.
In our case we specified that only a single custom pod can run on a given GPU node: an anti-affinity rule checks whether a pod with the same label is already running on that node.
from kubernetes import client as k8s_client

# `container` is the training container spec defined elsewhere;
# `toleration` is defined in the next section.
template = k8s_client.V1PodTemplateSpec(
    metadata=k8s_client.V1ObjectMeta(labels={"app": "ml"}),
    spec=k8s_client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        # Schedule only on nodes with a T4 GPU attached.
        node_selector={"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"},
        affinity=k8s_client.V1Affinity(
            pod_anti_affinity=k8s_client.V1PodAntiAffinity(
                required_during_scheduling_ignored_during_execution=[
                    k8s_client.V1PodAffinityTerm(
                        # Repel pods that carry the same app=ml label.
                        label_selector=k8s_client.V1LabelSelector(
                            match_expressions=[
                                k8s_client.V1LabelSelectorRequirement(
                                    key="app",
                                    operator="In",
                                    values=["ml"],
                                )
                            ]
                        ),
                        # Each node is its own topology domain.
                        topology_key="kubernetes.io/hostname",
                    )
                ]
            )
        ),
        tolerations=[toleration],
    )
)
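Because topology_key is kubernetes.io/hostname, each node is its own topology domain, so the required anti-affinity term prevents two pods labeled app=ml from ever landing on the same node; combined with the node selector, this yields exactly one training pod per GPU node.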
/Specifying a purpose for node pools and disallowing system pods to run on them:
Problem statement: GKE would schedule system pods on GPU nodes, causing those nodes to stay up instead of scaling down and adding cost.
Solution:
We added a taint to the GPU node pool so that every node in it carries the taint.
The taint tells GKE not to schedule any system pods on these GPU nodes, solving the issue of nodes not scaling down.
We added tolerations against the node taint to the custom jobs, allowing only those purpose-built jobs to run on the GPU nodes.
Node pool taint command:
gcloud container node-pools create default-pool \
  --cluster=gpu-cluster-auto \
  --zone=us-central1-f \
  --node-taints=temporary=true:NoSchedule \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
  --scopes="https://www.googleapis.com/auth/devstorage.full_control","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/trace.append" \
  --num-nodes=0 --min-nodes=0 --max-nodes=10
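To confirm the taint is present, you can describe a node from the pool (node name hypothetical): kubectl describe node <node-name> | grep Taints, which should show temporary=true:NoSchedule.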
Code for adding the matching toleration to custom jobs:

# Matches the temporary=true:NoSchedule taint on the GPU pool.
toleration = k8s_client.V1Toleration(
    key="temporary",
    operator="Equal",
    value="true",
    effect="NoSchedule",
)

The toleration is then attached to the pod's spec via tolerations=[toleration], exactly as in the V1PodTemplateSpec shown in the anti-affinity section above.
