Using GPU Workload in Welkin¶

For Welkin Managed Customers

You can order a new Environment with GPU support by filing a service ticket. Make sure to specify the need for GPU Nodes in "Additional information or comments". If you are unsure, get in touch with your account manager.

As the demand for AI, machine learning, and data science workloads grows, Kubernetes provides a flexible and scalable platform to manage these applications. In this guide, we'll focus on how to use GPU in the Welkin platform.

Note

Not all infrastructure providers have support for GPU. Check with the platform administrator to find out if your environment has support for GPU workload.

Deployment¶

To use GPU resources in your Cluster, you need to create a Deployment that is using the resource nvidia.com/gpu. Here's an example of how to configure GPU resources for a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1

Managing Node Types for GPU Workloads¶

GPU Nodes are essential for certain high-performance workloads, such as machine learning or data processing. Clusters often host a variety of workloads, some of which require GPU Nodes, while others can run on standard CPU Nodes. To optimize resource usage and prevent unnecessary costs—since GPU Nodes are significantly more expensive than CPU Nodes—we have introduced multiple Node types with corresponding labels and taints. This ensures workloads are deployed on appropriate Node types based on their requirements.

Node Labels and Taints¶

To enforce resource-specific scheduling, Nodes have been configured with the following labels and taints:

CPU Nodes:
- Label: elastisys.io/node-group=worker
- Taint: None
- These Nodes serve as the default option for general workloads and require no additional configuration in Pod specifications.
GPU Nodes:
- Labels:
  - elastisys.io/node-type=gpu
  - elastisys.io/node-group=gpu-worker
- Taint: elastisys.io/node-type=gpu:NoSchedule
- These Nodes are reserved exclusively for workloads requiring GPU resources. To schedule Pods on GPU Nodes, specific Node affinity and toleration must be configured in the Pod definition.

By using these labels and taints, we ensure that:

General workloads remain on CPU Nodes by default.
Only workloads explicitly configured to use GPUs are scheduled on GPU Nodes.

For Welkin Enterprise Customers

Elastisys can help you ensure that your GPU Nodes are labeled and tainted as described in this page. This prevents workloads from unintentionally consuming GPU resources, thereby avoiding unnecessary expenses.

How to Configure GPU Workloads¶

Application developers who need to deploy workloads on GPU Nodes must include the appropriate affinity and toleration settings in the Pod specification. Below is an example configuration:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
            - key: elastisys.io/node-type
              operator: In
              values:
              - gpu
  tolerations:
    - effect: NoSchedule
      key: elastisys.io/node-type
      operator: Equal
      value: gpu

This configuration ensures the workload is scheduled only on GPU Nodes.

Advanced example Scenario¶

Let’s imagine you have a workload that specifically requires a GPU flavor called Standard_B2s. You want to ensure this workload is scheduled exclusively on Nodes with this GPU type. Here’s how you can achieve this:

Ask your platform administrator, or if you're an Elastisys Managed Service Customer you can file a service ticket here and ask to add a new GPU flavor Standard_B2s and a label to that GPU Node like node.kubernetes.io/instance-type=Standard_B2s
Modify the workload manifest file to include the additional label. The updated manifest would look like this:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
            - key: elastisys.io/node-type
              operator: In
              values:
              - gpu
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
              - Standard_B2s
  tolerations:
    - effect: NoSchedule
      key: elastisys.io/node-type
      operator: Equal
      value: gpu

This configuration ensures that the workload is only scheduled on GPU Nodes labeled with both elastisys.io/node-type=gpu and node.Kubernetes.io/instance-type=Standard_B2s.

Note

If your Cluster is using the Cluster autoscaling feature and there's currently not enough resources, the autoscaler will create one for you. It might take a couple of minutes for the new Node to join the Cluster and to install all the pre-requisites.