docs(amazon-case-study): Add Amazon EKS and P6e-GB200 UltraServers Case Study Documentation #43
Conversation
…se Study Documentation Signed-off-by: JiGuoDing <485204300@qq.com>
… Case Study Documentation Signed-off-by: JiGuoDing <485204300@qq.com>
Pull request overview
Adds an Amazon EKS case study page describing how to run workloads on P6e-GB200 UltraServers, including required cluster/node components and an end-to-end installation/validation procedure.
Changes:
- Adds a new Amazon case study page under `docs/` and its `versioned_docs` copies for v0.9 and v1.0.
- Documents recommended software components (GPU Operator, NVIDIA DRA driver, EFA plugin) and provides Helm-based install steps.
- Includes a sample MPIJob/ComputeDomain manifest to validate IMEX over multi-node NVLink.
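For context, the ComputeDomain half of such a validation manifest might look like the following sketch, based on the NVIDIA DRA GPU driver's published nvbandwidth example (the metadata name and claim-template name here are illustrative assumptions, not taken from this PR):

```yaml
# Sketch only: field names follow the NVIDIA DRA driver's ComputeDomain CRD;
# the resource names are illustrative.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2                # one IMEX daemon per node in the NVLink domain
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel
```

Worker pods then reference the generated resource claim template to join the IMEX domain.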
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| docs/case-study/amazon-case-study.md | New Amazon EKS + P6e-GB200 UltraServers case study documentation and installation steps |
| versioned_docs/version-v0.9/case-study/amazon-case-study.md | Versioned copy of the same case study content for v0.9 docs |
| versioned_docs/version-v1.0/case-study/amazon-case-study.md | Versioned copy of the same case study content for v1.0 docs |
```yaml
name: mpi-worker
securityContext:
  runAsUser: 1000
env:
```
Copilot AI (Feb 9, 2026):
The YAML example contains an env: key with no value/list. This will produce invalid Kubernetes YAML for the container spec (env must be a list). Remove the empty env: line or provide the intended environment variables.
Suggested change: delete the empty `env:` line.
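Alternatively, if environment variables were intended, a minimal valid fragment would give `env:` a list value. The variable below is a placeholder for illustration, not taken from the PR:

```yaml
name: mpi-worker
securityContext:
  runAsUser: 1000
env:                      # must be a list when present
  - name: EXAMPLE_VAR     # placeholder; replace with the intended variables
    value: "example"
```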
```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
        - key: nvbandwidth-test-replica
          operator: In
          values:
            - mpi-worker
    topologyKey: nvidia.com/gpu.clique
```
Copilot AI (Feb 9, 2026):
The Worker pods use required podAffinity against the same nvbandwidth-test-replica=mpi-worker label. With replicas=2, this can deadlock scheduling because the first worker cannot schedule until another worker already exists in the target topology domain. Use preferred podAffinity (or nodeAffinity/topologySpreadConstraints keyed on nvidia.com/gpu.clique) to ensure the job can start.
Suggested change (replace the `required` term with a weighted `preferred` term):

```yaml
preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
          - key: nvbandwidth-test-replica
            operator: In
            values:
              - mpi-worker
      topologyKey: nvidia.com/gpu.clique
```
> 2. Install the NVIDIA DRA operator for your cluster using the dra-values.yaml file you created in the previous step.

```shell
helm repo add eks https://aws.github.io/eks-charts
helm repo update
```

```shell
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
  --version="0.5.14" \
  -f efa-values.yaml
```
Copilot AI (Feb 9, 2026):
This step text appears to be copied from the DRA section: it says to install the "NVIDIA DRA operator" and references "dra-values.yaml", but this section is for installing the EFA device plugin and the values file is "efa-values.yaml". Please update the step description/file reference to match the actual EFA plugin install commands below.
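A corrected version of that step, assuming the values file created in the preceding step is named efa-values.yaml (the commands themselves are reproduced unchanged from the PR):

> 2. Install the AWS EFA Kubernetes device plugin for your cluster using the efa-values.yaml file you created in the previous step.

```shell
# Register the AWS EKS chart repository and install the EFA device plugin
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
  --version="0.5.14" \
  -f efa-values.yaml
```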
Add Amazon EKS and P6e-GB200 UltraServers case study documentation for the Case Study page.