docs(amazon-case-study): Add Amazon EKS and P6e-GB200 UltraServers Case Study Documentation #43
Conversation
…se Study Documentation Signed-off-by: JiGuoDing <485204300@qq.com>
… Case Study Documentation Signed-off-by: JiGuoDing <485204300@qq.com>
Pull request overview
Adds an Amazon EKS case study page describing how to run workloads on P6e-GB200 UltraServers, including required cluster/node components and an end-to-end installation/validation procedure.
Changes:
- Adds a new Amazon case study page under `docs/` and its `versioned_docs` copies for v0.9 and v1.0.
- Documents recommended software components (GPU Operator, NVIDIA DRA driver, EFA plugin) and provides Helm-based install steps.
- Includes a sample MPIJob/ComputeDomain manifest to validate IMEX over multi-node NVLink.
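For context, the ComputeDomain half of such a validation manifest might look like the following sketch, based on the NVIDIA DRA GPU driver's published nvbandwidth example (the metadata name and claim-template name here are illustrative assumptions, not taken from this PR):

```yaml
# Sketch only: field names follow the NVIDIA DRA driver's ComputeDomain CRD;
# the resource names are illustrative.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2                # one IMEX daemon per node in the NVLink domain
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel
```

Worker pods then reference the generated resource claim template to join the IMEX domain.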
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| docs/case-study/amazon-case-study.md | New Amazon EKS + P6e-GB200 UltraServers case study documentation and installation steps |
| versioned_docs/version-v0.9/case-study/amazon-case-study.md | Versioned copy of the same case study content for v0.9 docs |
| versioned_docs/version-v1.0/case-study/amazon-case-study.md | Versioned copy of the same case study content for v1.0 docs |
```yaml
name: mpi-worker
securityContext:
  runAsUser: 1000
env:
```
Copilot AI (Feb 9, 2026):
The YAML example contains an env: key with no value/list. This will produce invalid Kubernetes YAML for the container spec (env must be a list). Remove the empty env: line or provide the intended environment variables.
Suggested change: delete the empty `env:` line.
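Alternatively, if environment variables were intended, a minimal valid fragment would give `env:` a list value. The variable below is a placeholder for illustration, not taken from the PR:

```yaml
name: mpi-worker
securityContext:
  runAsUser: 1000
env:                      # must be a list when present
  - name: EXAMPLE_VAR     # placeholder; replace with the intended variables
    value: "example"
```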
```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
        - key: nvbandwidth-test-replica
          operator: In
          values:
            - mpi-worker
    topologyKey: nvidia.com/gpu.clique
```
Copilot AI (Feb 9, 2026):
The Worker pods use required podAffinity against the same nvbandwidth-test-replica=mpi-worker label. With replicas=2, this can deadlock scheduling because the first worker cannot schedule until another worker already exists in the target topology domain. Use preferred podAffinity (or nodeAffinity/topologySpreadConstraints keyed on nvidia.com/gpu.clique) to ensure the job can start.
Suggested change (replace the `required` term with a weighted `preferred` term):

```yaml
preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
          - key: nvbandwidth-test-replica
            operator: In
            values:
              - mpi-worker
      topologyKey: nvidia.com/gpu.clique
```
> 2. Install the NVIDIA DRA operator for your cluster using the dra-values.yaml file you created in the previous step.

```shell
helm repo add eks https://aws.github.io/eks-charts
helm repo update
```

```shell
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
  --version="0.5.14" \
  -f efa-values.yaml
```
Copilot AI (Feb 9, 2026):
This step text appears to be copied from the DRA section: it says to install the "NVIDIA DRA operator" and references "dra-values.yaml", but this section is for installing the EFA device plugin and the values file is "efa-values.yaml". Please update the step description/file reference to match the actual EFA plugin install commands below.
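A corrected version of that step, assuming the values file created in the preceding step is named efa-values.yaml (the commands themselves are reproduced unchanged from the PR):

> 2. Install the AWS EFA Kubernetes device plugin for your cluster using the efa-values.yaml file you created in the previous step.

```shell
# Register the AWS EKS chart repository and install the EFA device plugin
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
  --version="0.5.14" \
  -f efa-values.yaml
```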
Add Amazon EKS and P6e-GB200 UltraServers case study documentation for the Case Study page.