@JiGuoDing
Contributor

Add case study documentation for Amazon EKS and P6e-GB200 UltraServers to the Case Study page.

…se Study Documentation

Signed-off-by: JiGuoDing <485204300@qq.com>
… Case Study Documentation

Signed-off-by: JiGuoDing <485204300@qq.com>
Copilot AI review requested due to automatic review settings February 9, 2026 08:20
Copilot AI left a comment

Pull request overview

Adds an Amazon EKS case study page describing how to run workloads on P6e-GB200 UltraServers, including required cluster/node components and an end-to-end installation/validation procedure.

Changes:

  • Adds a new Amazon case study page under docs/ and its versioned_docs copies for v0.9 and v1.0.
  • Documents recommended software components (GPU Operator, NVIDIA DRA driver, EFA plugin) and provides Helm-based install steps.
  • Includes a sample MPIJob/ComputeDomain manifest to validate IMEX over multi-node NVLink.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.

| File | Description |
| --- | --- |
| docs/case-study/amazon-case-study.md | New Amazon EKS + P6e-GB200 UltraServers case study documentation and installation steps |
| versioned_docs/version-v0.9/case-study/amazon-case-study.md | Versioned copy of the same case study content for v0.9 docs |
| versioned_docs/version-v1.0/case-study/amazon-case-study.md | Versioned copy of the same case study content for v1.0 docs |

name: mpi-worker
securityContext:
runAsUser: 1000
env:

Copilot AI Feb 9, 2026

The YAML example contains an env: key with no value/list. This will produce invalid Kubernetes YAML for the container spec (env must be a list). Remove the empty env: line or provide the intended environment variables.

Suggested change
env:
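
If environment variables are in fact intended for this container, the alternative to deleting the line is a populated list. A minimal sketch with indentation restored; `EXAMPLE_VAR` and its value are hypothetical placeholders, not taken from the PR:

```yaml
# Container fragment from the worker template; only the `env` entries are new.
- name: mpi-worker
  securityContext:
    runAsUser: 1000
  env:
    # Hypothetical placeholder: replace with the variables the job actually needs.
    - name: EXAMPLE_VAR
      value: "example"
```

Each `env` entry is a mapping with a `name` and either a `value` or a `valueFrom` source, so the key should either hold at least one such entry or be omitted.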

Comment on lines +347 to +354
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: nvbandwidth-test-replica
operator: In
values:
- mpi-worker
topologyKey: nvidia.com/gpu.clique

Copilot AI Feb 9, 2026

The Worker pods use required podAffinity against the same nvbandwidth-test-replica=mpi-worker label. With replicas=2, this can deadlock scheduling because the first worker cannot schedule until another worker already exists in the target topology domain. Use preferred podAffinity (or nodeAffinity/topologySpreadConstraints keyed on nvidia.com/gpu.clique) to ensure the job can start.

Suggested change
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: nvbandwidth-test-replica
operator: In
values:
- mpi-worker
topologyKey: nvidia.com/gpu.clique
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: nvbandwidth-test-replica
operator: In
values:
- mpi-worker
topologyKey: nvidia.com/gpu.clique
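
The page extraction dropped the indentation from the suggestion above. With nesting restored, the proposed preferred form would look roughly like the sketch below. Its placement under the Worker pod template's `spec.affinity` is an assumption based on the standard Kubernetes pod affinity schema, not something shown in the PR excerpt:

```yaml
# Assumed location: the Worker pod template's spec.affinity in the MPIJob manifest.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100  # highest preference; valid weights are 1-100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: nvbandwidth-test-replica
                operator: In
                values:
                  - mpi-worker
          topologyKey: nvidia.com/gpu.clique
```

The preferred form is best-effort, so the scheduler may place the workers in different `nvidia.com/gpu.clique` domains when capacity is tight, which matters for a test meant to exercise multi-node NVLink. The deadlock premise is also worth verifying before switching: the scheduler is generally understood to admit the first pod of a group whose required affinity term matches its own labels when no other matching pod exists yet.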

Comment on lines +247 to +257
2. Install the NVIDIA DRA operator for your cluster using the dra-values.yaml file you created in the previous step.

```shell
helm repo add eks https://aws.github.io/eks-charts
helm repo update
```

```shell
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
  --version="0.5.14" \
  -f efa-values.yaml
```

Copilot AI Feb 9, 2026

This step text appears to be copied from the DRA section: it says to install the "NVIDIA DRA operator" and references "dra-values.yaml", but this section is for installing the EFA device plugin and the values file is "efa-values.yaml". Please update the step description/file reference to match the actual EFA plugin install commands below.
