feat: add --subnets flag to deploy multiple nodes per client#136

Open
ch4r10t33r wants to merge 36 commits into main from subnets

ch4r10t33r commented Mar 17, 2026

Summary

  • Adds a `--subnets N` flag (N = 1–5) to deploy N independent copies of each client on the same server. Each copy gets a unique name (`{client}_0` … `{client}_N-1`), incrementally offset ports, and a fresh P2P identity key — nodes on the same host never share a subnet.
  • `generate-subnet-config.py` — new script that expands `validator-config.yaml` into `validator-config-subnets-N.yaml`; validates that no two template entries share an IP or client type; adds an explicit `subnet:` field to each generated entry; sets `config.attestation_committee_count = N` so clients partition attestation committees correctly.
  • Updates `--prepare` to open the full port range for all N subnet nodes per host (ports `base` … `base+N-1` for QUIC/UDP, metrics/TCP, and API/TCP), by matching validator entries by IP rather than by hostname.
  • Adds `--dry-run` flag — full deployment simulation without applying any changes (Ansible runs with `--check --diff`, local execs are echoed only, genesis generation is skipped).
  • Aggregator selection overhauled: one aggregator per subnet chosen randomly, with pre-existing `isAggregator: true` in the YAML honoured as the user's choice (overriding random selection). An invariant check hard-fails if any subnet ends up with ≠ 1 aggregator.
  • Subnet membership derived from the explicit `subnet:` field in the config (not from node name suffix), so nodes like `ethlambda_1` in a single-subnet config are not mistakenly treated as belonging to subnet 1.
  • `copy-genesis.yml` and `deploy-nodes.yml` now copy only the hash-sig keys for each host's own validators (not the entire directory), scoped via `annotated_validators.yaml`.
  • Adds `docs/adding-a-new-client.md` — a comprehensive step-by-step guide for integrating a new Lean Ethereum client, linked from the README.
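As a rough illustration of the expansion in the first two bullets, here is a minimal Python sketch. It is not the code in `generate-subnet-config.py`: the `expand_entry` helper and the field names `name`, `ip`, `quicPort`, `metricsPort`, and `apiPort` are assumptions made for the example. Only the `{client}_i` naming, the per-copy port offset, the fresh key, and the `subnet` field are taken from the description above.

```python
import secrets

# Hypothetical port field names; the real validator-config.yaml schema may differ.
PORT_FIELDS = ("quicPort", "metricsPort", "apiPort")


def expand_entry(entry: dict, n: int) -> list[dict]:
    """Return n copies of one template entry, one copy per subnet."""
    client = entry["name"].rsplit("_", 1)[0]  # "zeam_0" -> "zeam"
    copies = []
    for i in range(n):
        node = dict(entry)  # a shallow copy is enough for a flat entry
        node["name"] = f"{client}_{i}"
        for field in PORT_FIELDS:
            node[field] = entry[field] + i  # base, base+1, ..., base+n-1
        node["privkey"] = secrets.token_hex(32)  # fresh 32-byte key, hex-encoded
        node["subnet"] = i  # explicit membership, not inferred from the name
        copies.append(node)
    return copies


template = {
    "name": "zeam_0",
    "ip": "10.0.0.1",  # placeholder address
    "quicPort": 9000,
    "metricsPort": 8080,
    "apiPort": 5052,
    "privkey": "00" * 32,  # placeholder, replaced in every copy
}
expanded = expand_entry(template, 3)
for node in expanded:
    print(node["name"], node["quicPort"], node["subnet"])
```

The sketch regenerates the key for every copy, including copy 0; whether the real script does the same for `N = 1` is not stated here.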

Files changed

| File | Change |
|------|--------|
| `generate-subnet-config.py` | New: subnet config generator |
| `spin-node.sh` | `--subnets`, `--dry-run`, aggregator selection overhaul |
| `parse-env.sh` | Parse `--subnets N` and `--dry-run` arguments |
| `run-ansible.sh` | Pass `validator_config_basename` extra var; `--check --diff` in dry-run |
| `ansible/playbooks/deploy-nodes.yml` | Use `validator_config_basename`; copy only per-node hash-sig keys |
| `ansible/playbooks/copy-genesis.yml` | Copy only per-node hash-sig keys |
| `ansible/playbooks/prepare.yml` | Open all subnet port ranges per host |
| `convert-validator-config.py` | Fall back to `httpPort` for Lantern when generating leanpoint upstreams |
| `ansible-devnet/genesis/validator-config.yaml` | Add `privkey` for commented-out clients |
| `README.md` | Document `--subnets`, `--dry-run`, link to new client guide |
| `docs/adding-a-new-client.md` | New: client integration guide (see below) |

Adding a new client

The new guide at `docs/adding-a-new-client.md` covers the 6 files every new client must provide, with full code examples for each:

  1. `validator-config.yaml` (both `local-devnet` and `ansible-devnet`) — node entry with `privkey`, ports, and IP. Local uses `127.0.0.1`; Ansible uses the server's public IP (contact the zeam team to get a server assigned). Ports must be unique per server; `--subnets N` handles the expansion automatically.
  2. `client-cmds/myclient-cmd.sh` — defines `node_binary`, `node_docker`, and `node_setup`. Documents all injected variables (`$item`, `$configDir`, `$isAggregator`, `$attestationCommitteeCount`, etc.) and the required CLI flags (`--attestation-committee-count`, `--is-aggregator`, `--checkpoint-sync-url`, etc.). Client must expose `GET /v0/health`.
  3. `ansible/roles/myclient/defaults/main.yml` — fallback Docker image and deployment mode.
  4. `ansible/roles/myclient/tasks/main.yml` — full Ansible task file: extract image from `client-cmd.sh`, read ports from config, stop/start Docker container with core-dump, aggregator, and checkpoint-sync support.
  5. `ansible/playbooks/helpers/deploy-single-node.yml` — add `include_role` block and update the unknown-client-type guard.
  6. `README.md` — add to Clients supported list.

Everything else (genesis generation, key management, inventory generation, subnet expansion, leanpoint upstreams, aggregator selection, observability) is fully generic and requires no changes.

Test plan

  • `--subnets 2` generates correct `validator-config-subnets-2.yaml` with unique ports and keys
  • `--dry-run` prints simulation output without modifying any file or deploying anything
  • Aggregator selection respects pre-existing `isAggregator: true` and does not override it
  • Nodes with numeric suffixes (e.g. `ethlambda_1`) in a single-subnet config are all assigned to subnet 0
  • Hash-sig keys: each server receives only its own validator's `_sk.ssz` / `_pk.ssz` files
  • `--prepare` opens correct port ranges when used with `--subnets N`
  • New client guide is accurate — follow it end-to-end with a test client

Add support for configuring nodes as aggregators through validator-config.yaml.
This allows selective designation of nodes to perform aggregation duties by
setting isAggregator: true in the validator configuration.

Changes:
- Add isAggregator field (default: false) to all validators in both local and ansible configs
- Update parse-vc.sh to extract and export isAggregator flag
- Modify all client command scripts to pass --is-aggregator flag when enabled
- Add isAggregator status to node information output
Resolved conflicts in client-cmds scripts by keeping both:
- Aggregator flag support
- Checkpoint sync URL support

Updated Docker images:
- zeam: 0xpartha/zeam:devnet3
- lantern: piertwo/lantern:v0.0.3-test
- ethlambda: ghcr.io/lambdaclass/ethlambda:devnet3

Added httpPort support for lantern nodes.
ch4r10t33r marked this pull request as ready for review March 17, 2026 21:33
ch4r10t33r requested a review from g11tech March 17, 2026 21:33
ch4r10t33r added the enhancement (New feature or request) label Mar 17, 2026
ch4r10t33r changed the title from "feat: add --subnets flag to deploy multiple nodes per client" to "[WIP] feat: add --subnets flag to deploy multiple nodes per client" Mar 18, 2026
ch4r10t33r marked this pull request as draft March 18, 2026 07:33
Adds --subnets N (1–5) to deploy N nodes of each client on their
associated servers, each on a distinct attestation subnet.

New files:
  - generate-subnet-config.py: expands validator-config.yaml into
    validator-config-subnets-N.yaml with unique node names, incremented
    ports (quic/metrics/api), fresh P2P private keys, and explicit subnet
    membership per entry. Also sets config.attestation_committee_count = N
    so each client correctly partitions validators across N committees.

Changes:
  - parse-env.sh: add --subnets N and --dry-run flags
  - spin-node.sh:
    - expand validator-config before genesis setup when --subnets N given
    - select one aggregator per subnet randomly; print prominent summary
    - --dry-run: simulate full deployment without applying any changes
      (Ansible runs with --check --diff, local execs are echoed only)
  - run-ansible.sh: pass validator_config_basename extra var so playbooks
    use the active (possibly expanded) config; add --check --diff in dry-run
  - ansible/playbooks/deploy-nodes.yml: use validator_config_basename to
    sync the correct config file to remote hosts
  - ansible/playbooks/prepare.yml: open port ranges for all subnet nodes
    on a host by matching entries via IP, not just hostname
  - convert-validator-config.py: fall back to httpPort for Lantern nodes
    when generating Leanpoint upstreams
  - README.md: document --subnets and --dry-run; update --prepare firewall
    table to reflect port ranges when --subnets N is active
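The IP-based matching used by `--prepare` can be pictured with the sketch below. The real logic lives in `ansible/playbooks/prepare.yml`; Python is used here only for illustration, and the `ports_by_host` helper and the field names `ip`, `quicPort`, `metricsPort`, and `apiPort` are assumptions.

```python
from collections import defaultdict


def ports_by_host(entries: list[dict]) -> dict:
    """Collect the ports of every node on a host, keyed by the host's IP."""
    ports = defaultdict(lambda: {"udp": set(), "tcp": set()})
    for entry in entries:
        host = ports[entry["ip"]]  # match by IP, so zeam_0 and zeam_1 land together
        host["udp"].add(entry["quicPort"])  # QUIC runs over UDP
        host["tcp"].update((entry["metricsPort"], entry["apiPort"]))
    return {
        ip: {proto: sorted(values) for proto, values in protos.items()}
        for ip, protos in ports.items()
    }


# Two subnet copies of one client on a single host (placeholder values).
entries = [
    {"name": "zeam_0", "ip": "10.0.0.1", "quicPort": 9000, "metricsPort": 8080, "apiPort": 5052},
    {"name": "zeam_1", "ip": "10.0.0.1", "quicPort": 9001, "metricsPort": 8081, "apiPort": 5053},
]
print(ports_by_host(entries))
```

Grouping by hostname alone would miss `zeam_1` in this example, since only the IP ties it to the same server as `zeam_0`.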

Rules enforced by generate-subnet-config.py:
  - No two nodes on the same server may share a subnet (template validated)
  - Each subnet has exactly one node per client
  - N=1 is a no-op expansion (single-subnet baseline)
  - N capped at 5
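The template check behind the first rule could be sketched as follows. This is an assumption-laden illustration, not the script's own validator: the `validate_template` name, the `name` and `ip` fields, and deriving the client type from the node-name prefix are all hypothetical.

```python
from collections import Counter


def validate_template(entries: list[dict]) -> None:
    """Raise ValueError if two template entries share an IP or a client type."""
    ips = [entry["ip"] for entry in entries]
    # Assumption: the client type is the node name without its numeric suffix.
    clients = [entry["name"].rsplit("_", 1)[0] for entry in entries]
    problems = []
    for label, values in (("IP", ips), ("client type", clients)):
        dupes = sorted(value for value, count in Counter(values).items() if count > 1)
        if dupes:
            problems.append(f"duplicate {label}: {', '.join(dupes)}")
    if problems:
        raise ValueError("; ".join(problems))


good = [
    {"name": "zeam_0", "ip": "10.0.0.1"},
    {"name": "ethlambda_0", "ip": "10.0.0.2"},
]
validate_template(good)  # passes silently
```

With one client per server in the template, giving copy `i` of every client subnet `i` keeps any two nodes on the same host in different subnets.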
ch4r10t33r and others added 9 commits March 18, 2026 12:55
Previously both deploy-nodes.yml and copy-genesis.yml synced the entire
hash-sig-keys/ directory to every remote host, meaning every server
received every validator's sk/pk pair.

Now each playbook:
  1. Reads annotated_validators.yaml on the controller to look up the
     privkey_file entries for the node being deployed (inventory_hostname).
  2. Derives the pk filename by replacing _sk.ssz → _pk.ssz.
  3. Copies only those specific files to the target host.

A server running zeam_0 (validator_0_sk.ssz / validator_0_pk.ssz) no
longer receives validator_1_sk.ssz, validator_2_sk.ssz, etc.
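The three steps above are implemented in the Ansible playbooks; the Python sketch below only mirrors the lookup logic. The layout assumed for `annotated_validators.yaml` (a mapping from node name to a list of entries carrying `privkey_file`) and the `hash_sig_files` helper are assumptions.

```python
def hash_sig_files(annotated: dict, hostname: str) -> list[str]:
    """List the sk/pk files that belong to one node and nothing else."""
    files = []
    for validator in annotated.get(hostname, []):
        sk_file = validator["privkey_file"]
        files.append(sk_file)
        if sk_file.endswith("_sk.ssz"):
            # Derive the public-key file name from the secret-key file name.
            files.append(sk_file[: -len("_sk.ssz")] + "_pk.ssz")
    return files


# Assumed shape of annotated_validators.yaml after loading.
annotated = {
    "zeam_0": [{"privkey_file": "validator_0_sk.ssz"}],
    "ethlambda_0": [{"privkey_file": "validator_1_sk.ssz"}],
}
print(hash_sig_files(annotated, "zeam_0"))
```

A node with no entry yields an empty list, so nothing is copied for it.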
…ffix

The old suffix-based detection (ethlambda_1 → subnet 1) broke when a
config contained multiple nodes for the same client without --subnets
(e.g. ethlambda_0..4 for redundancy), incorrectly creating 5 subnets
and forcing ethlambda nodes as the sole aggregator on subnets 1-4.

Subnet membership is now read from the explicit 'subnet:' field that
generate-subnet-config.py writes for each entry. Nodes without this
field (all standard configs) default to subnet 0, so a single-subnet
deployment always selects exactly one aggregator from all active nodes
regardless of numeric suffixes in their names.
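A minimal sketch of the field-based grouping, assuming each config entry is a dict with a `name` and an optional `subnet` key (the `group_by_subnet` helper is hypothetical):

```python
from collections import defaultdict


def group_by_subnet(entries: list[dict]) -> dict:
    """Group node names by their explicit subnet field, defaulting to 0."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.get("subnet", 0)].append(entry["name"])
    return dict(groups)


# A standard config with five redundant nodes and no 'subnet:' field.
redundant = [{"name": f"ethlambda_{i}"} for i in range(5)]
print(group_by_subnet(redundant))

# An expanded config where the generator wrote the field explicitly.
expanded = [{"name": "zeam_0", "subnet": 0}, {"name": "zeam_1", "subnet": 1}]
print(group_by_subnet(expanded))
```

In the first case all five nodes fall into subnet 0 despite their suffixes, which is the behaviour the commit message describes.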
…r flag is passed

Previously the script always reset all flags and randomly re-selected an
aggregator, ignoring any manual isAggregator: true already set in the
YAML. This caused ethlambda_0 (user's choice) to be silently replaced by
ethlambda_1 (random pick).

Aggregator selection now follows a three-level priority:
  1. --aggregator <node> CLI flag
  2. Pre-existing isAggregator: true in the config (manual YAML edit)
  3. Random selection (fallback when neither is set)

The preset node is validated against the active node list. If it no
longer exists a warning is printed and random selection takes over.
ch4r10t33r marked this pull request as ready for review March 18, 2026 15:06
ch4r10t33r changed the title from "[WIP] feat: add --subnets flag to deploy multiple nodes per client" to "feat: add --subnets flag to deploy multiple nodes per client" Mar 18, 2026
zclawz left a comment

Overall well-structured PR — the subnet expansion model is clean and the per-node hash-sig key copying is a meaningful improvement. A few observations:

1. Double validation in spin-node.sh

The outer guard `[ "$subnets" -ge 1 ] 2>/dev/null` silently suppresses non-integer errors, and the inner guard then re-validates the same range. Combining them into a single block would be cleaner:

if [ -n "$subnets" ]; then
  if ! [[ "$subnets" =~ ^[0-9]+$ ]] || [ "$subnets" -lt 1 ] || [ "$subnets" -gt 5 ]; then
    echo "Error: --subnets requires an integer between 1 and 5"
    exit 1
  fi
  # ... expansion logic
fi

2. MAX_SUBNETS = 5 in two places

`generate-subnet-config.py` and `spin-node.sh` both independently enforce the 1–5 range. They match today, but a future change in one won't automatically update the other. A cross-reference comment would help.
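On the Python side, the range check and the suggested cross-reference comment could look like the sketch below. `parse_subnets` is a hypothetical name, and the wording of the error message is borrowed from the shell snippet in item 1.

```python
import re

MAX_SUBNETS = 5  # keep in sync with the --subnets range check in spin-node.sh


def parse_subnets(raw: str) -> int:
    """Validate a --subnets value and return it as an int."""
    # Accept plain ASCII digits only, then enforce the 1..MAX_SUBNETS range.
    if not re.fullmatch(r"[0-9]+", raw) or not 1 <= int(raw) <= MAX_SUBNETS:
        raise ValueError(f"--subnets requires an integer between 1 and {MAX_SUBNETS}")
    return int(raw)


print(parse_subnets("3"))
```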

3. Private keys in ansible-devnet/genesis/validator-config.yaml

The privkey fields added for gean_0 and nlean_0 are P2P identity keys committed in plaintext. Consistent with how other devnet entries are handled, so presumably intentional — just confirming these are devnet-only keys.

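4. run-ansible.sh positional arg expansion (`${12}`)

Adding `dryRun` as the twelfth positional argument is safe: callers that don't pass it get an empty string (falsy). Note that Bash requires the braced form `${12}` here, because a bare `$12` expands as `$1` followed by a literal `2`. All `spin-node.sh` call sites pass it correctly.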

5. Dynamic group discovery in run-ansible.sh

Replacing the hardcoded client-group list with `yq eval .all.children | keys` is a good improvement: new clients no longer require updating the list. One edge case: if `yq` is absent on the Ansible controller (localhost) and the `|| echo ""` fallback fires, SSH key injection is silently skipped for all hosts. Worth an explicit `yq` check at the top of the script, or at least a warning.
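The explicit check suggested here is a one-liner in most languages. `run-ansible.sh` is a shell script, so the Python sketch below only illustrates the fail-loudly idea; `require_tool` is a hypothetical helper.

```python
import shutil
import sys


def require_tool(name: str) -> str:
    """Return the full path of a required executable, or fail loudly."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} not found on PATH; refusing to continue")
    return path


try:
    print("yq found at", require_tool("yq"))
except RuntimeError as err:
    # In a deployment script this would abort instead of continuing silently.
    print(f"warning: {err}", file=sys.stderr)
```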

6. Per-node hash-sig key copying

Good improvement: only the sk/pk files assigned to each node are transferred. The `when: node_hash_sig_files | length > 0` condition is correct. One question: if `annotated_validators.yaml` exists but a node has no assignments in it, the hash-sig directory is not created and no keys are copied. Is that intentional (the node needs no hash-sig keys), or should it emit a warning?

7. generate-subnet-config.py

The validation logic, port-increment scheme, `secrets.token_hex(32)` for P2P keys, and `attestation_committee_count = N` injection all look correct. The duplicate-IP / duplicate-client-type checks in `_validate_template` are solid defensive guards.

Overall looks good. Happy to approve once the double-validation in spin-node.sh is tidied up (or if you prefer to leave it with a comment, that is fine too).
