feat(kiloclaw): add controller telemetry checkins #1380

Merged
pandemicsyn merged 6 commits into main from florian/chore/controller-telemetry on Mar 23, 2026

@pandemicsyn
Contributor

Summary

Add controller phone-home telemetry from Fly machines to the kiloclaw worker and store it in a dedicated Analytics Engine dataset for machine-health observability.

  • Introduces a new machine-to-worker check-in route: POST /api/controller/checkin.
  • Uses dual auth for check-ins: Bearer KILOCODE_API_KEY + x-kiloclaw-gateway-token.
  • Adds a second AE dataset binding (KILOCLAW_CONTROLLER_AE / kiloclaw_controller_telemetry) separate from lifecycle event telemetry.
  • Wires periodic controller check-ins (5-minute interval, 2-minute initial delay) into controller startup/shutdown.
  • Extracts and reuses openclaw version caching logic in controller/src/openclaw-version.ts.
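
For readers who want a concrete picture of the check-in timing, the loop could be wired roughly as follows. This is a sketch under assumptions, not the code in this PR: `startCheckinLoop`, `Timers`, and `doCheckin` are hypothetical names, and only the 2-minute initial delay and 5-minute interval come from the summary above. The timer functions are injectable so the schedule can be tested without waiting.

```typescript
// Hypothetical sketch: only the two durations below come from the PR summary.
const INITIAL_DELAY_MS = 2 * 60 * 1000; // 2-minute initial delay
const INTERVAL_MS = 5 * 60 * 1000; // 5-minute interval

type TimerHandle = ReturnType<typeof setTimeout>;

interface Timers {
  setTimeout(fn: () => void, ms: number): TimerHandle;
  clearTimeout(handle: TimerHandle): void;
}

const realTimers: Timers = {
  setTimeout: (fn, ms) => setTimeout(fn, ms),
  clearTimeout: (handle) => clearTimeout(handle),
};

// Starts the loop and returns a stop function intended for controller shutdown.
function startCheckinLoop(
  doCheckin: () => Promise<void>,
  timers: Timers = realTimers,
): () => void {
  let stopped = false;
  let handle: TimerHandle | undefined;

  const schedule = (ms: number): void => {
    handle = timers.setTimeout(() => {
      doCheckin()
        // A failed check-in is logged and must not stop the loop.
        .catch((err) => console.error('[checkin] failed:', err))
        .then(() => {
          // Schedule the next run only after the current one has settled,
          // so slow check-ins never overlap.
          if (!stopped) schedule(INTERVAL_MS);
        });
    }, ms);
  };

  schedule(INITIAL_DELAY_MS);

  return () => {
    stopped = true;
    if (handle !== undefined) timers.clearTimeout(handle);
  };
}
```

Chaining timeouts rather than using `setInterval` is a design choice in this sketch; it keeps at most one check-in in flight at a time.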

Reported telemetry payload (every check-in):

| Category | Field | Source |
| --- | --- | --- |
| Identity | sandboxId | KILOCLAW_SANDBOX_ID env |
| Identity | machineId | FLY_MACHINE_ID (or explicit dep override) |
| Controller version | controllerVersion | compiled controller version constant |
| Controller version | controllerCommit | compiled controller commit constant |
| OpenClaw version | openclawVersion | cached openclaw --version probe |
| OpenClaw version | openclawCommit | cached openclaw --version probe |
| Process health | supervisorState | controller supervisor stats |
| Process health | totalRestarts | controller supervisor stats |
| Process health | restartsSinceLastCheckin | delta from prior check-in |
| Process health | uptimeSeconds | controller supervisor stats |
| Host metric | loadAvg5m | os.loadavg()[1] |
| Network metric | bandwidthBytesIn | /proc/net/dev delta |
| Network metric | bandwidthBytesOut | /proc/net/dev delta |
| Process detail | lastExitReason | derived from last exit signal/code |
| Infra label | fly-region | request header at worker ingress |
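
The three delta fields (restartsSinceLastCheckin, bandwidthBytesIn, bandwidthBytesOut) are differences between the current sample and the one taken at the prior check-in. A minimal sketch of that arithmetic is below. The names `Sample`, `counterDelta`, and `computeDeltas` are hypothetical, and the handling of a counter that goes backwards is an assumption, not something the PR states.

```typescript
// Hypothetical types; the field names on the result mirror the payload table.
interface NetCounters {
  bytesIn: number;
  bytesOut: number;
}

interface Sample {
  totalRestarts: number;
  net: NetCounters;
}

// Assumption: if a cumulative counter is lower than the previous sample,
// it was reset, so the current value is reported as the delta.
function counterDelta(current: number, previous: number): number {
  const delta = current - previous;
  return delta >= 0 ? delta : current;
}

function computeDeltas(current: Sample, previous: Sample) {
  return {
    restartsSinceLastCheckin: counterDelta(current.totalRestarts, previous.totalRestarts),
    bandwidthBytesIn: counterDelta(current.net.bytesIn, previous.net.bytesIn),
    bandwidthBytesOut: counterDelta(current.net.bytesOut, previous.net.bytesOut),
  };
}
```

After a check-in, the caller would store the current sample as the new baseline for the next one.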

AE datapoint layout (kiloclaw_controller_telemetry):

| AE slot | Value |
| --- | --- |
| blob1..blob9 | sandboxId, controllerVersion, controllerCommit, openclawVersion, openclawCommit, supervisorState, flyRegion, machineId, lastExitReason |
| double1..double6 | restartsSinceLastCheckin, totalRestarts, uptimeSeconds, loadAvg5m, bandwidthBytesIn, bandwidthBytesOut |
| index1 | sandboxId |
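
To make the slot ordering concrete, here is a sketch of how a check-in could be mapped onto the layout above. `ControllerCheckin`, `DataPoint`, and `toDataPoint` are hypothetical names, and treating every blob field as a plain string is an assumption. The object has the `blobs` / `doubles` / `indexes` shape that Analytics Engine's `writeDataPoint` accepts.

```typescript
// Hypothetical payload type; field names follow the telemetry table.
interface ControllerCheckin {
  sandboxId: string;
  machineId: string;
  controllerVersion: string;
  controllerCommit: string;
  openclawVersion: string;
  openclawCommit: string;
  supervisorState: string;
  totalRestarts: number;
  restartsSinceLastCheckin: number;
  uptimeSeconds: number;
  loadAvg5m: number;
  bandwidthBytesIn: number;
  bandwidthBytesOut: number;
  lastExitReason: string;
}

// Minimal local shape of an Analytics Engine data point.
interface DataPoint {
  blobs: string[];
  doubles: number[];
  indexes: string[];
}

// flyRegion is passed separately because it comes from a request header
// at worker ingress rather than from the controller's payload.
function toDataPoint(c: ControllerCheckin, flyRegion: string): DataPoint {
  return {
    // blob1..blob9, in the documented order
    blobs: [
      c.sandboxId,
      c.controllerVersion,
      c.controllerCommit,
      c.openclawVersion,
      c.openclawCommit,
      c.supervisorState,
      flyRegion,
      c.machineId,
      c.lastExitReason,
    ],
    // double1..double6, in the documented order
    doubles: [
      c.restartsSinceLastCheckin,
      c.totalRestarts,
      c.uptimeSeconds,
      c.loadAvg5m,
      c.bandwidthBytesIn,
      c.bandwidthBytesOut,
    ],
    // index1
    indexes: [c.sandboxId],
  };
}

// In the worker, the result would be handed to the dataset binding, e.g.
// env.KILOCLAW_CONTROLLER_AE.writeDataPoint(toDataPoint(payload, region));
```

Slot order matters because Analytics Engine queries address values positionally (blob1, double1, and so on), so reordering the arrays would silently change what existing queries read.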

Verification

  • pnpm lint (in kiloclaw/) — pass
  • pnpm typecheck (in kiloclaw/) — pass
  • pnpm test (in kiloclaw/) — pass (44 files, 949 tests)
  • pnpm test controller/src/checkin.test.ts src/gateway/env.test.ts — pass
  • pnpm test controller/src/checkin.test.ts controller/src/routes/health.test.ts src/routes/controller.test.ts — pass (23 tests)
  • bash scripts/controller-smoke-test.sh — pass (11 passed, 0 failed)
  • bash scripts/controller-entrypoint-smoke-test.sh — pass (5 passed, 0 failed)
  • bash scripts/controller-proxy-auth-smoke-test.sh — pass (expected proxy auth behavior: 401 without token, success with token)

Visual Changes

N/A

Reviewer Notes

  • This PR is backend/controller-only; no UI changes.
  • worker-configuration.d.ts regeneration changes were intentionally excluded from this branch.
  • Auth path for /api/controller/checkin is intentionally custom and mounted before JWT/internal API middleware.
  • Network stats parser prefers eth0, then falls back to summing non-loopback interfaces.
  • Detailed implementation deviations/deferrals are logged at ~/fd-plans/kiloclaw/controller-telemetry-deviations.md.
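
As an illustration of the network stats note above, a parser that prefers eth0 and otherwise sums non-loopback interfaces might look like the sketch below. `parseNetDev` and `NetCounters` are hypothetical names and this is not the parser from the PR. It relies on the standard /proc/net/dev layout, where each interface line has 16 counters after the colon, with receive bytes first and transmit bytes ninth.

```typescript
interface NetCounters {
  bytesIn: number;
  bytesOut: number;
}

// Parses the text of /proc/net/dev. Returns null if no usable interface is found.
function parseNetDev(contents: string): NetCounters | null {
  const perInterface: Record<string, NetCounters> = {};

  for (const line of contents.split('\n')) {
    const colon = line.indexOf(':');
    if (colon === -1) continue; // the two header lines contain no colon

    const name = line.slice(0, colon).trim();
    const fields = line.slice(colon + 1).trim().split(/\s+/).map(Number);
    if (fields.length < 9) continue;

    const rx = fields[0]; // receive bytes
    const tx = fields[8]; // transmit bytes
    if (isNaN(rx) || isNaN(tx)) continue;

    perInterface[name] = { bytesIn: rx, bytesOut: tx };
  }

  // Prefer eth0 when it exists.
  if (perInterface['eth0']) return perInterface['eth0'];

  // Otherwise sum every interface except loopback.
  let bytesIn = 0;
  let bytesOut = 0;
  let found = false;
  for (const name of Object.keys(perInterface)) {
    if (name === 'lo') continue;
    bytesIn += perInterface[name].bytesIn;
    bytesOut += perInterface[name].bytesOut;
    found = true;
  }
  return found ? { bytesIn, bytesOut } : null;
}
```

The counters in /proc/net/dev are cumulative, so the reported bandwidth fields would be the difference between two successive parses.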

pandemicsyn marked this pull request as ready for review on March 23, 2026, 02:34

@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 23, 2026

Code Review Summary

Status: 3 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 3 |
| SUGGESTION | 0 |

Issue Details

N/A

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
| --- | --- | --- |
| src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx | 399 | Volume reassociation updates React state during render when the machine returns to running |
| src/app/payments/topup/route.ts | 68 | Organization top-up checkout still skips authorization for caller-supplied organization-id |
| src/routers/kiloclaw-router.ts | 1433 | In-place auto-schedule reuse is not rolled back if the DB write fails |

Files Reviewed (3 files)
  • kiloclaw/src/routes/controller.ts - 0 issues
  • kiloclaw/src/test-utils.ts - 0 issues
  • kiloclaw/src/types.ts - 0 issues


Reviewed by gpt-5.4-20260305 · 227,360 tokens

    previousRestarts = stats.restarts;
    previousNetStats = currentNetStats;
  } catch (err) {
    console.error('[checkin] failed:', err);
Contributor

Do we want to phone home about whatever went wrong? I'm assuming we don't consume the controller logs in Axiom, so I'm wondering if we should set up a public Sentry DSN and just send a simple message to gain some visibility.

pandemicsyn merged commit 67bbe7b into main on Mar 23, 2026
18 checks passed
pandemicsyn deleted the florian/chore/controller-telemetry branch on March 23, 2026, 22:26