feat: add CLI shell and exec commands for pod terminal access#124
feat: add CLI shell and exec commands for pod terminal access#124
Conversation
0073b53 to
095bd1e
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
| @click.option("--pod", default=None, help="Specify a pod name") | ||
| @click.option("--shell", "shell_type", default=None, type=click.Choice(["bash", "sh", "zsh"]), help="Shell type") | ||
| @click.option( | ||
| "--first-pod", is_flag=True, default=False, help="Auto-select the first running pod (skip interactive selection)" |
There was a problem hiding this comment.
I don't think we need this right? if --pod is not provided, then we default to first pod
There was a problem hiding this comment.
If there are multiple replicas, then we interactively ask the user for a pod. --first-pod ensures that we apply the command to the first pod
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
WalkthroughAdds WebSocket-based interactive and non-interactive shell commands to the CLI, SDK session utilities and exceptions, a v3 status API method, new runtime deps, and comprehensive tests for pod resolution, WS sessions, and terminal rendering. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant CLI as "CLI Shell"
participant API as "CentML API"
participant WS as "WebSocket / Pod"
User->>CLI: Invoke shell/exec command
CLI->>API: Query deployment status / running pods
API-->>CLI: Pod list
alt Single pod or --first-pod
CLI->>CLI: Auto-select pod
else Multiple pods
CLI->>User: Prompt for pod
User-->>CLI: Select pod
end
CLI->>API: Request auth token
API-->>CLI: Token
CLI->>CLI: Build WS URL
CLI->>WS: Connect with token
WS-->>CLI: Connection established
alt Interactive
CLI->>WS: Resize PTY
rect rgba(100, 150, 200, 0.5)
loop Bidirectional I/O
User->>CLI: stdin
CLI->>WS: send stdin frames
WS->>CLI: stdout/stderr frames
CLI->>User: display output
end
end
else Exec (non-interactive)
CLI->>WS: Resize + send wrapped command (with markers)
rect rgba(100, 150, 200, 0.5)
WS->>CLI: Output including BEGIN/END markers
CLI->>CLI: Parse output, extract exit code
end
CLI->>User: show output and exit code
end
CLI->>WS: Close connection
WS-->>CLI: Closed
CLI->>User: Return exit code
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
tests/test_sdk_shell_session.py (1)
315-323: Consider explicitly managing the background shutdown task.The fire-and-forget task pattern here works, but it's implicit. While Python 3.10+ guarantees the task won't be collected, explicitly retaining a reference and cleaning it up in a
finallyblock (or using a context manager) would make the intent clearer and reduce the risk of subtle issues if the code evolves.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_sdk_shell_session.py` around lines 315 - 323, The test starts a fire-and-forget task via asyncio.create_task(_set_shutdown()) without keeping a reference; change the _run coroutine to store the task (e.g., shutdown_task = asyncio.create_task(_set_shutdown())), then wrap the call to forward_io(ws, [80, 24], shutdown) in a try/finally where in finally you cancel the shutdown_task (if still running) and await it to ensure cleanup; reference symbols to edit: _run, _set_shutdown, shutdown, forward_io, and the asyncio.create_task call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@centml/cli/shell.py`:
- Around line 41-56: The _connect_args function can block by calling _select_pod
(which uses click.prompt) when multiple pods exist; modify _connect_args to
avoid prompting when stdin is not a TTY: import sys and before calling
_select_pod check sys.stdin.isatty(), and if it's False and neither pod nor
first_pod is provided raise click.ClickException instructing the caller to pass
--pod or --first-pod; keep the existing behavior (call _select_pod) only when
sys.stdin.isatty() is True. Ensure this logic is applied inside the
_connect_args function surrounding the branch that currently sets pod_name =
_select_pod(running_pods, deployment_id).
- Around line 89-91: The exec_cmd function currently concatenates the
Click-provided command tuple with " ".join(command), which loses
quoting/escaping; replace that with shlex.join(command) and ensure shlex is
imported so exec_cmd calls asyncio.run(exec_session(ws_url, token,
shlex.join(command))) to preserve argument boundaries for POSIX shells (refer to
exec_cmd and exec_session symbols).
In `@centml/sdk/shell/session.py`:
- Around line 226-229: The code currently initializes exit_code = 0 and silently
falls through on ConnectionClosed, causing a false "success" when END_MARKER
wasn't received; update the ConnectionClosed handling in
centml/sdk/shell/session.py to treat a socket close before detecting END_MARKER
as a transport failure by either setting exit_code to a non-zero value (e.g., 1)
or raising an exception (e.g., RuntimeError("socket closed before END_MARKER")),
and ensure this logic checks the buffer/is_done/END_MARKER state so you only
mark failure when the trailer was not parsed; apply the same change to the
second occurrence referenced (around lines 258-260) so both code paths return
non-zero or raise on missing END_MARKER.
- Around line 11-12: The code calls websockets.connect(...) with
additional_headers (in centml/sdk/shell/session.py at the two call sites that
currently use websockets.connect around lines 162 and 208) which is only
supported by the new asyncio client; change the import to use the asyncio client
explicitly by importing connect from websockets.asyncio.client and replace
websockets.connect(...) calls with connect(...), or alternatively raise the
minimum requirement to websockets>=14.0 in requirements so the existing import
"import websockets" remains valid—pick one approach and apply it consistently to
both call sites.
In `@tests/test_cli_shell.py`:
- Around line 43-68: The test helper _patch_deps currently only patches
asyncio.run, causing real coroutine objects from interactive_session and
exec_session to be created (leading to un-awaited coroutine warnings and
preventing assertions); update _patch_deps to also patch
centml.cli.shell.interactive_session and centml.cli.shell.exec_session (or have
the patched asyncio.run explicitly await/close a fake coroutine) so tests can
inspect calls and avoid warnings; ensure the patched names (interactive_session,
exec_session) return a simple mock/coroutine that records the URL, shell type,
and command used by shell() and exec_cmd() for assertions.
---
Nitpick comments:
In `@tests/test_sdk_shell_session.py`:
- Around line 315-323: The test starts a fire-and-forget task via
asyncio.create_task(_set_shutdown()) without keeping a reference; change the
_run coroutine to store the task (e.g., shutdown_task =
asyncio.create_task(_set_shutdown())), then wrap the call to forward_io(ws, [80,
24], shutdown) in a try/finally where in finally you cancel the shutdown_task
(if still running) and await it to ensure cleanup; reference symbols to edit:
_run, _set_shutdown, shutdown, forward_io, and the asyncio.create_task call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1295c58d-f50f-4796-812b-7dac84fb274d
📒 Files selected for processing (11)
centml/cli/main.pycentml/cli/shell.pycentml/sdk/api.pycentml/sdk/shell/__init__.pycentml/sdk/shell/exceptions.pycentml/sdk/shell/session.pyrequirements.txttests/conftest.pytests/test_cli_shell.pytests/test_sdk_shell_renderer.pytests/test_sdk_shell_session.py
| exit_code = 0 | ||
| buffer = "" | ||
| is_capturing = False | ||
| is_done = False |
There was a problem hiding this comment.
Don't report success when the socket closes before END_MARKER.
exit_code starts at 0, and the ConnectionClosed path just falls through. If the pod or proxy disconnects before the trailer is parsed, centml cluster exec returns success even though the command outcome is unknown. Please treat “no end marker received” as a transport failure and return non-zero, or raise instead.
Suggested fix
- exit_code = 0
+ exit_code = None
@@
- except websockets.ConnectionClosed:
- pass
- return exit_code
+ except websockets.ConnectionClosed:
+ if not is_done:
+ sys.stderr.write("Error: terminal session closed before command completion\n")
+ return 1
+ return exit_code if is_done and exit_code is not None else 1Also applies to: 258-260
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@centml/sdk/shell/session.py` around lines 226 - 229, The code currently
initializes exit_code = 0 and silently falls through on ConnectionClosed,
causing a false "success" when END_MARKER wasn't received; update the
ConnectionClosed handling in centml/sdk/shell/session.py to treat a socket close
before detecting END_MARKER as a transport failure by either setting exit_code
to a non-zero value (e.g., 1) or raising an exception (e.g.,
RuntimeError("socket closed before END_MARKER")), and ensure this logic checks
the buffer/is_done/END_MARKER state so you only mark failure when the trailer
was not parsed; apply the same change to the second occurrence referenced
(around lines 258-260) so both code paths return non-zero or raise on missing
END_MARKER.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
centml/cli/shell.py (2)
41-60:⚠️ Potential issue | 🟠 MajorGuard against prompting in non-interactive contexts.
exec_cmdcan be invoked non-interactively (CI, scripts), but_connect_argsmay call_select_podwhich usesclick.prompt(). This will hang or fail when multiple pods exist and--pod/--first-podis not specified. Consider requiring--podor--first-podwhen stdin is not a TTY, or defaulting to first pod in non-interactive mode.🛠️ Proposed fix
def _connect_args(deployment_id, pod, shell_type, first_pod=False): """Resolve pod, build WebSocket URL, and obtain auth token.""" with get_centml_client() as cclient: running_pods = get_running_pods(cclient, deployment_id) if not running_pods: raise click.ClickException(f"No running pods found for deployment {deployment_id}") if pod is not None: try: pod_name = _resolve_pod(running_pods, pod) except ShellError as exc: raise click.ClickException(str(exc)) from exc elif len(running_pods) == 1 or first_pod: pod_name = running_pods[0] + elif not sys.stdin.isatty(): + raise click.ClickException( + "Multiple pods found. Specify --pod or --first-pod for non-interactive use." + ) else: pod_name = _select_pod(running_pods, deployment_id)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@centml/cli/shell.py` around lines 41 - 60, The _connect_args function can block by calling _select_pod (which uses click.prompt) in non-interactive runs; fix by detecting interactivity (e.g. using sys.stdin.isatty()) and if not interactive and pod is None and not first_pod and len(running_pods) > 1, do not call _select_pod: instead raise a click.ClickException telling the user to provide --pod or --first-pod (or alternatively choose to default to running_pods[0] if you prefer non-interactive default behavior); implement this check inside _connect_args before calling _select_pod so build_ws_url and auth.get_centml_token remain unchanged.
80-92:⚠️ Potential issue | 🟠 MajorUse
shlex.join()to preserve command argument boundaries.
" ".join(command)loses quoting and escaping. Arguments like"hello world",$HOME, or*will be misinterpreted by the pod shell. Useshlex.join(command)for correct POSIX shell escaping.🛠️ Proposed fix
+import shlex ... def exec_cmd(deployment_id, command, pod, shell_type, first_pod): ws_url, token = _connect_args(deployment_id, pod, shell_type, first_pod) - exit_code = asyncio.run(exec_session(ws_url, token, " ".join(command))) + exit_code = asyncio.run(exec_session(ws_url, token, shlex.join(command))) sys.exit(exit_code)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@centml/cli/shell.py` around lines 80 - 92, The pod exec command concatenates arguments with " ".join(command) which loses quoting/escaping; change exec_cmd to construct the shell string with shlex.join(command) before calling exec_session so argument boundaries are preserved (update the call in exec_cmd that currently does asyncio.run(exec_session(ws_url, token, " ".join(command))) to use shlex.join(command)); ensure you import shlex at top of centml/cli/shell.py if not already present.
🧹 Nitpick comments (4)
centml/sdk/shell/__init__.py (1)
4-12: Consider sorting__all__for consistency.Ruff (RUF022) flags that
__all__is not sorted. Sorting improves maintainability and aligns with isort-style conventions.♻️ Proposed fix
__all__ = [ - "ShellError", "NoPodAvailableError", "PodNotFoundError", + "ShellError", "build_ws_url", + "exec_session", "get_running_pods", "interactive_session", - "exec_session", ]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@centml/sdk/shell/__init__.py` around lines 4 - 12, The __all__ list is unsorted (RUF022); sort the exported names alphabetically to satisfy Ruff and maintain consistency — reorder the entries in __all__ so identifiers like "NoPodAvailableError", "PodNotFoundError", "ShellError", "build_ws_url", "exec_session", "get_running_pods", "interactive_session" appear in ascending lexicographic order (update the __all__ variable in centml/sdk/shell/__init__.py where it's defined).tests/test_sdk_shell_session.py (1)
315-323: Store the task reference to prevent potential garbage collection.The
asyncio.create_task()return value is not stored, which could theoretically allow the task to be garbage collected before completion. While this test likely works in practice, storing the reference is the recommended pattern.♻️ Suggested fix
async def _run(): async def _set_shutdown(): await asyncio.sleep(0.1) shutdown.set() - asyncio.create_task(_set_shutdown()) + _ = asyncio.create_task(_set_shutdown()) return await forward_io(ws, [80, 24], shutdown)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_sdk_shell_session.py` around lines 315 - 323, The create_task call in the local coroutine _run uses asyncio.create_task(_set_shutdown()) without storing its return value, which risks the task being garbage-collected; assign the created Task to a local variable (e.g., shutdown_task or setter_task) inside _run so the reference lives at least until forward_io(ws, [80, 24], shutdown) completes (optionally await or let it finish naturally), keeping function names _run and _set_shutdown unchanged.centml/sdk/shell/session.py (2)
127-136: Consider narrowing the exception catch during task cleanup.The
except (asyncio.CancelledError, Exception)catch is broad and silently suppresses all exceptions from cancelled tasks. While this is common in cleanup code, it could mask unexpected errors. Consider catching specific expected exceptions or at minimum logging unexpected ones.♻️ Suggested improvement
for t in pending: try: await t - except (asyncio.CancelledError, Exception): - pass + except asyncio.CancelledError: + pass # Expected for cancelled tasks + except websockets.ConnectionClosed: + pass # Expected during shutdown + except Exception: + pass # Suppress other errors during cleanup🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@centml/sdk/shell/session.py` around lines 127 - 136, The cleanup loop currently swallows all exceptions with "except (asyncio.CancelledError, Exception): pass", which can hide real errors; update the loop that awaits tasks in "pending" to handle asyncio.CancelledError separately and to catch Exception as "e" and log it (e.g. using the module/session logger) instead of silencing it, so reference the variables pending, done, and t and replace the broad except with a specific asyncio.CancelledError branch and an Exception branch that logs the unexpected exception details.
164-168: Unhandled exceptions from fire-and-forget coroutine.
asyncio.ensure_future(ws.send(...))schedules the send but doesn't handle potential exceptions. If the WebSocket is closed when a resize signal arrives, this could produce unhandled exception warnings.♻️ Suggested fix with exception handling
def _send_resize(): c, r = shutil.get_terminal_size() term_size[0], term_size[1] = c, r - asyncio.ensure_future(ws.send(json.dumps({"operation": "resize", "rows": r, "cols": c}))) + async def _do_resize(): + try: + await ws.send(json.dumps({"operation": "resize", "rows": r, "cols": c})) + except websockets.ConnectionClosed: + pass # Connection already closed, ignore resize + asyncio.ensure_future(_do_resize())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@centml/sdk/shell/session.py` around lines 164 - 168, The _send_resize closure currently calls asyncio.ensure_future(ws.send(...)) which can raise unhandled exceptions if the websocket is closed; change this to schedule a coroutine that catches and logs or ignores exceptions from ws.send. Create an async helper (e.g., async def _safe_send_resize(rows, cols): try: await ws.send(...); except Exception as e: process or debug-log the error) and call asyncio.create_task(_safe_send_resize(r, c)) inside _send_resize so resize failures are handled gracefully; reference _send_resize, ws.send, term_size, and the existing asyncio scheduling call when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@centml/cli/shell.py`:
- Around line 41-60: The _connect_args function can block by calling _select_pod
(which uses click.prompt) in non-interactive runs; fix by detecting
interactivity (e.g. using sys.stdin.isatty()) and if not interactive and pod is
None and not first_pod and len(running_pods) > 1, do not call _select_pod:
instead raise a click.ClickException telling the user to provide --pod or
--first-pod (or alternatively choose to default to running_pods[0] if you prefer
non-interactive default behavior); implement this check inside _connect_args
before calling _select_pod so build_ws_url and auth.get_centml_token remain
unchanged.
- Around line 80-92: The pod exec command concatenates arguments with "
".join(command) which loses quoting/escaping; change exec_cmd to construct the
shell string with shlex.join(command) before calling exec_session so argument
boundaries are preserved (update the call in exec_cmd that currently does
asyncio.run(exec_session(ws_url, token, " ".join(command))) to use
shlex.join(command)); ensure you import shlex at top of centml/cli/shell.py if
not already present.
---
Nitpick comments:
In `@centml/sdk/shell/__init__.py`:
- Around line 4-12: The __all__ list is unsorted (RUF022); sort the exported
names alphabetically to satisfy Ruff and maintain consistency — reorder the
entries in __all__ so identifiers like "NoPodAvailableError",
"PodNotFoundError", "ShellError", "build_ws_url", "exec_session",
"get_running_pods", "interactive_session" appear in ascending lexicographic
order (update the __all__ variable in centml/sdk/shell/__init__.py where it's
defined).
In `@centml/sdk/shell/session.py`:
- Around line 127-136: The cleanup loop currently swallows all exceptions with
"except (asyncio.CancelledError, Exception): pass", which can hide real errors;
update the loop that awaits tasks in "pending" to handle asyncio.CancelledError
separately and to catch Exception as "e" and log it (e.g. using the
module/session logger) instead of silencing it, so reference the variables
pending, done, and t and replace the broad except with a specific
asyncio.CancelledError branch and an Exception branch that logs the unexpected
exception details.
- Around line 164-168: The _send_resize closure currently calls
asyncio.ensure_future(ws.send(...)) which can raise unhandled exceptions if the
websocket is closed; change this to schedule a coroutine that catches and logs
or ignores exceptions from ws.send. Create an async helper (e.g., async def
_safe_send_resize(rows, cols): try: await ws.send(...); except Exception as e:
process or debug-log the error) and call
asyncio.create_task(_safe_send_resize(r, c)) inside _send_resize so resize
failures are handled gracefully; reference _send_resize, ws.send, term_size, and
the existing asyncio scheduling call when making the change.
In `@tests/test_sdk_shell_session.py`:
- Around line 315-323: The create_task call in the local coroutine _run uses
asyncio.create_task(_set_shutdown()) without storing its return value, which
risks the task being garbage-collected; assign the created Task to a local
variable (e.g., shutdown_task or setter_task) inside _run so the reference lives
at least until forward_io(ws, [80, 24], shutdown) completes (optionally await or
let it finish naturally), keeping function names _run and _set_shutdown
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 43be36f8-9df6-44f3-8f76-1d891c236df8
📒 Files selected for processing (11)
centml/cli/main.pycentml/cli/shell.pycentml/sdk/api.pycentml/sdk/shell/__init__.pycentml/sdk/shell/exceptions.pycentml/sdk/shell/session.pyrequirements.txttests/conftest.pytests/test_cli_shell.pytests/test_sdk_shell_renderer.pytests/test_sdk_shell_session.py
Add two new commands under `centml cluster`:
- `shell <id>` -- interactive terminal session (like docker exec -it)
- `exec <id> -- <command>` -- run a command and return output (like ssh host "cmd")
Both connect via WebSocket through the Platform API terminal proxy,
matching the same protocol used by the Web UI TerminalView.
Replaces str.replace("https://", "wss://") with urllib.parse.urlparse
scheme replacement to avoid CodeQL py/incomplete-url-substring-sanitization.
Replace url.startswith() assertions with urllib.parse.urlparse() checks to satisfy CodeQL py/incomplete-url-substring-sanitization rule. Reformat both shell.py and test_shell.py with black.
Add pyte as local terminal emulator (equivalent to xterm.js) to solve cursor positioning and line wrapping issues. Feed WebSocket output through pyte Screen/Stream and render only dirty lines with ANSI escape sequences.
shutil.get_terminal_size() returns (columns, lines), not (rows, cols). The swapped unpacking caused pyte Screen to be created with terminal line count as width, making the display extremely narrow.
…andling Replace regex-based _strip_ansi with pyte single-row screen for marker detection. pyte interprets all VT100/VT220 sequences including OSC and truecolor escapes that the regex could miss.
If two Code signals arrive within 3 seconds, the shell has exited and the reconnect just opened a new session. Exit cleanly instead of looping forever.
Logs to /tmp/centml_shell_debug.log (overridable via CENTML_SHELL_DEBUG_LOG env var). Traces every WS message, stdin event, task lifecycle, reconnect decision, and connection close.
The platform API proxy never forwards ArgoCD Code messages and does not close the WebSocket when the remote shell exits. Replace the Code/reconnect logic with exit echo detection: when the server echoes back "exit\r\n", arm a 2-second idle timeout on ws.recv(). If no more data arrives, the shell has exited -- break out cleanly. Also removes Code handling from _exec_session (markers already work).
…ompt When "exit\r\n" appears at the end of ws data (nothing after it), the shell has exited -- return immediately instead of waiting 2s. When "exit\r\n" is followed by a new prompt (e.g. from echo exit), ignore it and continue the session.
3ff11ee to
78cfb45
Compare
Summary
centml cluster shell <id>for interactive terminal sessions (likedocker exec -it)centml cluster exec <id> -- <command>for running commands and returning output (likessh host "cmd")get_status_v3()to SDK client for pod discoverywebsockets>=13.0dependencySummary by CodeRabbit
New Features
shellandexeccluster subcommands for interactive and non-interactive pod terminal access.get_status_v3()API method for deployment status retrieval.Chores
Tests