Skip to content

Distinguish container restarts from exits in monitor#3937

Open
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix-docker-restart
Open

Distinguish container restarts from exits in monitor#3937
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix-docker-restart

Conversation

@gkatz2
Copy link
Contributor

@gkatz2 gkatz2 commented Feb 23, 2026

Summary

  • Add ErrContainerRestarted sentinel error to distinguish Docker restarts from true container exits
  • Update the transport's handleContainerExit to reconnect the monitor and keep the proxy running when a restart is detected
  • Fix double-close panic on shutdownCh and add mutex protection in reconnectMonitor

Fixes #3936

Context

The container monitor detects Docker restarts by comparing StartedAt times (#3843), but reports them with the same ErrContainerExited error as real exits. The transport layer can't tell them apart, so it always tears down the proxy and kills the already-running container. This causes cascading restart loops that take MCP servers offline — see #3936 for detailed impact.

Changes

  • pkg/container/runtime/errors.go: New ErrContainerRestarted sentinel
  • pkg/container/runtime/monitor.go: Send ErrContainerRestarted when start time changes
  • pkg/container/docker/errors.go: Deprecated alias for backward compatibility
  • pkg/transport/http.go: Loop-based handleContainerExit that reconnects the monitor on restart; reconnectMonitor with mutex protection; double-close guard in Stop()
  • pkg/transport/http_test.go: Test case for ShouldRestart with ErrContainerRestarted
  • pkg/transport/stdio.go: ShouldRestart excludes ErrContainerRestarted

Test plan

  • Unit tests pass (go test ./pkg/container/runtime/... ./pkg/transport/...)
  • Lint passes
  • Manual test: started a server with patched binary, made MCP tool calls (initialize + tools/call), ran docker restart, confirmed proxy stayed up and MCP tool calls succeeded post-restart
  • Manual test: docker stop still correctly tears down the transport

The container monitor sends ErrContainerExited for both true exits
and Docker restarts (detected via StartedAt change). The transport
layer has no way to tell them apart, so it always tears down the
proxy — killing a healthy container that Docker already restarted.

Add ErrContainerRestarted sentinel error so the transport can handle
restarts differently: reconnect the monitor to track the new start
time while keeping the proxy running.

Fixes stacklok#3936

Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Feb 23, 2026
@codecov
Copy link

codecov bot commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 9.43396% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.64%. Comparing base (057d93d) to head (6a2b59a).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/transport/http.go 4.00% 48 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3937      +/-   ##
==========================================
+ Coverage   67.62%   67.64%   +0.01%     
==========================================
  Files         438      438              
  Lines       44108    44141      +33     
==========================================
+ Hits        29829    29857      +28     
- Misses      11945    11950       +5     
  Partials     2334     2334              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ToolHive kills containers restarted by Docker restart policy

1 participant