feat: resume intent detection with HTTP 410#11
Conversation
1.Client ?resume=true + expired session → HTTP 410 + ResumeAttempt(false) 2.Hub race: session expires mid-upgrade → WS close 4100, no new session 3.Re-export CloseSessionExpired from core Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds explicit “resume intent” handling to distinguish reconnect/resume attempts from new sessions, improving failed-resume observability and avoiding unnecessary upgrades when a resume is doomed.
Changes:
- Add
?resume=truepre-check inServeHTTPto reject expired resumption attempts with HTTP 410 Gone and emitResumeAttempt(..., false). - Handle the hub race case by rejecting the connection with WS close 4100 (
CloseSessionExpired) instead of creating a new session. - Update metrics/docs/changelog and add an integration test for the HTTP 410 rejection path.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
types.go |
Re-exports CloseSessionExpired close code from core. |
server.go |
Implements HTTP-layer resume intent pre-check and threads resumeIntent into hub registration. |
hub.go |
Adds resumeIntent to registerMessage and handles the race-case rejection with close 4100. |
metrics.go |
Updates ResumeAttempt GoDoc to define success=false semantics and scenarios. |
metrics_integration_test.go |
Adds integration coverage for HTTP 410 + ResumeAttempt(false) emission. |
doc/internals.md |
Documents the resume-intent flow and race-case behavior. |
CHANGELOG.md |
Records the new behavior and exported close code. |
| if connectionID == "" { | ||
| connectionID = uuid.NewString() | ||
| } | ||
|
|
||
| resumeIntent := r.URL.Query().Get("resume") == "true" |
There was a problem hiding this comment.
ServeHTTP assigns a random UUID when ConnectFunc returns an empty connectionID, but resumeIntent is evaluated after that. If a client sends ?resume=true while the server-generated ID path is used, this will always fail the pre-check and return HTTP 410 even though the server cannot meaningfully “resume” without a deterministic connectionID. Consider only honoring resume intent when ConnectFunc provided a non-empty connectionID (or clear resumeIntent when you auto-generate the ID).
| _ = message.transport.WriteMessage( | ||
| websocket.CloseMessage, | ||
| websocket.FormatCloseMessage(CloseSessionExpired, "session expired"), | ||
| ) | ||
| _ = message.transport.Close() |
There was a problem hiding this comment.
handleRegister writes a WebSocket close frame directly from the hub event loop. Network I/O here can block the hub goroutine (and delay processing for all sessions) if the peer is slow or the write stalls. Consider moving the close-frame write/transport close into a short-lived goroutine and/or using a write deadline (e.g., WriteControl with a deadline) to keep the hub loop non-blocking.
| _ = message.transport.WriteMessage( | |
| websocket.CloseMessage, | |
| websocket.FormatCloseMessage(CloseSessionExpired, "session expired"), | |
| ) | |
| _ = message.transport.Close() | |
| // Send a close frame and close the transport outside the hub loop to | |
| // avoid blocking the hub goroutine on network I/O. | |
| go func(conn *websocket.Conn) { | |
| deadline := time.Now().Add(5 * time.Second) | |
| _ = conn.WriteControl( | |
| websocket.CloseMessage, | |
| websocket.FormatCloseMessage(CloseSessionExpired, "session expired"), | |
| deadline, | |
| ) | |
| _ = conn.Close() | |
| }(message.transport) |
| zap.String("room_id", message.roomID), | ||
| ) | ||
| h.config.metrics.ResumeAttempt(message.roomID, message.connectionID, false) |
There was a problem hiding this comment.
In the hub race-case failure path, ResumeAttempt(false) (and the log field room_id) use message.roomID from ConnectFunc. But ConnectFunc’s roomID is documented as ignored for resumption, so it may not match the room of the session the client was trying to resume. Consider capturing the original session roomID during the ServeHTTP pre-check (when the session exists) and carrying that into registerMessage so metrics/logging for the race-case are attributed to the correct room.
| zap.String("room_id", message.roomID), | |
| ) | |
| h.config.metrics.ResumeAttempt(message.roomID, message.connectionID, false) | |
| ) | |
| // The original session has already been destroyed; we no longer know its | |
| // room. Avoid using message.roomID from ConnectFunc (ignored for resume) | |
| // to prevent misattributing metrics to the wrong room. | |
| h.config.metrics.ResumeAttempt("", message.connectionID, false) |
| // ResumeAttempt is called when a client attempts to resume a suspended session. | ||
| // success is true when the session was successfully resumed (transport swap), | ||
| // and false when the session no longer exists (grace window expired). | ||
| // | ||
| // A failed resume (success=false) is emitted in two scenarios: | ||
| // - ServeHTTP pre-check: client connects with ?resume=true but | ||
| // hub.get(connectionID) returns nil. HTTP 410 is returned. | ||
| // - Hub race case: pre-check passed but the session expired between |
There was a problem hiding this comment.
The MetricsCollector interface header comment says hooks are called from the hub/readPump/writePump goroutines, but ResumeAttempt(success=false) is now also invoked from the ServeHTTP goroutine (pre-check). Please update the interface-level documentation to include ServeHTTP so implementers don’t miss this call site/concurrency context.
| // Race case: the HTTP pre-check saw the session, but by the time the | ||
| // hub processes the registerMessage the grace timer has fired and | ||
| // destroyed it. The client wanted to resume but the session is gone. | ||
| // Reject with WS close 4100 (CloseSessionExpired) — do not create a | ||
| // new session, consistent with the HTTP 410 rejection in ServeHTTP. |
There was a problem hiding this comment.
The hub race-case behavior (resumeIntent=true, session removed between HTTP pre-check and hub handling) now rejects the connection (WS close 4100) instead of creating a new session, but there’s no test covering this path. Since it’s timing-sensitive, adding an integration/unit test with controllable clock/synchronization would help prevent regressions.
|
Superseded — success bool parameter removed entirely instead of implementing resume intent detection. See refactor/remove-resume-success-bool branch. |
Summary
?resume=trueafter grace window expiry are rejected with HTTP 410 Gone (no WebSocket upgrade), andResumeAttempt(roomID, connectionID, false)is emittedCloseSessionExpired) instead of creating a new sessionCloseSessionExpired(4100) re-exported from core;ResumeAttemptGoDoc updated to reflectsuccess=falsesemanticsTest plan
TestIntegration_MetricsCollector_ResumeAttempt_Rejected— connect, disconnect, grace expiry, reconnect with?resume=true→ HTTP 410, ResumeAttempt(false) emitted, no ConnectionOpenedTestIntegration_MetricsCollector_ResumeAttempt(success path) still passes🤖 Generated with Claude Code