Fix RemoteDisconnected errors from stale pooled connections #409

Open
roxanne-o wants to merge 1 commit into main from fix-stale-connection-retry

@roxanne-o
Collaborator

Summary

  • Root cause: When the SDK reuses an HTTP connection that has been idle for ~60s (e.g. between rate-limited image submissions), the server-side infrastructure (ALB default idle timeout = 60s, nginx default proxy_read_timeout = 60s) may have already closed it. urllib3 then raises RemoteDisconnected, which was neither retried nor caught.
  • Add a default urllib3 Retry policy (total=3, with backoff) so stale connections are transparently re-established at the transport layer. Users can still override via http_transport_retries.
  • Catch MaxRetryError and ProtocolError in rest.py so that if retries exhaust, the error surfaces as ApiException instead of a raw urllib3 exception.
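
The retry policy described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual code: build_default_retry and build_pool are hypothetical names, the read, redirect, and backoff_factor values are taken from the diff excerpt quoted in the review below, and allowed_methods assumes urllib3 1.26 or newer.

```python
from typing import Optional

import urllib3
from urllib3.util.retry import Retry


def build_default_retry() -> Retry:
    """Default transport-level retry policy (values as quoted in this PR)."""
    return Retry(
        total=3,
        read=3,
        redirect=3,
        backoff_factor=0.2,
        allowed_methods=None,  # None makes every HTTP method eligible, POST included
    )


def build_pool(http_transport_retries: Optional[Retry] = None) -> urllib3.PoolManager:
    # Hypothetical helper: an explicit policy from the caller wins over the default.
    if http_transport_retries is None:
        http_transport_retries = build_default_retry()
    return urllib3.PoolManager(retries=http_transport_retries)
```

Passing an explicit Retry object to build_pool leaves that object in charge, which mirrors the "users can still override" behaviour described in the summary.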

Reproducer

uv run submit_images.py --folder ./photos --detector det_xxx --interval 60

After a few successful submissions, the next one fails with:

http.client.RemoteDisconnected: Remote end closed connection without response
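
For anyone without access to the detector, the symptom alone can be simulated locally with the standard library. The sketch below is an assumption-laden stand-in for the real setup: it does not wait out a 60s idle timeout or involve a connection pool, it only runs a throwaway local server that reads one request and hangs up without answering, which is what the client sees when the far end has already dropped the connection.

```python
import http.client
import socket
import threading


def _accept_and_hang_up(server_sock: socket.socket) -> None:
    # Read the request, then close without sending any response bytes.
    conn, _ = server_sock.accept()
    conn.recv(65536)
    conn.close()


def reproduce_remote_disconnected():
    """Return the error message if RemoteDisconnected is raised, else None."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=_accept_and_hang_up, args=(server,), daemon=True)
    worker.start()

    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")
        client.getresponse()
    except http.client.RemoteDisconnected as exc:
        return str(exc)
    finally:
        client.close()
        worker.join(timeout=5)
        server.close()
    return None


if __name__ == "__main__":
    print(reproduce_remote_disconnected())
```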

Test plan

  • Existing test_create_groundlight_with_retries passes (explicit http_transport_retries still honored)
  • Existing test_http_retries.py 5xx retry tests pass (mock bypasses urllib3 Retry layer)
  • Manual test: submit images with 60s+ interval — no more RemoteDisconnected crashes

Made with Cursor

When the SDK reuses an HTTP connection that has been idle for ~60s, the
server-side infrastructure (ALB / nginx) may have already closed it.
urllib3 then raises RemoteDisconnected, which was neither retried nor caught.

- Add a default urllib3 Retry policy (3 retries with backoff) so stale
  connections are transparently re-established at the transport layer.
- Catch MaxRetryError and ProtocolError in rest.py so exhausted retries
  surface as ApiException instead of raw urllib3 errors.
Comment on lines +221 to +226
except urllib3.exceptions.MaxRetryError as e:
    msg = "{0}\n{1}".format(type(e).__name__, str(e))
    raise ApiException(status=0, reason=msg)
except urllib3.exceptions.ProtocolError as e:
    msg = "{0}\n{1}".format(type(e).__name__, str(e))
    raise ApiException(status=0, reason=msg)
Contributor

Was this manually added? This is a generated file, so these changes will be wiped out every time it's regenerated.
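
One possible way to keep the same conversion outside the generated file is a small hand-written wrapper around the call. The sketch below is only an illustration of that idea: ApiException here is a stand-in class reduced to the two fields used in the diff, and call_with_wrapping is a hypothetical helper, not something that exists in the SDK.

```python
import urllib3


class ApiException(Exception):
    """Stand-in for the SDK's ApiException, reduced to the fields used here."""

    def __init__(self, status=None, reason=None):
        super().__init__(reason)
        self.status = status
        self.reason = reason


def call_with_wrapping(send):
    """Run send() and convert exhausted-retry or protocol errors to ApiException."""
    try:
        return send()
    except (urllib3.exceptions.MaxRetryError, urllib3.exceptions.ProtocolError) as e:
        # Same message format as the except clauses quoted in the diff.
        msg = "{0}\n{1}".format(type(e).__name__, str(e))
        raise ApiException(status=0, reason=msg) from e
```

Because both exception types are handled identically, a single except clause with a tuple covers them, and "from e" keeps the original urllib3 traceback attached for debugging.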

read=3,
redirect=3,
backoff_factor=0.2,
allowed_methods=None, # retry all HTTP methods including POST
Contributor

I'm concerned about retrying POST requests because that could lead to e.g. duplicate IQ submissions if the request makes it to the server but an error occurs when reading the response.
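
The trade-off behind this concern can be checked directly against urllib3's Retry object. The sketch below assumes urllib3 1.26 or newer; retries_after_read_error is a hypothetical helper and "/hypothetical" is a placeholder URL. It feeds a ProtocolError, which urllib3 classifies as a read error, into Retry.increment and reports whether the policy would try again.

```python
from urllib3.exceptions import ProtocolError
from urllib3.util.retry import Retry


def retries_after_read_error(policy: Retry, method: str) -> bool:
    """Return True if `policy` would retry `method` after a read-side failure."""
    error = ProtocolError("Connection aborted.")
    try:
        # increment() re-raises the original error when the method is not retryable.
        policy.increment(method=method, url="/hypothetical", error=error)
    except ProtocolError:
        return False
    return True


# Policy as configured in the diff: every method is retryable.
retry_everything = Retry(total=3, read=3, backoff_factor=0.2, allowed_methods=None)
# Same numbers, but urllib3's default allowed_methods (idempotent verbs only).
idempotent_only = Retry(total=3, read=3, backoff_factor=0.2)

if __name__ == "__main__":
    print(retries_after_read_error(retry_everything, "POST"))
    print(retries_after_read_error(idempotent_only, "POST"))
    print(retries_after_read_error(idempotent_only, "GET"))
```

With the default allowed_methods, a POST that fails on a read error is not retried, which avoids duplicate submissions but also means the stale-connection failure on POST would still reach the caller. With allowed_methods=None the POST is retried, which is the behaviour the comment above is questioning.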

@paulina-positronix
Contributor

Confirming this seems to have fixed my issue
