HostManager: Improve handling of docker containers #463
Conversation
- ManagedInstance is now a dataclass instead of a dict; values from HostManager.database are always called "minst" (short for ManagedInstance).
This means HostManager will not enforce state on unrecognized services, but will still permit the user to bring them up and down (or attempt to do so).
…ervices: Calling die(disown_dockers=True) will cause HostManager to put all running docker services into passive_tracking mode, so they are not brought down on exit. As long as manage=docker/*up*, those services will be picked up and left alone when HostManager next starts up.
They always take 10 seconds to exit.
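The ManagedInstance dataclass mentioned above might look roughly like this. This is a hedged sketch, not the actual ocs code: only the name ManagedInstance, the "minst" convention, and the notion of passive tracking come from this PR; every other field name is illustrative.

```python
from dataclasses import dataclass

# Hypothetical sketch of a ManagedInstance record. Only "passive"
# tracking and the ManagedInstance/minst naming are taken from the PR
# discussion; the other fields are illustrative placeholders.
@dataclass
class ManagedInstance:
    instance_id: str
    agent_class: str = "?"
    management: str = "docker"   # e.g. "host" or "docker" (illustrative)
    target_state: str = "down"   # state HostManager should drive toward
    # When True, HostManager observes the service but does not enforce
    # state on it (e.g. disowned docker services are not brought down
    # on exit).
    passive: bool = False


def new_minst(instance_id: str, agent_class: str = "?") -> ManagedInstance:
    """Convenience constructor; "minst" is the naming convention from the PR."""
    return ManagedInstance(instance_id=instance_id, agent_class=agent_class)
```

A dataclass over a plain dict gives attribute access, defaults, and a fixed field set, so typos in key names fail loudly instead of silently creating new entries.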
Will require an ocs-web update to take advantage of the new features -- will link here with a couple of screenshots when available.
Here's the updated panel (coming together in simonsobs/ocs-web#88) -- the indicator lights correspond to one agent needing a docker pull and restart.
BrianJKoopman
left a comment
This looks great!
I did find one small thing during testing, which maybe you can comment on.
If I ensure there is an orphan first, then command a non-orphan docker container down and back up, the up has to wait the few seconds (10?) for the orphan to exit, due to the new --remove-orphans flag on the docker compose up command. This generates what looks like a failed start of the agent I'm trying to command. After this initial apparent failed launch, the orphan has been removed and the second launch succeeds.
Here are some logs, starting from an update call. I then remove fake-data1 from the compose file (making it an orphan), then command fake-data3 down and back up:
2026-03-31T16:47:20-0400 start called for update
2026-03-31T16:47:20-0400 update:19 Status is now "starting".
2026-03-31T16:47:20-0400 update:19 Status is now "running".
2026-03-31T16:47:20-0400 Installed OCS Plugins: ['socs', 'ocs']
2026-03-31T16:47:20-0400 update:19 Update requested.
2026-03-31T16:47:20-0400 update:19 Status is now "done".
2026-03-31T16:47:37-0400 manager:0 Marking missing service for fake-data1
2026-03-31T16:48:47-0400 start called for update
2026-03-31T16:48:47-0400 update:20 Status is now "starting".
2026-03-31T16:48:47-0400 update:20 Status is now "running".
2026-03-31T16:48:47-0400 update:20 Update requested.
2026-03-31T16:48:47-0400 update:20 Status is now "done".
2026-03-31T16:48:47-0400 manager:0 Requesting termination of FakeDataAgent:fake-data3
2026-03-31T16:48:55-0400 manager:0 Agent instance FakeDataAgent:fake-data3 has exited
2026-03-31T16:48:59-0400 start called for update
2026-03-31T16:48:59-0400 update:21 Status is now "starting".
2026-03-31T16:48:59-0400 update:21 Status is now "running".
2026-03-31T16:48:59-0400 update:21 Update requested.
2026-03-31T16:48:59-0400 update:21 Status is now "done".
2026-03-31T16:49:00-0400 manager:0 Requested launch for FakeDataAgent:fake-data3
2026-03-31T16:49:01-0400 manager:0 Launched FakeDataAgent:fake-data3
2026-03-31T16:49:02-0400 manager:0 Detected exit of FakeDataAgent:fake-data3 with code 127.
2026-03-31T16:49:07-0400 manager:0 Requested launch for FakeDataAgent:fake-data3
2026-03-31T16:49:08-0400 manager:0 Launched FakeDataAgent:fake-data3
Is there any way we can be more patient on the up command? Or some other way to mitigate the potentially confusing log message?
This is actually quite tricky to improve. And it's barely broken... but I'll look into it.
Yeah, if it's not simple I'm not super concerned. I'd say it would be a fairly common occurrence though, when an agent gets removed from the config file without being brought down first, unless the user who removes the agent knows to go run
This simplifies the state machine logic. Extended the start-up timeout to 15 seconds -- this should be enough time for docker "up" to remove orphans first.
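The extended start-up timeout could be sketched like this. This is a hypothetical helper with invented names (wait_for_up, check_running), not the real HostManager state machine; it only illustrates the "be patient on up" idea from the thread.

```python
import time

# Start-up timeout extended from 10 to 15 seconds (per the PR comment),
# to ride out the delay while "docker compose up --remove-orphans"
# tears down orphaned containers before starting the target service.
START_TIMEOUT = 15.0


def wait_for_up(check_running, timeout=START_TIMEOUT, poll=0.5):
    """Poll check_running() until it returns True or the timeout expires.

    Returns True if the service came up in time, False otherwise. A
    generous timeout avoids reporting a "failed start" while compose is
    still busy removing orphans.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_running():
            return True
        time.sleep(poll)
    return False
```

The trade-off is that a genuinely broken container now takes up to 15 seconds to be reported as failed, which is usually acceptable for a supervisor like HostManager.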
Should be fixed in the latest push.
Great, thanks for those changes!
One more question. I'm finding that if I add a new docker-managed agent (really I'm removing one and then re-adding it), it gets picked up as 'passive' only. It gets fully picked up if I restart the HM. Is that expected?
Either way, I'm happy with this as is. Let me know what you think and if you plan to make any more changes or if we should address later if at all.
Did you reload the SCF with the "Refresh config" button? Without that, it won't know the new docker service is for an agent. It's perhaps annoying that it rescans compose files constantly, but the SCF only on demand...
Ah, that does work, but only on the second run of 'update'.
Some logs after adding the agent back in: at this point it's still marked as 'passive'. Then I run update again: now it's as expected, a fully managed agent.
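The two-run behavior described above suggests an ordering issue inside an update pass: if the compose scan happens before the SCF reload, the new service is classified against stale agent info and only gets recognized on the next run. A minimal sketch of that idea, with entirely hypothetical function names (update_pass, reload_config, scan_compose, reconcile are not the real ocs API):

```python
def update_pass(reload_config, scan_compose, reconcile):
    """Run one hypothetical HostManager update pass.

    Reloading the site config *before* scanning compose files means a
    newly added agent's docker service is matched against the fresh
    agent list in the same pass, instead of only on the next 'update'.
    """
    config = reload_config()    # pick up newly added agent instances first
    services = scan_compose()   # then match compose services against them
    return reconcile(config, services)
```

With the reversed order, the first pass would see the service but not the agent entry, reproducing the "passive on first update" symptom.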
Wow, thorough testing. Should be fixed now!
BrianJKoopman
left a comment
Looks good, thanks again for the updates!

Description
Improves overall HostManager handling of docker containers. Specifically:
Improvements as part of above:
Motivation and Context
This work was spurred by the desire for the "pull_dockers" feature, but led to a bit of an overhaul for a bunch of little quality-of-life things.
The one "breaking" change here is that non-agent docker instances are now passively managed -- they won't be brought down when HostManager exits. So if that behavior is relied upon, this could be a problem. In SO we carefully separate those services into separate compose files. If need be an agent arg could be added to select between old and new behavior.
Resolves #363.
How Has This Been Tested?
Manually, cycling over the various things that can be weird in a docker-based ocs (changing service names, inconsistencies between compose files and the SCF, etc.).
Types of changes
Checklist: