HostManager: Improve handling of docker containers #463
Conversation
- ManagedInstance is now a dataclass instead of a dict; values from HostManager.database are always called "minst" (short for ManagedInstance).
This means HostManager will not enforce state on unrecognized services, but will still permit the user to bring them up and down (or attempt to do so).
…ervices: Calling die(disown_dockers=True) will cause HostManager to put all running docker services into passive_tracking mode, so they are not brought down on exit. As long as manage=docker/*up*, those services will be picked up and left alone when HostManager next starts up.
They always take 10 seconds to exit.
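The ManagedInstance dataclass mentioned above might look roughly like this. This is a hedged sketch, not the actual ocs code: only the name ManagedInstance, the "minst" convention, and the notion of passive tracking come from this PR; every other field name is illustrative.

```python
from dataclasses import dataclass

# Hypothetical sketch of a ManagedInstance record. Only "passive"
# tracking and the ManagedInstance/minst naming are taken from the PR
# discussion; the other fields are illustrative placeholders.
@dataclass
class ManagedInstance:
    instance_id: str
    agent_class: str = "?"
    management: str = "docker"   # e.g. "host" or "docker" (illustrative)
    target_state: str = "down"   # state HostManager should drive toward
    # When True, HostManager observes the service but does not enforce
    # state on it (e.g. disowned docker services are not brought down
    # on exit).
    passive: bool = False


def new_minst(instance_id: str, agent_class: str = "?") -> ManagedInstance:
    """Convenience constructor; "minst" is the naming convention from the PR."""
    return ManagedInstance(instance_id=instance_id, agent_class=agent_class)
```

A dataclass over a plain dict gives attribute access, defaults, and a fixed field set, so typos in key names fail loudly instead of silently creating new entries.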
Will require an ocs-web update to take advantage of the new features -- will link here with a couple of screenshots when available.
Here's the updated panel (coming together in simonsobs/ocs-web#88) -- the indicator lights correspond to one agent needing a docker pull and restart.
BrianJKoopman
left a comment
This looks great!
I did find one small thing during testing, which maybe you can comment on.
If I ensure there is an orphan first, then command a non-orphan docker container down and back up, the up has to wait the few seconds (10?) for the orphan to exit, due to the new --remove-orphans flag on the docker compose up command. This generates what looks like a failed start of the agent I'm trying to command. After this initial apparent failed launch, the orphan has been removed and the second launch succeeds.
Here are some logs, starting from an update call. I then remove fake-data1 from the compose file (making it an orphan), then command fake-data3 down and back up:
2026-03-31T16:47:20-0400 start called for update
2026-03-31T16:47:20-0400 update:19 Status is now "starting".
2026-03-31T16:47:20-0400 update:19 Status is now "running".
2026-03-31T16:47:20-0400 Installed OCS Plugins: ['socs', 'ocs']
2026-03-31T16:47:20-0400 update:19 Update requested.
2026-03-31T16:47:20-0400 update:19 Status is now "done".
2026-03-31T16:47:37-0400 manager:0 Marking missing service for fake-data1
2026-03-31T16:48:47-0400 start called for update
2026-03-31T16:48:47-0400 update:20 Status is now "starting".
2026-03-31T16:48:47-0400 update:20 Status is now "running".
2026-03-31T16:48:47-0400 update:20 Update requested.
2026-03-31T16:48:47-0400 update:20 Status is now "done".
2026-03-31T16:48:47-0400 manager:0 Requesting termination of FakeDataAgent:fake-data3
2026-03-31T16:48:55-0400 manager:0 Agent instance FakeDataAgent:fake-data3 has exited
2026-03-31T16:48:59-0400 start called for update
2026-03-31T16:48:59-0400 update:21 Status is now "starting".
2026-03-31T16:48:59-0400 update:21 Status is now "running".
2026-03-31T16:48:59-0400 update:21 Update requested.
2026-03-31T16:48:59-0400 update:21 Status is now "done".
2026-03-31T16:49:00-0400 manager:0 Requested launch for FakeDataAgent:fake-data3
2026-03-31T16:49:01-0400 manager:0 Launched FakeDataAgent:fake-data3
2026-03-31T16:49:02-0400 manager:0 Detected exit of FakeDataAgent:fake-data3 with code 127.
2026-03-31T16:49:07-0400 manager:0 Requested launch for FakeDataAgent:fake-data3
2026-03-31T16:49:08-0400 manager:0 Launched FakeDataAgent:fake-data3
Is there any way we can be more patient on the up command? Or some other way to mitigate the potentially confusing log message?
This is actually quite tricky to improve. And it's barely broken... but I'll look into it.
Yeah, if it's not simple I'm not super concerned. I'd say it would be a fairly common occurrence though, when an agent gets removed from the config file without being brought down first, unless the user who removes the agent knows to go run
This simplifies the state machine logic. Extended the start-up timeout to 15 seconds -- this should be enough time for docker "up" to remove orphans first.
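The extended start-up timeout could be sketched like this. This is a hypothetical helper with invented names (wait_for_up, check_running), not the real HostManager state machine; it only illustrates the "be patient on up" idea from the thread.

```python
import time

# Start-up timeout extended from 10 to 15 seconds (per the PR comment),
# to ride out the delay while "docker compose up --remove-orphans"
# tears down orphaned containers before starting the target service.
START_TIMEOUT = 15.0


def wait_for_up(check_running, timeout=START_TIMEOUT, poll=0.5):
    """Poll check_running() until it returns True or the timeout expires.

    Returns True if the service came up in time, False otherwise. A
    generous timeout avoids reporting a "failed start" while compose is
    still busy removing orphans.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_running():
            return True
        time.sleep(poll)
    return False
```

The trade-off is that a genuinely broken container now takes up to 15 seconds to be reported as failed, which is usually acceptable for a supervisor like HostManager.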
Should be fixed in the latest push.
Great, thanks for those changes!
One more question. I'm finding that if I add a new docker-managed agent (really I'm removing one and then re-adding it), it gets picked up as 'passive' only. It gets fully picked up if I restart the HM. Is that expected?
Either way, I'm happy with this as is. Let me know what you think and if you plan to make any more changes or if we should address later if at all.
Did you reload the SCF with the "Refresh config" button? Without that, it won't know the new docker service is for an agent. It's perhaps annoying that it rescans compose files constantly, but the SCF only on demand...
Ah, that does work, but only on the second run of 'update'.
Some logs after adding the agent back in: at this point it's still marked as 'passive'. Then I run update again: now it's as expected, a fully managed agent.
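The two-run behavior described above suggests an ordering issue inside an update pass: if the compose scan happens before the SCF reload, the new service is classified against stale agent info and only gets recognized on the next run. A minimal sketch of that idea, with entirely hypothetical function names (update_pass, reload_config, scan_compose, reconcile are not the real ocs API):

```python
def update_pass(reload_config, scan_compose, reconcile):
    """Run one hypothetical HostManager update pass.

    Reloading the site config *before* scanning compose files means a
    newly added agent's docker service is matched against the fresh
    agent list in the same pass, instead of only on the next 'update'.
    """
    config = reload_config()    # pick up newly added agent instances first
    services = scan_compose()   # then match compose services against them
    return reconcile(config, services)
```

With the reversed order, the first pass would see the service but not the agent entry, reproducing the "passive on first update" symptom.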
Wow, thorough testing. Should be fixed now!
BrianJKoopman
left a comment
Looks good, thanks again for the updates!

Description
Improves overall HostManager handling of docker containers. Specifically:
Improvements as part of above:
Motivation and Context
This work was spurred by the desire for the "pull_dockers" feature, but led to a bit of an overhaul for a bunch of little quality-of-life things.
The one "breaking" change here is that non-agent docker instances are now passively managed -- they won't be brought down when HostManager exits. So if that behavior is relied upon, this could be a problem. In SO we carefully separate those services into separate compose files. If need be an agent arg could be added to select between old and new behavior.
Resolves #363.
How Has This Been Tested?
Manually, cycling over the various things that can be weird in a docker-based ocs (changing service names, inconsistencies between compose files and the SCF, etc.).
Types of changes
Checklist: