
HostManager: Improve handling of docker containers#463

Merged
BrianJKoopman merged 11 commits into main from mhasself/hm-docker-expert
Apr 2, 2026

Conversation

@mhasself
Member

@mhasself mhasself commented Mar 30, 2026

Description

Improves overall HostManager handling of docker containers. Specifically:

  • HostManager keeps track of when a service's image tag has changed, and whether or not that tag is available on the system. It also notices orphan services. All of this is available in session.data.
  • New task to run "docker compose pull" on all managed compose files.
  • New task to remove orphans.
  • The handling of non-agent docker services is explicitly passive. You can put things up / down, but that state will not be enforced as a target state. This should allow non-agent services to co-exist in the same docker compose file, and be manipulated from the command line (or HostManager) without conflicts.
  • The "die" task has a "disown_dockers" parameter which, if passed, will cause all the running dockerized agents to not be brought down when HostManager exits. This can be used to restart HostManager without restarting the docker-based agents.
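The PR doesn't spell out the session.data schema, but the tracking described above suggests a per-service record roughly like the following sketch. All key names here are illustrative assumptions, not the actual HostManager API:

```python
# Hypothetical sketch of per-service state exposed in session.data.
# Every key name below is an assumption for illustration only.
session_data = {
    "fake-data1": {
        "management": "docker",    # how this instance is managed
        "target_state": "up",      # state HostManager tries to enforce
        "current_tag": "v0.11.2",  # image tag the service is configured with
        "tag_available": True,     # is that tag present on the system?
        "orphan": False,           # service present in compose but unrecognized?
    },
}

# A client could scan this to decide when a "docker compose pull" is needed:
needs_pull = [name for name, info in session_data.items()
              if not info["tag_available"]]
orphans = [name for name, info in session_data.items() if info["orphan"]]
```

This is the kind of summary an interface like ocs-web could use to drive the pull/restart indicator lights mentioned later in the thread.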

Improvements as part of above:

  • HostManager waits 11 seconds instead of 5 before complaining about a service not exiting -- services always take 10 seconds to exit, so this reduces log spam.
  • ManagedInstance changed from a dict to a dataclass, and the nomenclature for such an instance is now universally "minst" (was previously one of state, db, instance, ...).
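The dict-to-dataclass change can be illustrated with a minimal sketch; the field names here are hypothetical stand-ins, not the real ManagedInstance definition in ocs:

```python
from dataclasses import dataclass

# Hypothetical sketch only: the actual ManagedInstance fields live in
# the ocs HostManager code; these names are illustrative assumptions.
@dataclass
class ManagedInstance:
    instance_id: str
    agent_class: str
    management: str = "host"    # e.g. "host" or "docker"
    target_state: str = "down"  # state HostManager should enforce

# Unlike a plain dict, attribute access is declared up front, so a typo
# like minst.targt_state raises AttributeError instead of silently
# creating a new key.
minst = ManagedInstance(instance_id="fake-data1",
                        agent_class="FakeDataAgent")
minst.target_state = "up"
```

The uniform "minst" name then makes it unambiguous, anywhere in the code, that a variable holds one of these records rather than a raw dict.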

Motivation and Context

This work was spurred by the desire for the "pull_dockers" feature, but led to a bit of an overhaul for a bunch of little quality of life things.

The one "breaking" change here is that non-agent docker instances are now passively managed -- they won't be brought down when HostManager exits. So if that behavior is relied upon, this could be a problem. In SO we carefully separate those services into separate compose files. If need be, an agent arg could be added to select between old and new behavior.

Resolves #363.

How Has This Been Tested?

Manually, cycling over the various things that can be weird in a docker-based OCS (changing service names, inconsistencies between the compose file and SCF, etc.).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

- ManagedInstance is a dataclass instead of a dict
- values from HostManager.database are always called "minst" (short
  for ManagedInstance).

This means HostManager will not enforce state on unrecognized
services, but will still permit user to bring them up and down (or
attempt to do so).

…ervices

Calling die(disown_dockers=True) will cause HostManager to put all
running docker services into passive_tracking mode so they are not
brought down on exit.  As long as manage=docker/*up*, the services
will be picked up and not messed with when HostManager next starts up.

They always take 10 seconds to exit.
@mhasself
Member Author

Will require an ocs-web update to take advantage of the new features -- will link here with a couple of screenshots when available.

@mhasself
Member Author

Here's the updated panel (coming together in simonsobs/ocs-web#88) -- the indicator lights show that a docker pull is needed and that one agent needs a restart.

[screenshot: updated HostManager panel in ocs-web]

@BrianJKoopman BrianJKoopman self-requested a review March 31, 2026 19:04
Member

@BrianJKoopman BrianJKoopman left a comment


This looks great!

I did find one small thing during testing, which maybe you can comment on.

If I ensure there is an orphan first, then I go to command a non-orphan docker container, bringing it down and then back up. On the way up it has to wait several seconds (10?) for the orphan to exit, due to the new --remove-orphans flag in the docker compose up command. This generates what looks like a failed start of the agent I'm trying to command. After this initial apparent failed launch, the orphan has been removed, and the second launch succeeds.

Here are some logs, starting from an update call. I then remove fake-data1 from the compose file (making it an orphan), then command fake-data3 down and back up:

2026-03-31T16:47:20-0400 start called for update
2026-03-31T16:47:20-0400 update:19 Status is now "starting".
2026-03-31T16:47:20-0400 update:19 Status is now "running".
2026-03-31T16:47:20-0400 Installed OCS Plugins: ['socs', 'ocs']
2026-03-31T16:47:20-0400 update:19 Update requested.
2026-03-31T16:47:20-0400 update:19 Status is now "done".
2026-03-31T16:47:37-0400 manager:0 Marking missing service for fake-data1
2026-03-31T16:48:47-0400 start called for update
2026-03-31T16:48:47-0400 update:20 Status is now "starting".
2026-03-31T16:48:47-0400 update:20 Status is now "running".
2026-03-31T16:48:47-0400 update:20 Update requested.
2026-03-31T16:48:47-0400 update:20 Status is now "done".
2026-03-31T16:48:47-0400 manager:0 Requesting termination of FakeDataAgent:fake-data3
2026-03-31T16:48:55-0400 manager:0 Agent instance FakeDataAgent:fake-data3 has exited
2026-03-31T16:48:59-0400 start called for update
2026-03-31T16:48:59-0400 update:21 Status is now "starting".
2026-03-31T16:48:59-0400 update:21 Status is now "running".
2026-03-31T16:48:59-0400 update:21 Update requested.
2026-03-31T16:48:59-0400 update:21 Status is now "done".
2026-03-31T16:49:00-0400 manager:0 Requested launch for FakeDataAgent:fake-data3
2026-03-31T16:49:01-0400 manager:0 Launched FakeDataAgent:fake-data3
2026-03-31T16:49:02-0400 manager:0 Detected exit of FakeDataAgent:fake-data3 with code 127.
2026-03-31T16:49:07-0400 manager:0 Requested launch for FakeDataAgent:fake-data3
2026-03-31T16:49:08-0400 manager:0 Launched FakeDataAgent:fake-data3

Is there any way we can be more patient on the up command? Or some other way to mitigate the potentially confusing log message?

@mhasself
Member Author

mhasself commented Apr 1, 2026

Is there any way we can be more patient on the up command? Or some other way to mitigate the potentially confusing log message?

This is actually quite tricky to improve. And it's barely broken... but I'll look into it.

@BrianJKoopman
Member

Is there any way we can be more patient on the up command? Or some other way to mitigate the potentially confusing log message?

This is actually quite tricky to improve. And it's barely broken... but I'll look into it.

Yeah, if it's not simple I'm not super concerned.

I'd say it would be a fairly common occurrence though, when an agent gets removed from the config file without being brought down first, unless the user who removes the agent knows to run remove_orphans() afterwards.

This simplifies state machine logic. Extended start-up timeout to 15
seconds -- this should be enough time for docker "up" to remove
orphans first.
@mhasself
Member Author

mhasself commented Apr 1, 2026

Is there any way we can be more patient on the up command? Or some other way to mitigate the potentially confusing log message?

This is actually quite tricky to improve. And it's barely broken... but I'll look into it.

Yeah, if it's not simple I'm not super concerned.

Should be fixed in latest push.

@BrianJKoopman BrianJKoopman self-requested a review April 1, 2026 20:43
Member

@BrianJKoopman BrianJKoopman left a comment


Great, thanks for those changes!

One more question. I'm finding if I add a new docker managed agent (really I'm removing one and then re-adding it) it gets picked up as 'passive' only. It gets fully picked up if I restart the HM. Is that expected?

Either way, I'm happy with this as is. Let me know what you think and if you plan to make any more changes or if we should address later if at all.

@mhasself
Member Author

mhasself commented Apr 2, 2026

One more question. I'm finding if I add a new docker managed agent (really I'm removing one and then re-adding it) it gets picked up as 'passive' only. It gets fully picked up if I restart the HM. Is that expected?

Did you reload the SCF, with the "Refresh config" button? Without that, it won't know the new docker service is for an agent. It's perhaps annoying that it rescans compose files constantly but reads the SCF only on demand...

@BrianJKoopman
Member

Did you reload the SCF, with "Refresh config" button? Without that, it won't know the new docker service is for an agent. Perhaps annoying that it rescans compose files constantly, but SCF only on demand...

Ah, that does work, but only on the second run of 'update'.

@BrianJKoopman
Member

Some logs after adding the agent back in:

2026-04-02T09:27:43-0400 manager:0 Adding non-agent service "ocs-fake-data1"
2026-04-02T09:27:53-0400 start called for update
2026-04-02T09:27:53-0400 update:12 Status is now "starting".
2026-04-02T09:27:53-0400 update:12 Status is now "running".
2026-04-02T09:27:53-0400 Installed OCS Plugins: ['socs', 'ocs']
2026-04-02T09:27:54-0400 update:12 Update requested.
2026-04-02T09:27:54-0400 update:12 Status is now "done".

At this point it's still marked as 'passive'. Then I run update again:

2026-04-02T09:28:08-0400 start called for update
2026-04-02T09:28:08-0400 update:13 Status is now "starting".
2026-04-02T09:28:08-0400 update:13 Status is now "running".
2026-04-02T09:28:08-0400 update:13 Managed agent "fake-data1" changed agent_class (('FakeDataAgent[d]',) -> FakeDataAgent[d?]) or management (docker -> docker) and is being reset!
2026-04-02T09:28:08-0400 Installed OCS Plugins: ['socs', 'ocs']
2026-04-02T09:28:09-0400 update:13 Update requested.
2026-04-02T09:28:09-0400 update:13 Status is now "done".
2026-04-02T09:28:11-0400 manager:0 Requested launch for FakeDataAgent:fake-data1
2026-04-02T09:28:12-0400 manager:0 Launched FakeDataAgent:fake-data1

Now it's as expected, a fully managed agent.

@mhasself
Member Author

mhasself commented Apr 2, 2026

Wow, thorough testing. Should be fixed now!

@BrianJKoopman BrianJKoopman self-requested a review April 2, 2026 21:42
Member

@BrianJKoopman BrianJKoopman left a comment


Looks good, thanks again for the updates!

@BrianJKoopman BrianJKoopman merged commit 4b0c449 into main Apr 2, 2026
13 checks passed
@BrianJKoopman BrianJKoopman deleted the mhasself/hm-docker-expert branch April 2, 2026 21:44
@BrianJKoopman BrianJKoopman changed the title HostManager: docker expert HostManager: Improve handling of docker containers Apr 2, 2026

Development

Successfully merging this pull request may close these issues.

HostManager leaves containers from old services
