Conversation
---
Hi, thank you for the PR! I haven't checked the code in detail yet, but at a glance it looks fine. I'll answer the questions from #1016 (comment) here:
There is also kind of a second level of utilization that we could somehow show visually: which CPUs are assigned to the given worker. Right now we show all CPUs of the given PC, and the CPUs assigned to tasks, but in theory the worker might be managing only a subset of the CPUs (though that is probably relatively rare? hard to tell). In fact, maybe we should just get rid of "global" CPU utilization and always deal only with the assigned resources (at least for CPUs; for memory it's more complicated). It could be useful follow-up work to 1) only show the worker-assigned CPUs here, and 2) only gather utilization for those CPUs in the worker collection loop. Another follow-up could be to separate memory utilization across tasks (based on the memory utilization of their processes), but that can be very tricky to pull off, as tasks can spawn subprocesses, etc. CC @spirali about the color scheme (blue vs. graying out unassigned CPUs) and whether you think we should show all CPUs on the worker, or only those managed by the worker. The latter would be more consistent with how we treat non-CPU resources.
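A possible shape for that follow-up, as a rough sketch (the helper name and plain `usize` indices are hypothetical, not existing HQ code): filter the per-CPU samples down to the worker-managed subset before displaying or storing them.

```rust
use std::collections::HashSet;

/// Hypothetical helper: keep only per-CPU utilization samples for CPUs
/// managed by the worker. `per_cpu` is assumed to be indexed by CPU id.
fn managed_utilization(per_cpu: &[f64], managed: &HashSet<usize>) -> Vec<(usize, f64)> {
    per_cpu
        .iter()
        .enumerate()
        .filter(|(id, _)| managed.contains(id))
        .map(|(id, util)| (id, *util))
        .collect()
}
```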
---
The majority of our users probably have a whole node, but I am aware of some users that do sub-node SLURM allocations. AFAIK they do not usually have more workers on a single node; they just have a single worker on a node that manages just a subset of the node. Graying out non-managed resources seems good to me. But it should always be possible to see non-managed CPUs (or ideally all node resources). There are usually two use cases:
Let us say that we will eventually have some "top"-like utility. For the first question, we want to see only processes spawned by our task. For the second question, we want to see all processes. So we should probably support both views. I have no strong opinion on what a good default is.
---
Well, the thing is that knowing all resources of the node is not really something that we can robustly know. Currently, we sorta try to guess it for CPUs, but even there it might not be accurate. And for GPUs and other resources, we often might only have access to a subset of the resources. I think that from our point of view, we cannot reliably say what the "whole node" is, and should mostly talk only about resources managed by HQ workers. What we could do, though, is say something like "we detected N additional CPUs on the node" or something.
---
I wrote "in the ideal case"; I know that we cannot provide a "generic HW monitoring tool", but other CPUs are quite easy, so if we can provide them then I think we should show them.
I guess that depends on how SLURM is configured; it is definitely possible to hide some CPUs from HyperQueue. But yeah, I guess that we can show all CPUs (that are visible to us), especially since we already do this today 😆 Just that I would treat it more as auxiliary information rather than the main thing that we want to present.
---
I do agree that not showing the worker-assigned CPU utilization as the default is not ideal, but I wanted to keep the original intention and add this as a feature. But as you mention above, I agree that showing it as the default makes more sense, and opting out to see "all CPUs" could be an alternative. Also, thinking about it now, if the user wants to see how the whole node (all CPUs in the list) is doing, they surely can do it without the fancy colors. So maybe even keeping it simple, without the switch, might be a good option. So I will change the color scheme of the assigned CPUs to the classic green→red, gray out the other CPUs, and set that as the default. If you come to a final decision about the possibility to switch between all CPUs <> assigned CPUs, I can remove it or keep it as is. I can try to brainstorm a bit to come up with a new layout to represent the usage statistics. Alternatively, we can discuss it when your schedule frees up.
---
👍 on all you said. Regarding
you can keep it, but please make the assigned CPUs view be the default. |
---
I vote for keeping the "assigned only" view, or at least sorting the assigned CPUs at the beginning. The original motivation for all this was when I needed to find the utilization of "my" 4 CPUs on a 256+ CPU machine.
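The "assigned first" ordering mentioned here can be sketched with a single sort key (hypothetical helper; plain `usize` instead of HQ's resource index type):

```rust
use std::collections::HashSet;

/// Sort CPU ids so that CPUs assigned to the worker's tasks come first;
/// within each group, keep ascending id order. Hypothetical sketch.
fn sort_assigned_first(cpus: &mut [usize], assigned: &HashSet<usize>) {
    // `false` sorts before `true`, so assigned CPUs lead.
    cpus.sort_by_key(|id| (!assigned.contains(id), *id));
}
```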
Force-pushed from d70f387 to f61e2f7
---
As we discussed above, I now gray out the unused CPUs in the worker CPU utilization, and it is set as the default. I also added the sorting of the used CPUs, as mentioned by Ada, because I thought that on nodes bigger than mine it makes sense to be able to quickly locate and read the data I want. In addition, the worker utilization view now shows the usage of the assigned CPUs, and also their count. Let me know if that is OK or not. Idea: it might be misleading to show "(5 CPUS)" if the worker can work with all of them, so maybe {num_used}/{num_available}, e.g. "(5/16 CPUS)", would be clearer. I am not sure about the change in the title, i.e. whether "Node" is the correct thing to call it, as you discussed above. Demonstration of the current state below:
---
I didn't like the sorting at first, but I guess it makes sense, there's not really much point in seeing the CPUs ordered by their ID (except for hyper-threading). Node utilization sounds good. In the worker utilization, there indeed might be some ways of separating assigned vs managed CPUs, but the space in the title is quite limited. Now that I see the output, I wonder if we should actually have three views:
@spirali What do you think of the three views idea? :) One problem with that is that we cannot easily determine the memory utilization of the worker based on tasks and CPUs. But showing the memory utilization in the title is a bit of a hack anyway; maybe we could show that elsewhere, and keep the CPU view only for CPUs, not memory?
---
Adding the extra view shouldn't be a problem. The question is how I get the information about the managed CPUs vs. the node CPUs. This seems to be hard to acquire, AFAIK. Correct me if I am wrong, but right now only the assigned CPUs are 'easy' to distinguish.
---
I think you already have all the information. The node list of CPUs is returned from the HW overview. The CPU resources managed by the worker are known, they are stored in the worker configuration. And you also know the CPUs of tasks that are currently assigned to the worker. |
I think that may be slightly redundant, as the worker view is a kind of subset of the node view, and these two views match for many people. I see the worker/node view more like an on/off switch on the worker view. But generally I do not care much, as long as I can see both pieces of information. BTW: I like the sorted CPU view.
---
I think that in most cases the worker and node CPU list will be the same, so the separate node view wouldn't be displayed at all. And I'd rather have consistent UI controls where we switch between three views, rather than have two views where one of them has two modes :) |
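Under that assumption, the set of offered views could be computed dynamically, roughly like this (a sketch; the enum and function names are made up, not the PR's actual code):

```rust
#[derive(Debug, PartialEq)]
enum CpuView {
    /// Only CPUs assigned to currently running tasks.
    Assigned,
    /// All CPUs managed by the worker.
    Worker,
    /// All CPUs visible on the node.
    Node,
}

/// Offer the separate node view only when the worker manages a strict
/// subset of the CPUs detected on the node (hypothetical logic).
fn available_views(node_cpus: usize, managed_cpus: usize) -> Vec<CpuView> {
    let mut views = vec![CpuView::Assigned, CpuView::Worker];
    if managed_cpus < node_cpus {
        views.push(CpuView::Node);
    }
    views
}
```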
Force-pushed from aa43839 to 9361df1
---
Also, while implementing this I was thinking about wrapping the CPU utilization into a class; that would lessen the clutter in the draw method of the WorkerDetail. Let me know if this would be wanted, as right now it is the only function-based widget in the WorkerDetail.
Force-pushed from de1868c to b3c250c
---
I have moved the utilization list into a separate class, which makes the draw() method in WorkerDetail less cluttered. Otherwise the functionality stayed the same as mentioned above. The only difference is that CPUs are now ordered across columns, not rows as before, because of the current filling of the rows. If this is not to your liking, it could be rewritten, but the code wouldn't be as clean. Before merging, the last commits will be squashed; for now I kept them in case you don't like the added changes :)
Kobzol left a comment:
Thank you, the UI looks great. Left some comments about the implementation.
Regarding the column vs row sorting, it's quite hard for me to read when the numbers grow to the right. When I look at a single CPU, my eyes see what is below it, but not what is to the (far) right/left of it. So I would like to go back to the previous ordering.
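For reference, the two candidate layouts differ only in index math (a sketch with hypothetical names): row-major places consecutive CPU ids to the right, column-major places them below each other.

```rust
/// Place the i-th CPU in a grid: consecutive ids go to the right (row-major).
fn row_major(i: usize, cols: usize) -> (usize, usize) {
    (i / cols, i % cols)
}

/// Place the i-th CPU in a grid: consecutive ids go downward (column-major).
fn column_major(i: usize, rows: usize) -> (usize, usize) {
    (i % rows, i / rows)
}
```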
```rust
#[derive(Debug, Default)]
pub enum CpuScope {
    #[default]
    Node,
```
Some comments on this would be nice, to explain what those modes mean.
Actually, after reading the code, I think that this is unnecessarily complex. We know the node CPUs from self.utilization.cpus (well, actually we just guesstimate that the indices of CPUs returned here correspond to the actual CPU core indices from the node, which will not always hold, but let's not deal with that in this PR), we know the managed CPUs from the resource definition, and we know the CPUs assigned to tasks from the worker overview.
We can remember the last two things as normal fields, and then compute if the node view should be available dynamically, by comparing if the managed CPU set is the same as the CPU set from utilization.
So I think we should change the data model to deal with less stuff.
- If there is no worker configuration available at a given time, there is nothing to show (it should always be available if there is other data at the given time).
- If there is worker configuration available, we must know `managed_cpus`.
- If there is no overview available at a given time, there is nothing to show.
- If there is overview available, we must have `assigned_cpus` available.
- If there is overview available, we may have HW utilization available. But that should not contain `assigned_cpus`, because that is unrelated to HW utilization.
So something like this:

```rust
struct Utilization {
    cpu: Vec<f64>,
    memory: MemoryStats,
}

struct WorkerCpuState {
    /// CPU core resources currently managed by this worker.
    managed_cpus: Set<ResourceIndex>,
    /// CPU cores assigned to at least a single task that is currently running on this worker.
    assigned_cpus: Set<ResourceIndex>,
}

#[derive(Default)]
pub struct CpuUtilTable {
    utilization: Option<Utilization>,
    cpu_state: Option<WorkerCpuState>, // If this is None, we skip drawing
    cpu_view_mode: CpuViewMode,
}
```

What do you think?
That seems reasonable, thanks for the feedback 🙂 I was trying to use the enums so that it is explicit which state it is in. I will have a look at it ASAP.
```rust
.workers()
.query_worker_overview_at(worker_id, data.current_time())
{
    let worker_used_cpus: Vec<ResourceIndex> = match self.cpu_view_mode {
```
This should be a set. Not because of performance, but because in theory multiple tasks can be assigned the same CPU, because HQ supports fractional tasks.
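The deduplication point can be illustrated with a tiny sketch (hypothetical helper; plain `usize` indices): with fractional requests, two tasks may list the same CPU, and collecting into a set reports it once.

```rust
use std::collections::HashSet;

/// Collect the CPUs assigned to any running task into a set; a Vec would
/// list a CPU twice when two fractional tasks share it.
fn assigned_cpu_set(task_cpus: &[Vec<usize>]) -> HashSet<usize> {
    task_cpus.iter().flatten().copied().collect()
}
```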
True, thanks for noting that. Right now I have only tested it with whole CPUs, not fractions; that would have caused weird behavior.
```rust
}

if let Some(configuration) = worker_config {
    let managed_cpus: Vec<&ResourceDescriptorItem> = configuration
```
This should only ever be a single resource; there should never be fewer than one cpus resource, nor more than one. We can assert that.
---
I like the final UI. FYI: the biggest machine I am aware of where HQ is used has 512 cores per node.
Force-pushed from b3c250c to 48eb3f3, then from 48eb3f3 to 52b2cac







This PR adds an option to show utilization with the CPUs used by running pinned tasks assigned to a worker distinguished. Without pinning, the feature currently doesn't work well.
Currently, a change of color scheme is used to distinguish the CPUs, but graying out the other CPUs might be better.
The added keybinding is 'c', which lets the user toggle between global and worker-specific CPU usage.
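A minimal sketch of what the 'c' toggle could look like internally (assumed enum and names, not the actual PR code):

```rust
/// The two CPU views toggled by the 'c' key (hypothetical names).
#[derive(Debug, PartialEq, Clone, Copy)]
enum CpuMode {
    Global,
    WorkerSpecific,
}

/// Flip the view mode; called when 'c' is pressed.
fn toggle(mode: CpuMode) -> CpuMode {
    match mode {
        CpuMode::Global => CpuMode::WorkerSpecific,
        CpuMode::WorkerSpecific => CpuMode::Global,
    }
}
```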
Features that might be connected with this:
This is part of the feature requested in #1016