Conversation
---
Hi, thank you for the PR! I haven't checked the code in detail yet, but at a glance it looks fine. I'll answer the questions from #1016 (comment) here:
There is also kind of a second level of utilization that we could somehow show visually: which CPUs are assigned to the given worker. Right now we show all CPUs of the given PC, and the CPUs assigned to tasks, but in theory the worker might be managing only a subset of the CPUs (though that is probably relatively rare? hard to tell). In fact, maybe we should just get rid of "global" CPU utilization and always deal only with the assigned resources (at least for CPUs; for memory it's more complicated). It could be useful follow-up work to 1) only show the worker-assigned CPUs here, and 2) only gather utilization for those CPUs in the worker collection loop. Another follow-up could be to separate memory utilization across tasks (based on the memory utilization of their processes), but that can be very tricky to pull off, as tasks can spawn subprocesses, etc. CC @spirali about the color scheme (blue vs. graying out unassigned CPUs) and whether you think we should show all CPUs on the worker, or only those managed by the worker. The latter would be more consistent with how we treat non-CPU resources.
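A possible shape for that follow-up, as a rough sketch (the helper name and plain `usize` indices are hypothetical, not existing HQ code): filter the per-CPU samples down to the worker-managed subset before displaying or storing them.

```rust
use std::collections::HashSet;

/// Hypothetical helper: keep only per-CPU utilization samples for CPUs
/// managed by the worker. `per_cpu` is assumed to be indexed by CPU id.
fn managed_utilization(per_cpu: &[f64], managed: &HashSet<usize>) -> Vec<(usize, f64)> {
    per_cpu
        .iter()
        .enumerate()
        .filter(|(id, _)| managed.contains(id))
        .map(|(id, util)| (id, *util))
        .collect()
}
```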
---
The majority of our users probably have a whole node, but I am aware of some users that do sub-node SLURM allocations. AFAIK they do not usually have more workers on a single node; they just have a single worker on a node that manages just a subset of the node. Graying out non-managed resources seems good to me. But it should always be possible to see non-managed CPUs (or ideally all node resources). There are usually two use cases:
Let us say that we will eventually have some "top"-like utility. For the first question, we want to see only processes spawned by our task. For the second question, we want to see all processes. So we should probably support both views. I have no strong opinion on what a good default is.
---
Well, the thing is that knowing all resources of the node is not really something that we can robustly know. Currently, we sorta try to guess it for CPUs, but even there it might not be accurate. And for GPUs and other resources, we often might only have access to a subset of the resources. I think that from our point of view, we cannot reliably say what the "whole node" is, and should mostly talk only about resources managed by HQ workers. What we could do, though, is say something like "we detected N additional CPUs on the node" or something.
---
I wrote "in the ideal case"; I know that we cannot provide a "generic HW monitoring tool", but other CPUs are quite easy, so if we can provide them then I think we should show them.
I guess that depends on how SLURM is configured; it is definitely possible to hide some CPUs from HyperQueue. But yeah, I guess that we can show all CPUs (that are visible to us), especially since we already do this today 😆 Just that I would treat it more as auxiliary information rather than the main thing that we want to present.
---
I do agree that not showing the worker-assigned CPU utilization as the default is not ideal, but I wanted to keep the original intention and add this as a feature. But as you mention above, I agree that showing it as the default makes more sense, and opting out to see "all CPUs" could be an alternative. Also, thinking about it now, if the user wants to see how the whole node (all CPUs in the list) is doing, they surely can do it without the fancy colors. So maybe even keeping it simple, without the switch, might be a good option. So I will change the color scheme of the assigned CPUs to the classic green→red, gray out the other CPUs, and set that as the default. If you come to a final decision about the possibility to switch between all CPUs <> assigned CPUs, I can remove it or keep it as is. I can try to brainstorm a bit to come up with a new layout to represent the usage statistics. Alternatively, we can discuss it when your schedule frees up.
---
👍 on all you said. Regarding
you can keep it, but please make the assigned CPUs view be the default. |
---
I vote for keeping the "assigned only" view, or at least sorting the assigned CPUs at the beginning. The original motivation for all this was when I needed to find the utilization of "my" 4 CPUs on a 256+ CPU machine.
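The "assigned first" ordering mentioned here can be sketched with a single sort key (hypothetical helper; plain `usize` instead of HQ's resource index type):

```rust
use std::collections::HashSet;

/// Sort CPU ids so that CPUs assigned to the worker's tasks come first;
/// within each group, keep ascending id order. Hypothetical sketch.
fn sort_assigned_first(cpus: &mut [usize], assigned: &HashSet<usize>) {
    // `false` sorts before `true`, so assigned CPUs lead.
    cpus.sort_by_key(|id| (!assigned.contains(id), *id));
}
```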
Force-pushed from d70f387 to f61e2f7
---
As we discussed above, I now gray out the unused CPUs in the worker CPU utilization, and it is set as the default. I also added the sorting of the used CPUs, as mentioned by Ada, because I thought that on nodes bigger than mine it makes sense to be able to quickly locate and read the data I want. In addition, the worker utilization view now shows the usage of the assigned CPUs, and also their count. Let me know if that is OK or not. Idea: it might be misleading to show "(5 CPUS)" if the worker can work with all of them, so maybe {num_used}/{num_available}, e.g. "(5/16 CPUS)", would be clearer. I am not sure about the change in the title, i.e. whether "Node" is the correct thing to call it, as you discussed above. Demonstration of the current state below:
---
I didn't like the sorting at first, but I guess it makes sense, there's not really much point in seeing the CPUs ordered by their ID (except for hyper-threading). Node utilization sounds good. In the worker utilization, there indeed might be some ways of separating assigned vs managed CPUs, but the space in the title is quite limited. Now that I see the output, I wonder if we should actually have three views:
@spirali What do you think of the three views idea? :) One problem with that is that we cannot easily determine the memory utilization of the worker based on tasks and CPUs. But showing the memory utilization in the title is a bit of a hack anyway; maybe we could show that elsewhere, and keep the CPU view only for CPUs, not memory?
---
Adding the extra view shouldn't be a problem. The question is how I get the information about the managed CPUs vs. the node CPUs. This seems to be hard to acquire, AFAIK. Correct me if I am wrong, but right now only the assigned CPUs are 'easy' to distinguish.
---
I think you already have all the information. The node list of CPUs is returned from the HW overview. The CPU resources managed by the worker are known, they are stored in the worker configuration. And you also know the CPUs of tasks that are currently assigned to the worker. |
I think that may be slightly redundant, as the worker view is a kind of subset of the node view, and these two views match for many people. I see the worker/node view more like an on/off switch on the worker view. But generally I do not care much, as long as I can see both pieces of information. BTW: I like the sorted CPU view.
---
I think that in most cases the worker and node CPU list will be the same, so the separate node view wouldn't be displayed at all. And I'd rather have consistent UI controls where we switch between three views, rather than have two views where one of them has two modes :) |
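Under that assumption, the set of offered views could be computed dynamically, roughly like this (a sketch; the enum and function names are made up, not the PR's actual code):

```rust
#[derive(Debug, PartialEq)]
enum CpuView {
    /// Only CPUs assigned to currently running tasks.
    Assigned,
    /// All CPUs managed by the worker.
    Worker,
    /// All CPUs visible on the node.
    Node,
}

/// Offer the separate node view only when the worker manages a strict
/// subset of the CPUs detected on the node (hypothetical logic).
fn available_views(node_cpus: usize, managed_cpus: usize) -> Vec<CpuView> {
    let mut views = vec![CpuView::Assigned, CpuView::Worker];
    if managed_cpus < node_cpus {
        views.push(CpuView::Node);
    }
    views
}
```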
Force-pushed from aa43839 to 9361df1
---
Also, while implementing this I was thinking about wrapping the CPU utilization into a class; that would lessen the clutter in the draw method of the WorkerDetail. Let me know if this would be wanted, as right now it is the only function-based widget in the WorkerDetail.
Force-pushed from de1868c to b3c250c
---
I have moved the utilization list into a separate class, which makes the draw() method in WorkerDetail less cluttered. Otherwise the functionality stayed the same as mentioned above. The only difference is that CPUs are now ordered across columns, not rows as before, because of the current filling of the rows. If this is not to your liking, it could be rewritten, but the code wouldn't be as clean. Before merging, the last commits will be squashed; for now I kept them in case you don't like the added changes :)
Kobzol left a comment:
Thank you, the UI looks great. Left some comments about the implementation.
Regarding the column vs row sorting, it's quite hard for me to read when the numbers grow to the right. When I look at a single CPU, my eyes see what is below it, but not what is to the (far) right/left of it. So I would like to go back to the previous ordering.
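For reference, the two candidate layouts differ only in index math (a sketch with hypothetical names): row-major places consecutive CPU ids to the right, column-major places them below each other.

```rust
/// Place the i-th CPU in a grid: consecutive ids go to the right (row-major).
fn row_major(i: usize, cols: usize) -> (usize, usize) {
    (i / cols, i % cols)
}

/// Place the i-th CPU in a grid: consecutive ids go downward (column-major).
fn column_major(i: usize, rows: usize) -> (usize, usize) {
    (i % rows, i / rows)
}
```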
```rust
#[derive(Debug, Default)]
pub enum CpuScope {
    #[default]
    Node,
```
Some comments on this would be nice, to explain what those modes mean.
Actually, after reading the code, I think that this is unnecessarily complex. We know the node CPUs from self.utilization.cpus (well, actually we just guesstimate that the indices of CPUs returned here correspond to the actual CPU core indices from the node, which will not always hold, but let's not deal with that in this PR), we know the managed CPUs from the resource definition, and we know the CPUs assigned to tasks from the worker overview.
We can remember the last two things as normal fields, and then compute if the node view should be available dynamically, by comparing if the managed CPU set is the same as the CPU set from utilization.
So I think we should change the data model to deal with less stuff.
- If there is no worker configuration available at a given time, there is nothing to show (it should always be available if there is other data at the given time).
- If there is worker configuration available, we must know `managed_cpus`.
- If there is no overview available at a given time, there is nothing to show.
- If there is overview available, we must have `assigned_cpus` available.
- If there is overview available, we may have HW utilization available. But that should not contain `assigned_cpus`, because that is unrelated to HW utilization.
So something like this:

```rust
struct Utilization {
    cpu: Vec<f64>,
    memory: MemoryStats,
}

struct WorkerCpuState {
    /// CPU core resources currently managed by this worker.
    managed_cpus: Set<ResourceIndex>,
    /// CPU cores assigned to at least a single task that is currently running on this worker.
    assigned_cpus: Set<ResourceIndex>,
}

#[derive(Default)]
pub struct CpuUtilTable {
    utilization: Option<Utilization>,
    cpu_state: Option<WorkerCpuState>, // If this is None, we skip drawing
    cpu_view_mode: CpuViewMode,
}
```

What do you think?
That seems reasonable, thanks for the feedback 🙂 I was trying to use the enums so that it is explicit which state it is in. I will have a look at it ASAP.
```rust
.workers()
.query_worker_overview_at(worker_id, data.current_time())
{
    let worker_used_cpus: Vec<ResourceIndex> = match self.cpu_view_mode {
```
This should be a set. Not because of performance, but because in theory multiple tasks can be assigned the same CPU, because HQ supports fractional tasks.
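The deduplication point can be illustrated with a tiny sketch (hypothetical helper; plain `usize` indices): with fractional requests, two tasks may list the same CPU, and collecting into a set reports it once.

```rust
use std::collections::HashSet;

/// Collect the CPUs assigned to any running task into a set; a Vec would
/// list a CPU twice when two fractional tasks share it.
fn assigned_cpu_set(task_cpus: &[Vec<usize>]) -> HashSet<usize> {
    task_cpus.iter().flatten().copied().collect()
}
```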
True, thanks for noting that. Right now I have only tested it with whole CPUs, not fractions; that would have caused weird behavior.
```rust
}

if let Some(configuration) = worker_config {
    let managed_cpus: Vec<&ResourceDescriptorItem> = configuration
```
This should only ever be a single resource; there should never be fewer than one cpus resource, nor more than one. We can assert that.
---
I like the final UI. FYI: the biggest machine I am aware of where HQ is used has 512 cores per node.
Force-pushed from b3c250c to 48eb3f3, then from 48eb3f3 to 52b2cac







This PR adds an option to show utilization with the CPUs used by running pinned tasks assigned to a worker distinguished. Without pinning, the feature currently doesn't work well.
Currently, a change of color scheme is used to distinguish the CPUs, but graying out the other CPUs might be better.
The added keybinding is 'c', which lets the user toggle between global and worker-specific CPU usage.
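A minimal sketch of what the 'c' toggle could look like internally (assumed enum and names, not the actual PR code):

```rust
/// The two CPU views toggled by the 'c' key (hypothetical names).
#[derive(Debug, PartialEq, Clone, Copy)]
enum CpuMode {
    Global,
    WorkerSpecific,
}

/// Flip the view mode; called when 'c' is pressed.
fn toggle(mode: CpuMode) -> CpuMode {
    match mode {
        CpuMode::Global => CpuMode::WorkerSpecific,
        CpuMode::WorkerSpecific => CpuMode::Global,
    }
}
```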
Features that might be connected with this:
This is part of the feature requested in #1016