14e632a to c90dd80
📝 Walkthrough

Adds a complete failover reservation subsystem: a new controller, periodic and watch reconciliations, a DB-backed VM source (Postgres/Nova), an external scheduler client and pipelines, CRD/status fields and label key updates, Helm pipelines/templates, extensive unit/integration tests, docs, and a visualization CLI.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller as Failover Controller
    participant VMSource as DB VM Source
    participant Scheduler as External Scheduler
    participant KubeAPI as Kubernetes API
    Controller->>VMSource: ListVMs(ctx)
    VMSource-->>Controller: []VM
    Controller->>KubeAPI: List Hypervisors
    KubeAPI-->>Controller: []Hypervisor
    Controller->>KubeAPI: List Reservations (failover)
    KubeAPI-->>Controller: []Reservation
    loop For each VM needing failover coverage
        Controller->>Scheduler: ScheduleReservationRequest(pipeline)
        Scheduler-->>Controller: []candidate hosts
        alt Reuse path
            Controller->>Controller: tryReuseExistingReservation(vm, candidates)
            Controller->>KubeAPI: Patch Reservation (add VM)
            KubeAPI-->>Controller: Updated Reservation
        else New reservation path
            Controller->>Controller: scheduleAndBuildNewFailoverReservation(vm, candidates)
            Controller->>KubeAPI: Create Reservation
            KubeAPI-->>Controller: Created Reservation
        end
    end
```

```mermaid
sequenceDiagram
    participant Watch as Watch Event
    participant Controller as Failover Controller
    participant Scheduler as External Scheduler
    participant KubeAPI as Kubernetes API
    Watch->>Controller: Reservation Event (failover)
    Controller->>KubeAPI: Get Reservation
    KubeAPI-->>Controller: Reservation
    loop For each allocated VM
        Controller->>Scheduler: validateVMViaSchedulerEvacuation(vm, host)
        Scheduler-->>Controller: isValid
        alt invalid
            Controller->>KubeAPI: Patch Reservation.Status (invalid)
            KubeAPI-->>Controller: Patched
        end
    end
    alt all valid
        Controller->>KubeAPI: Patch Reservation.Status.AcknowledgedAt
        KubeAPI-->>Controller: Patched
    end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Pull request overview
Adds a proof-of-concept failover reservation controller that maintains pre-reserved evacuation capacity for VMs by creating/reusing/validating Reservation CRs, backed by Nova DB reads and scheduler pipeline calls.
Changes:
- Introduces `FailoverReservationController` with periodic bulk reconciliation and watch-based validation/acknowledgment.
- Adds a small external scheduler HTTP client plus Postgres/Nova DB reader utilities and a VM datasource implementation.
- Extends the Reservation API/CRD with failover acknowledgment fields and adds dedicated failover scheduler pipelines.
Reviewed changes
Copilot reviewed 23 out of 25 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| internal/scheduling/reservations/scheduler_client.go | New HTTP client for calling the external nova scheduler API. |
| internal/scheduling/reservations/failover/controller.go | Main failover controller (periodic + watch-based reconciliation). |
| internal/scheduling/reservations/failover/reservation_scheduling.go | Scheduling logic to reuse/create reservations via pipelines. |
| internal/scheduling/reservations/failover/reservation_eligibility.go | HA eligibility constraints + resource fit checks. |
| internal/scheduling/reservations/failover/vm_source.go | VM datasource implementation using Nova DB plus hypervisor CRD enrichment/filtering. |
| internal/scheduling/reservations/failover/helpers.go | Helpers to build/update failover reservations and allocations. |
| internal/scheduling/reservations/failover/*.go (tests) | Unit tests for scheduling/eligibility/controller helpers. |
| internal/scheduling/reservations/failover/config.go | Failover controller configuration and defaults. |
| internal/scheduling/reservations/controller/controller.go | Filters failover reservations out of the existing reservations controller watch. |
| internal/scheduling/nova/plugins/filters/filter_has_enough_capacity.go | Capacity filter hardened for missing capacity/unknown hosts. |
| internal/scheduling/external/postgres.go | New generic Postgres reader backed by a Datasource CRD secret ref. |
| internal/scheduling/external/nova.go | NovaReader built on PostgresReader (servers/flavors/hypervisors/etc.). |
| api/v1alpha1/reservation_types.go | Adds LastChanged and AcknowledgedAt to failover reservation status. |
| api/v1alpha1/zz_generated.deepcopy.go | DeepCopy updates for new failover status fields. |
| helm/library/cortex/files/crds/cortex.cloud_reservations.yaml | CRD schema updates for new failover status fields. |
| helm/bundles/cortex-nova/templates/pipelines_kvm.yaml | Adds three new KVM pipelines for failover reservation workflows. |
| cmd/main.go | Wires up the new failover controller (config/defaulting/runner). |
| docs/failover-reservations.md | Documentation for the failover reservation system and workflows. |
| Makefile / .gitignore / helm/* | Build and template hygiene updates. |
I didn't have time to review yet, but we should not use …
Actionable comments posted: 17
🧹 Nitpick comments (1)
internal/scheduling/reservations/failover/reservation_eligibility_test.go (1)
Lines 613-698: Cover the resource shape the controller actually produces. These cases only exercise reservations with `Spec.Resources["vcpus"]`, but `buildFailoverReservation()` in `internal/scheduling/reservations/failover/helpers.go` writes CPU under `Spec.Resources["cpu"]`. The suite can stay green while the controller fails to reuse the reservations it just created.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/reservation_eligibility_test.go` around lines 613 - 698, Tests use Spec.Resources["vcpus"] but the controller (buildFailoverReservation in helpers.go) writes CPU as Spec.Resources["cpu"], so add test cases (or update existing reservation fixtures via makeReservationWithResources) to include the "cpu" resource key (and/or include both "cpu" and "vcpus" to be explicit) with the same quantities used now for "vcpus"; ensure the cases that expect fit/fail check the "cpu" entry on the reservation fixtures so the test suite covers the real resource shape produced by buildFailoverReservation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/failover-reservations.md`:
- Around line 159-160: Finish the truncated sentence describing the periodic
evacuation check: complete the acknowledgment pipeline description by stating
that if the scheduler rejects the evacuation-style scheduling request (which
targets only the reservation's host), the reservation should be marked invalid,
the acknowledgment pipeline should notify the reservation owner/clients of the
failure, and trigger any cleanup or reallocation steps (e.g., release the
reserved host and update reservation state) so downstream systems can respond.
- Around line 7-15: The fenced code block showing the directory listing for
"internal/scheduling/reservations/failover/" is missing a language identifier,
causing markdown lint failures; update the opening triple-backtick to include a
language (use "text") for the block that contains the lines starting with
"internal/scheduling/reservations/failover/" so the snippet is fenced as ```text
... ``` and rendering/linting will pass.
In `@internal/scheduling/external/postgres.go`:
- Around line 52-64: DB() performs lazy initialization of r.db without
synchronization, causing a data race; protect initialization by adding a
synchronisation primitive on PostgresReader (e.g., a sync.Mutex field like
dbInitMu or a sync.Once plus an error field) and use it around reads/writes to
r.db. Specifically, modify the DB method to acquire the mutex (or use once.Do)
before checking r.db, call db.Connector{Client: r.Client}.FromSecretRef(...)
only when r.db is nil, set r.db (and any stored error) while still holding the
lock/within once, and return the stored db/error to concurrent callers; ensure
the mutex/once is declared on the PostgresReader type so the initialization is
thread-safe.
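The race fix described above is the classic `sync.Once` pattern. A minimal, self-contained sketch — not the project's code: `lazyDB` and its string-valued `db` field stand in for `PostgresReader` and `*sql.DB`, and `connect` stands in for the `db.Connector{...}.FromSecretRef` call:

```go
package main

import (
	"fmt"
	"sync"
)

// lazyDB sketches thread-safe lazy initialization with sync.Once.
// connect stands in for the project's secret-ref connector (hypothetical here).
type lazyDB struct {
	once    sync.Once
	db      string // stand-in for *sql.DB
	initErr error
	inits   int // counts how often connect actually ran
	connect func() (string, error)
}

// DB returns the shared handle, connecting exactly once even under concurrency.
// Concurrent callers block in once.Do until the first connect finishes, then
// all observe the same stored handle/error.
func (r *lazyDB) DB() (string, error) {
	r.once.Do(func() {
		r.inits++
		r.db, r.initErr = r.connect()
	})
	return r.db, r.initErr
}

func main() {
	r := &lazyDB{connect: func() (string, error) { return "conn", nil }}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); r.DB() }()
	}
	wg.Wait()
	fmt.Println(r.inits) // connect ran exactly once: 1
}
```

Note that storing the error alongside the handle (as the comment suggests) means a failed first connect is also cached; if retrying failed connections matters, a mutex with a nil-check is the better primitive than `sync.Once`.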
In `@internal/scheduling/reservations/controller/controller.go`:
- Around line 473-505: The predicate notFailoverReservationPredicate currently
uses only labels to detect failover reservations; change each predicate function
(CreateFunc, UpdateFunc, DeleteFunc, GenericFunc) to first check res.Spec.Type
for the failover type (e.g. res.Spec.Type == v1alpha1.ReservationTypeFailover)
and treat that as failover, and only if Spec.Type is empty/unknown fall back to
the existing label check (res.Labels[v1alpha1.LabelReservationType] ==
v1alpha1.ReservationTypeLabelFailover); invert the result so the predicate
returns false for failover based on Spec.Type or label. Apply the same
Spec-first logic to the other similar predicate referenced near the same block.
In `@internal/scheduling/reservations/failover/controller.go`:
- Around line 795-810: patchReservationStatus currently overwrites
current.Status with res.Status after refetching, which can clobber concurrent
updates; instead make the update conflict-safe by only patching the specific
fields you changed (e.g., allocations, LastChanged, AcknowledgedAt) or by
performing a read-modify-patch retry: fetch latest reservation, merge only the
delta from res into current.Status, create client.MergeFrom(old) from that
latest old copy, and retry the GET/modify/Patch loop on conflict (exponential
backoff / limited attempts) so you never blindly replace the whole status; refer
to function names current.Status, res.Status, patchReservationStatus,
client.MergeFrom and c.Status().Patch to locate where to apply the change.
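The read-modify-patch retry can be sketched generically. This is not the project's code: `fetch` and `persist` are hypothetical stand-ins for the client's `Get` and `Status().Patch`, and `errConflict` stands in for a Kubernetes 409 conflict; in a real controller, client-go's `retry.RetryOnConflict` provides exactly this loop:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes 409 Conflict (apierrors.IsConflict).
var errConflict = errors.New("conflict")

// updateWithRetry sketches the suggested loop: always refetch the latest
// object, apply only your delta to that fresh copy, and retry on conflict
// instead of overwriting the whole status with a stale in-memory copy.
func updateWithRetry(attempts int, fetch func() (int, error), mutate func(int) int, persist func(old, updated int) error) (int, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		cur, err := fetch() // start from the server's latest view
		if err != nil {
			return 0, err
		}
		next := mutate(cur) // apply only our delta
		if err := persist(cur, next); err != nil {
			if errors.Is(err, errConflict) {
				lastErr = err
				continue // someone else wrote first; refetch and retry
			}
			return 0, err
		}
		return next, nil
	}
	return 0, lastErr
}

func main() {
	// Simulated server state; another writer bumps it once, causing one conflict.
	server := 10
	conflicted := false
	fetch := func() (int, error) { return server, nil }
	persist := func(old, updated int) error {
		if !conflicted {
			conflicted = true
			server = 11 // concurrent writer wins the first round
			return errConflict
		}
		if old != server {
			return errConflict
		}
		server = updated
		return nil
	}
	got, err := updateWithRetry(5, fetch, func(v int) int { return v + 1 }, persist)
	fmt.Println(got, err) // delta lands on top of the concurrent write: 12 <nil>
}
```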
- Around line 147-162: The controller is writing a new AcknowledgedAt on every
successful reconcile which, combined with the predicate that watches all
updates, causes a self-triggering loop; change the logic in the reconcile path
around updatedRes/updatedRes.Status.FailoverReservation.AcknowledgedAt and
patchReservationStatus so you only update AcknowledgedAt when it is nil or older
than the current validation timestamp (i.e., skip the patch when the existing
FailoverReservation.AcknowledgedAt already equals or is newer than
metav1.Now()), and/or only patch when some other status fields actually changed;
additionally update the predicate (the failover-reservation update predicate) to
ignore pure AcknowledgedAt-only status updates so status-only ACK writes do not
requeue the controller.
- Around line 286-293: The loop over reservationsToUpdate applies in-memory
mutations to the failoverReservations working set and logs
patchReservationStatus errors but continues, causing later steps (e.g., the
later loop at lines 302-307) to act on uncommitted state; change the logic in
the controller.go code that calls patchReservationStatus so that on any patch
error you either (A) abort the operation and return the error up (do not proceed
with further processing of failoverReservations) or (B) immediately re-fetch the
authoritative reservation from the API (e.g., via the controller's get/fetch
method) and update the in-memory failoverReservations entry before continuing;
apply the same behavior to the second loop at 302-307 to ensure no subsequent
steps operate on mutations that failed to persist.
- Around line 340-347: ReconcilePeriodic computes a dynamic RequeueAfter
(shorter when hitMaxVMsLimit) but Start currently ignores that and always sleeps
on Config.ReconcileInterval; update Start to respect the ctrl.Result from
ReconcilePeriodic by using the returned RequeueAfter as the next wait duration
(e.g., use time.After(result.RequeueAfter) or adjust the ticker interval)
instead of a fixed ticker, ensuring ReconcilePeriodic, MaxVMsToProcess,
Config.ShortReconcileInterval and Config.ReconcileInterval are used; apply the
same change to the other Start-like loop referenced (lines ~847-861) so
ShortReconcileInterval can take effect in production.
In `@internal/scheduling/reservations/failover/helpers.go`:
- Around line 123-130: buildFailoverReservation() currently omits the VM's
AvailabilityZone when constructing the reservation Spec so
reservation.Spec.AvailabilityZone becomes empty; update buildFailoverReservation
to set Spec.AvailabilityZone = vm.AvailabilityZone (or propagate
vm.AvailabilityZone into the created v1alpha1.ReservationSpec) so the created
reservation preserves the AZ constraint for later reuse/revalidation.
In `@internal/scheduling/reservations/failover/reservation_eligibility_test.go`:
- Around line 1137-1198: TestSymmetryOfEligibility never asserts vm1Eligible so
symmetry isn't validated; call IsVMEligibleForReservation for tc.vm1 against
tc.reservation and compare to tc.vm1Eligible (same as vm2 check) to assert the
mirrored direction, and for fixtures where vm1 already occupies the reservation
(the asymmetric case) either create a mirrored fixture (swap vm roles or use a
separate reservation/setup) so the reverse check is meaningful; update the test
to run both vm2Result := IsVMEligibleForReservation(tc.vm2, ...) and vm1Result
:= IsVMEligibleForReservation(tc.vm1, ...) and assert both against
tc.vm2Eligible and tc.vm1Eligible respectively.
In `@internal/scheduling/reservations/failover/reservation_eligibility.go`:
- Around line 59-63: When recording the reservation host, use the desired host
as a fallback: if res.Status.Host is empty, set resHost = res.Spec.TargetHost
before calling ensureResInMaps and before assigning reservationToHost[resName] =
resHost; update the same pattern at the other occurrences noted (around where
resHost is read and reservationToHost is written) so unscheduled/pending
reservations record Spec.TargetHost instead of the empty string.
- Around line 57-70: The loop (over allFailoverReservations) and related map
logic collapse reservations by res.Name causing namespace collisions and
stale-candidate merging; change to use a unique key like resKey := res.Namespace
+ "/" + res.Name everywhere (including in ensureResInMaps, reservationToHost,
reservationToVMs, vmToReservations, vmToReservationHosts, vmToCurrentHypervisor)
and when handling the candidate entry ensure you replace the reservation entry
rather than only appending if missing (adjust IsVMEligibleForReservation usage
to expect/operate on the namespaced resKey). Apply the same namespaced-key fix
to the other affected blocks noted (lines ~75-87 and ~226-245) so all map
accesses consistently use the namespace/name identifier.
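The collision described above is easy to reproduce. A sketch with a stand-in struct (`reservation` is hypothetical; in the real code `types.NamespacedName` or `client.ObjectKeyFromObject` would produce the same "namespace/name" key):

```go
package main

import "fmt"

// reservation is a minimal stand-in for v1alpha1.Reservation metadata.
type reservation struct{ Namespace, Name, Host string }

// key builds the namespaced identifier the review recommends,
// mirroring types.NamespacedName.String().
func key(r reservation) string { return r.Namespace + "/" + r.Name }

func main() {
	// Two reservations that share a Name but live in different namespaces.
	a := reservation{Namespace: "team-a", Name: "failover-vm1", Host: "node-1"}
	b := reservation{Namespace: "team-b", Name: "failover-vm1", Host: "node-2"}

	byName := map[string]string{} // buggy: collapses by Name alone
	byKey := map[string]string{}  // fixed: namespace-qualified
	for _, r := range []reservation{a, b} {
		byName[r.Name] = r.Host
		byKey[key(r)] = r.Host
	}
	fmt.Println(len(byName), len(byKey)) // 1 2 — the name-only map silently lost an entry
}
```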
- Around line 271-280: DoesVMFitInReservation currently checks for
VM/Reservation CPU under the "vcpus" resource key but buildFailoverReservation
stores CPUs under "cpu" (and ValidateFailoverReservationResources expects
"vcpus"), causing created failover reservations to be rejected; update
DoesVMFitInReservation to tolerate both keys (check reservation.Spec.Resources
for "vcpus" first, then fallback to "cpu", or normalize the reservation
resources to the canonical "vcpus" key before comparison) so that
FindEligibleReservations can match reservations created by
buildFailoverReservation without changing validation behavior in
ValidateFailoverReservationResources.
In `@internal/scheduling/reservations/failover/reservation_scheduling.go`:
- Around line 166-203: The validateVMViaSchedulerEvacuation function is using a
separate vmCurrentHost parameter which can be stale; remove that parameter,
update the function signature of validateVMViaSchedulerEvacuation to only accept
(ctx context.Context, vm VM, reservationHost string), and change the
construction of scheduleReq.IgnoreHosts to use vm.CurrentHypervisor directly
(i.e., IgnoreHosts: []string{vm.CurrentHypervisor}); then update the single
caller site to stop passing the allocation host and call the new signature.
Ensure callers and references to vmCurrentHost are removed and rebuild usages of
reservations.ScheduleReservationRequest and any tests that referenced the old
parameter.
In `@internal/scheduling/reservations/failover/vm_source.go`:
- Around line 273-300: The loop over hypervisorList may append duplicate VM
entries when the same inst.ID is reported active on multiple hypervisors; modify
the code in the nested loop that builds VM (referencing hypervisorList,
hv.Status.Instances, vmDataByUUID and the VM struct) to maintain a local seen
map[string]bool (or set) of inst.ID values, skip and log any duplicate inst.ID
instead of appending, and ensure only the first trusted hypervisor location for
that UUID is added so calculateVMsMissingFailover() doesn't get duplicate
inputs.
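The suggested seen-map dedup, sketched with stand-in types (`instance` and `dedupeInstances` are hypothetical names; the real code iterates `hv.Status.Instances` and would log the duplicate rather than drop it silently):

```go
package main

import "fmt"

type instance struct{ ID, Hypervisor string }

// dedupeInstances keeps only the first hypervisor location reported for each
// instance UUID, so downstream bookkeeping never sees duplicate inputs.
func dedupeInstances(reported []instance) []instance {
	seen := make(map[string]bool, len(reported))
	out := make([]instance, 0, len(reported))
	for _, inst := range reported {
		if seen[inst.ID] {
			continue // duplicate report; in the controller this would be logged
		}
		seen[inst.ID] = true
		out = append(out, inst)
	}
	return out
}

func main() {
	vms := dedupeInstances([]instance{
		{ID: "uuid-1", Hypervisor: "hv-a"},
		{ID: "uuid-2", Hypervisor: "hv-a"},
		{ID: "uuid-1", Hypervisor: "hv-b"}, // stale duplicate from a second hypervisor
	})
	fmt.Println(len(vms), vms[0].Hypervisor) // 2 hv-a
}
```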
In `@tools/visualize-reservations/main.go`:
- Around line 234-271: The current filter builds `hypervisors` and `allVMs` from
only matching hosts, causing reservations referencing VMs on excluded hosts to
be treated as missing; instead, build an unfiltered VM lookup (e.g.,
`allVMsFull` or reuse `allVMs` built from `allHypervisors.Items`) that contains
every active VM from `allHypervisors`, then build a separate filtered
`hypervisors` (using `matchesFilter` and `filteredHosts`) only for rendering;
update reservation checks (the NotOnHypervisors path) to consult the unfiltered
VM map (`allVMsFull`) for existence and flavor info while keeping the filtered
list for view generation.
- Around line 1311-1314: The current connStr built with fmt.Sprintf can break if
user/password contain spaces, quotes, or backslashes; replace the manual
key=value string with a percent-encoded URL using net/url and net. Construct a
url.URL with Scheme "postgres", User set via url.UserPassword(username,
password), Host from net.JoinHostPort(host, port), Path set to the database
name, and RawQuery "sslmode=disable", then pass u.String() to
sql.Open("postgres", connStr). Add imports "net" and "net/url" and remove the
unsafe fmt.Sprintf-based connStr construction.
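The percent-encoded construction described above, as a runnable sketch (`buildConnURL` is a hypothetical helper name; `url.URL.String()` handles the userinfo escaping):

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// buildConnURL percent-encodes credentials into a postgres:// URL so that
// passwords containing spaces, quotes, or '@' cannot corrupt the DSN the way
// a hand-built key=value string can.
func buildConnURL(user, password, host, port, dbname string) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(user, password), // escaped by URL.String()
		Host:     net.JoinHostPort(host, port),
		Path:     "/" + dbname,
		RawQuery: "sslmode=disable",
	}
	return u.String()
}

func main() {
	// A password that would break a fmt.Sprintf-built key=value DSN.
	fmt.Println(buildConnURL("nova", `p@ss w"ord`, "db.local", "5432", "nova"))
}
```

The resulting string can be passed straight to `sql.Open("postgres", ...)`, since `lib/pq` accepts URL-form connection strings.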
---
Nitpick comments:
In `@internal/scheduling/reservations/failover/reservation_eligibility_test.go`:
- Around line 613-698: Tests use Spec.Resources["vcpus"] but the controller
(buildFailoverReservation in helpers.go) writes CPU as Spec.Resources["cpu"], so
add test cases (or update existing reservation fixtures via
makeReservationWithResources) to include the "cpu" resource key (and/or include
both "cpu" and "vcpus" to be explicit) with the same quantities used now for
"vcpus"; ensure the cases that expect fit/fail check the "cpu" entry on the
reservation fixtures so the test suite covers the real resource shape produced
by buildFailoverReservation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 24d622ea-7bc5-444a-a001-f03d3269c18e
📒 Files selected for processing (25)
- .gitignore
- Makefile
- api/v1alpha1/reservation_types.go
- api/v1alpha1/zz_generated.deepcopy.go
- cmd/main.go
- docs/failover-reservations.md
- helm/bundles/cortex-nova/templates/datasources_kvm.yaml
- helm/bundles/cortex-nova/templates/pipelines_kvm.yaml
- helm/library/cortex/files/crds/cortex.cloud_reservations.yaml
- internal/scheduling/external/nova.go
- internal/scheduling/external/postgres.go
- internal/scheduling/nova/plugins/filters/filter_has_enough_capacity.go
- internal/scheduling/reservations/controller/controller.go
- internal/scheduling/reservations/failover/config.go
- internal/scheduling/reservations/failover/controller.go
- internal/scheduling/reservations/failover/controller_test.go
- internal/scheduling/reservations/failover/helpers.go
- internal/scheduling/reservations/failover/integration_test.go
- internal/scheduling/reservations/failover/reservation_eligibility.go
- internal/scheduling/reservations/failover/reservation_eligibility_test.go
- internal/scheduling/reservations/failover/reservation_scheduling.go
- internal/scheduling/reservations/failover/reservation_scheduling_test.go
- internal/scheduling/reservations/failover/vm_source.go
- internal/scheduling/reservations/scheduler_client.go
- tools/visualize-reservations/main.go
docs/failover-reservations.md
> **Why:** Periodically we need to verify that a VM could still evacuate to its reserved host. This sends an evacuation-style scheduling request with only the reservation's host as the eligible target. If the scheduler rejects it, the reservation is no longer valid and
Complete the truncated acknowledgment pipeline description.
At Line 159 the sentence ends mid-thought, leaving behavior ambiguous.
Suggested fix
```diff
-host as the eligible target. If the scheduler rejects it, the reservation is no longer valid and
+host as the eligible target. If the scheduler rejects it, the reservation is no longer valid and
+should be deleted.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested replacement:

> **Why:** Periodically we need to verify that a VM could still evacuate to its reserved host. This sends an evacuation-style scheduling request with only the reservation's host as the eligible target. If the scheduler rejects it, the reservation is no longer valid and should be deleted.
🧰 Tools
🪛 LanguageTool
[style] ~159-~159: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ...onciliation). Why: Periodically we need to verify that a VM could still evacuate t...
(REP_NEED_TO_VB)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/failover-reservations.md` around lines 159 - 160, Finish the truncated
sentence describing the periodic evacuation check: complete the acknowledgment
pipeline description by stating that if the scheduler rejects the
evacuation-style scheduling request (which targets only the reservation's host),
the reservation should be marked invalid, the acknowledgment pipeline should
notify the reservation owner/clients of the failure, and trigger any cleanup or
reallocation steps (e.g., release the reserved host and update reservation
state) so downstream systems can respond.
```go
// notFailoverReservationPredicate filters out failover reservations at the watch level.
// This prevents the controller from being notified about failover reservations,
// which are managed by the separate failover controller.
// Failover reservations are identified by the label v1alpha1.LabelReservationType.
var notFailoverReservationPredicate = predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool {
		res, ok := e.Object.(*v1alpha1.Reservation)
		if !ok {
			return false
		}
		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
	},
	UpdateFunc: func(e event.UpdateEvent) bool {
		res, ok := e.ObjectNew.(*v1alpha1.Reservation)
		if !ok {
			return false
		}
		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
	},
	DeleteFunc: func(e event.DeleteEvent) bool {
		res, ok := e.Object.(*v1alpha1.Reservation)
		if !ok {
			return false
		}
		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
	},
	GenericFunc: func(e event.GenericEvent) bool {
		res, ok := e.Object.(*v1alpha1.Reservation)
		if !ok {
			return false
		}
		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
	},
```
Predicate should not rely only on the label for failover routing.
At Lines 483/490/497/504, classification is label-only. Failover reservations with missing/legacy labels can be reconciled by this non-failover controller. Use spec.type as primary, with label fallback for compatibility.
Suggested fix
```diff
+const legacyReservationTypeLabelKey = "reservations.cortex.sap.com/type"
+
+func isFailoverReservation(res *v1alpha1.Reservation) bool {
+	if res == nil {
+		return false
+	}
+	if res.Spec.Type == v1alpha1.ReservationTypeFailover {
+		return true
+	}
+	if res.Labels[v1alpha1.LabelReservationType] == v1alpha1.ReservationTypeLabelFailover {
+		return true
+	}
+	return res.Labels[legacyReservationTypeLabelKey] == v1alpha1.ReservationTypeLabelFailover
+}
+
 var notFailoverReservationPredicate = predicate.Funcs{
 	CreateFunc: func(e event.CreateEvent) bool {
 		res, ok := e.Object.(*v1alpha1.Reservation)
 		if !ok {
 			return false
 		}
-		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
+		return !isFailoverReservation(res)
 	},
 	UpdateFunc: func(e event.UpdateEvent) bool {
 		res, ok := e.ObjectNew.(*v1alpha1.Reservation)
 		if !ok {
 			return false
 		}
-		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
+		return !isFailoverReservation(res)
 	},
 	DeleteFunc: func(e event.DeleteEvent) bool {
 		res, ok := e.Object.(*v1alpha1.Reservation)
 		if !ok {
 			return false
 		}
-		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
+		return !isFailoverReservation(res)
 	},
 	GenericFunc: func(e event.GenericEvent) bool {
 		res, ok := e.Object.(*v1alpha1.Reservation)
 		if !ok {
 			return false
 		}
-		return res.Labels[v1alpha1.LabelReservationType] != v1alpha1.ReservationTypeLabelFailover
+		return !isFailoverReservation(res)
 	},
 }
```

Also applies to: line 520
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/controller/controller.go` around lines 473 -
505, The predicate notFailoverReservationPredicate currently uses only labels to
detect failover reservations; change each predicate function (CreateFunc,
UpdateFunc, DeleteFunc, GenericFunc) to first check res.Spec.Type for the
failover type (e.g. res.Spec.Type == v1alpha1.ReservationTypeFailover) and treat
that as failover, and only if Spec.Type is empty/unknown fall back to the
existing label check (res.Labels[v1alpha1.LabelReservationType] ==
v1alpha1.ReservationTypeLabelFailover); invert the result so the predicate
returns false for failover based on Spec.Type or label. Apply the same
Spec-first logic to the other similar predicate referenced near the same block.
```go
// Check CPU: VM's vcpus must be <= reservation's vcpus
// Note: Both VM and Reservation use "vcpus" as the resource key
if vmVCPUs, ok := vm.Resources["vcpus"]; ok {
	if resVCPUs, ok := reservation.Spec.Resources["vcpus"]; ok {
		if vmVCPUs.Cmp(resVCPUs) > 0 {
			return false // VM vcpus exceeds reservation vcpus
		}
	} else {
		return false // Reservation has no vcpus resource defined
	}
```
`DoesVMFitInReservation` is reading a different CPU key than the builder writes.
`buildFailoverReservation()` stores reservation CPU under `cpu`, and `ValidateFailoverReservationResources()` rejects `vcpus`. This branch therefore rejects reservations created by the new failover path, so `FindEligibleReservations()` will not reuse them.
Backward-compatible lookup

```diff
 	// Check CPU: VM's vcpus must be <= reservation's vcpus
-	// Note: Both VM and Reservation use "vcpus" as the resource key
+	// Reservation resources now use "cpu"; keep "vcpus" as a temporary fallback if older objects still exist.
 	if vmVCPUs, ok := vm.Resources["vcpus"]; ok {
-		if resVCPUs, ok := reservation.Spec.Resources["vcpus"]; ok {
-			if vmVCPUs.Cmp(resVCPUs) > 0 {
+		resCPU, ok := reservation.Spec.Resources["cpu"]
+		if !ok {
+			resCPU, ok = reservation.Spec.Resources["vcpus"]
+		}
+		if ok {
+			if vmVCPUs.Cmp(resCPU) > 0 {
 				return false // VM vcpus exceeds reservation vcpus
 			}
 		} else {
 			return false // Reservation has no vcpus resource defined
 		}
```

📝 Committable suggestion
```go
// Check CPU: VM's vcpus must be <= reservation's vcpus
// Reservation resources now use "cpu"; keep "vcpus" as a temporary fallback if older objects still exist.
if vmVCPUs, ok := vm.Resources["vcpus"]; ok {
	resCPU, ok := reservation.Spec.Resources["cpu"]
	if !ok {
		resCPU, ok = reservation.Spec.Resources["vcpus"]
	}
	if ok {
		if vmVCPUs.Cmp(resCPU) > 0 {
			return false // VM vcpus exceeds reservation vcpus
		}
	} else {
		return false // Reservation has no vcpus resource defined
	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/failover/reservation_eligibility.go` around
lines 271 - 280, DoesVMFitInReservation currently checks for VM/Reservation CPU
under the "vcpus" resource key but buildFailoverReservation stores CPUs under
"cpu" (and ValidateFailoverReservationResources expects "vcpus"), causing
created failover reservations to be rejected; update DoesVMFitInReservation to
tolerate both keys (check reservation.Spec.Resources for "vcpus" first, then
fallback to "cpu", or normalize the reservation resources to the canonical
"vcpus" key before comparison) so that FindEligibleReservations can match
reservations created by buildFailoverReservation without changing validation
behavior in ValidateFailoverReservationResources.
```go
func (c *FailoverReservationController) validateVMViaSchedulerEvacuation(
	ctx context.Context,
	vm VM,
	reservationHost string,
	vmCurrentHost string,
) (bool, error) {
	// Get memory and vcpus from VM resources
	var memoryMB uint64
	var vcpus uint64
	if memory, ok := vm.Resources["memory"]; ok {
		memoryMB = uint64(memory.Value() / (1024 * 1024)) //nolint:gosec // memory values won't overflow
	}
	if vcpusRes, ok := vm.Resources["vcpus"]; ok {
		vcpus = uint64(vcpusRes.Value()) //nolint:gosec // vcpus values won't overflow
	}

	// Build flavor extra specs from VM's extra specs
	flavorExtraSpecs := make(map[string]string)
	for k, v := range vm.FlavorExtraSpecs {
		flavorExtraSpecs[k] = v
	}
	if _, ok := flavorExtraSpecs["capabilities:hypervisor_type"]; !ok {
		flavorExtraSpecs["capabilities:hypervisor_type"] = "qemu"
	}

	// Build a single-host request to validate the VM can use the reservation host
	scheduleReq := reservations.ScheduleReservationRequest{
		InstanceUUID:     "validate-" + vm.UUID,
		ProjectID:        vm.ProjectID,
		FlavorName:       vm.FlavorName,
		FlavorExtraSpecs: flavorExtraSpecs,
		MemoryMB:         memoryMB,
		VCPUs:            vcpus,
		EligibleHosts:    []api.ExternalSchedulerHost{{ComputeHost: reservationHost}},
		IgnoreHosts:      []string{vmCurrentHost},
		Pipeline:         PipelineAcknowledgeFailoverReservation,
		AvailabilityZone: vm.AvailabilityZone,
	}
```
Derive the ignored source host from vm.CurrentHypervisor.
This helper takes a second “current host” argument even though VM already carries that field. The caller currently passes the stored allocation host, so after a VM move the scheduler can validate against the wrong source host and incorrectly keep a reservation on the VM’s actual hypervisor. Drop the extra parameter and use vm.CurrentHypervisor directly here.
🛠️ Proposed fix
func (c *FailoverReservationController) validateVMViaSchedulerEvacuation(
ctx context.Context,
vm VM,
reservationHost string,
- vmCurrentHost string,
) (bool, error) {
@@
scheduleReq := reservations.ScheduleReservationRequest{
InstanceUUID: "validate-" + vm.UUID,
ProjectID: vm.ProjectID,
FlavorName: vm.FlavorName,
FlavorExtraSpecs: flavorExtraSpecs,
MemoryMB: memoryMB,
VCPUs: vcpus,
EligibleHosts: []api.ExternalSchedulerHost{{ComputeHost: reservationHost}},
- IgnoreHosts: []string{vmCurrentHost},
+ IgnoreHosts: []string{vm.CurrentHypervisor},
Pipeline: PipelineAcknowledgeFailoverReservation,
AvailabilityZone: vm.AvailabilityZone,
}
@@
log.V(1).Info("validating VM via scheduler evacuation",
"vmUUID", vm.UUID,
"reservationHost", reservationHost,
- "vmCurrentHost", vmCurrentHost,
+ "vmCurrentHost", vm.CurrentHypervisor,
"pipeline", PipelineAcknowledgeFailoverReservation)Update the single caller to stop passing the stale allocation host.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/failover/reservation_scheduling.go` around
lines 166 - 203, The validateVMViaSchedulerEvacuation function is using a
separate vmCurrentHost parameter which can be stale; remove that parameter,
update the function signature of validateVMViaSchedulerEvacuation to only accept
(ctx context.Context, vm VM, reservationHost string), and change the
construction of scheduleReq.IgnoreHosts to use vm.CurrentHypervisor directly
(i.e., IgnoreHosts: []string{vm.CurrentHypervisor}); then update the single
caller site to stop passing the allocation host and call the new signature.
Ensure callers and references to vmCurrentHost are removed and rebuild usages of
reservations.ScheduleReservationRequest and any tests that referenced the old
parameter.
| // Iterate through hypervisor instances | ||
| for _, hv := range hypervisorList.Items { | ||
| for _, inst := range hv.Status.Instances { | ||
| if !inst.Active { | ||
| continue | ||
| } | ||
|
| // Look up VM data from postgres | ||
| pgVM, existsInPostgres := vmDataByUUID[inst.ID] | ||
| if !existsInPostgres { | ||
| // VM is on hypervisor but not in postgres - skip (need postgres for flavor/size) | ||
| notInPostgresCount++ | ||
| continue | ||
| } | ||
|
| // Build VM with hypervisor location but postgres data (including AZ) | ||
| vm := VM{ | ||
| UUID: inst.ID, | ||
| FlavorName: pgVM.FlavorName, | ||
| ProjectID: pgVM.ProjectID, | ||
| CurrentHypervisor: hv.Name, // Use hypervisor CRD location, not postgres | ||
| AvailabilityZone: pgVM.AvailabilityZone, | ||
| Resources: pgVM.Resources, | ||
| FlavorExtraSpecs: pgVM.FlavorExtraSpecs, | ||
| } | ||
| result = append(result, vm) | ||
| enrichedCount++ | ||
| } |
Deduplicate UUIDs when trusting hypervisor location.
If the hypervisor CRDs transiently report the same active instance on two hosts, this loop appends two VM entries with the same UUID. calculateVMsMissingFailover() will then treat them as separate inputs and can create extra reservations for one instance. Track inst.ID here and skip/log duplicates.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/scheduling/reservations/failover/vm_source.go` around lines 273 -
300, The loop over hypervisorList may append duplicate VM entries when the same
inst.ID is reported active on multiple hypervisors; modify the code in the
nested loop that builds VM (referencing hypervisorList, hv.Status.Instances,
vmDataByUUID and the VM struct) to maintain a local seen map[string]bool (or
set) of inst.ID values, skip and log any duplicate inst.ID instead of appending,
and ensure only the first trusted hypervisor location for that UUID is added so
calculateVMsMissingFailover() doesn't get duplicate inputs.
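The seen-map guard can be sketched as follows; `instance`, `hypervisor`, and `collectActiveVMs` are illustrative stand-ins for the CRD types, not the actual code:

```go
package main

import "fmt"

type instance struct {
	ID     string
	Active bool
}

type hypervisor struct {
	Name      string
	Instances []instance
}

// collectActiveVMs keeps only the first hypervisor that reports each
// active instance ID; later duplicates are skipped and logged, so a
// transiently double-reported VM cannot produce two inputs downstream.
func collectActiveVMs(hvs []hypervisor) map[string]string {
	seen := make(map[string]string) // inst.ID -> first trusted host
	for _, hv := range hvs {
		for _, inst := range hv.Instances {
			if !inst.Active {
				continue
			}
			if first, dup := seen[inst.ID]; dup {
				fmt.Printf("skipping duplicate VM %s on %s (already on %s)\n",
					inst.ID, hv.Name, first)
				continue
			}
			seen[inst.ID] = hv.Name
		}
	}
	return seen
}

func main() {
	vms := collectActiveVMs([]hypervisor{
		{Name: "node001", Instances: []instance{{ID: "vm-a", Active: true}}},
		{Name: "node002", Instances: []instance{{ID: "vm-a", Active: true}, {ID: "vm-b", Active: true}}},
	})
	fmt.Println(len(vms), vms["vm-a"])
}
```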
| // Apply filters to hypervisors | ||
| var hypervisors hv1.HypervisorList | ||
| filteredHosts := make(map[string]bool) // Track which hosts pass the filter | ||
| for _, hv := range allHypervisors.Items { | ||
| if matchesFilter(hv, *filterName, *filterTrait) { | ||
| hypervisors.Items = append(hypervisors.Items, hv) | ||
| filteredHosts[hv.Name] = true | ||
| } | ||
| } | ||
|
| // Build map of all VMs from hypervisors | ||
| allVMs := make(map[string]*vmEntry) // vm_uuid -> vmEntry | ||
| for _, hv := range hypervisors.Items { | ||
| for _, inst := range hv.Status.Instances { | ||
| if inst.Active { | ||
| entry := &vmEntry{ | ||
| UUID: inst.ID, | ||
| Host: hv.Name, | ||
| Reservations: []string{}, | ||
| } | ||
| // Check if VM is in server table and get flavor info | ||
| if serverMap != nil { | ||
| if server, ok := serverMap[inst.ID]; ok { | ||
| entry.InServerTable = true | ||
| entry.FlavorName = server.FlavorName | ||
| if flavorMap != nil { | ||
| if flavor, ok := flavorMap[server.FlavorName]; ok { | ||
| entry.VCPUs = flavor.VCPUs | ||
| entry.RAMMb = flavor.RAMMb | ||
| entry.DiskGb = flavor.DiskGb | ||
| } | ||
| } | ||
| } | ||
| } | ||
| allVMs[inst.ID] = entry | ||
| } | ||
| } | ||
| } |
Filtering hypervisors here produces false STALE VMs later.
When --filter-* is set, allVMs only contains VMs whose current host matches the filter. Reservations on the filtered hosts can still reference VMs running elsewhere, and those allocations then fall into the NotOnHypervisors path at Lines 539-563 even though the VM exists on an excluded host. Keep an unfiltered lookup for existence checks, and apply the filter only when rendering views.
Also applies to: 528-563
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/visualize-reservations/main.go` around lines 234 - 271, The current
filter builds `hypervisors` and `allVMs` from only matching hosts, causing
reservations referencing VMs on excluded hosts to be treated as missing;
instead, build an unfiltered VM lookup (e.g., `allVMsFull` or reuse `allVMs`
built from `allHypervisors.Items`) that contains every active VM from
`allHypervisors`, then build a separate filtered `hypervisors` (using
`matchesFilter` and `filteredHosts`) only for rendering; update reservation
checks (the NotOnHypervisors path) to consult the unfiltered VM map
(`allVMsFull`) for existence and flavor info while keeping the filtered list for
view generation.
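A minimal sketch of the suggested split, with hypothetical `host` and `buildVMMaps` names: the unfiltered map answers existence checks (so reservations referencing VMs on excluded hosts are not flagged stale), while the filtered map drives rendering only:

```go
package main

import "fmt"

type host struct {
	Name string
	VMs  []string
}

// buildVMMaps returns two lookups: `all` covers every active VM for
// staleness/existence checks, `filtered` covers only hosts matching the
// filter and is used for view generation.
func buildVMMaps(hosts []host, match func(host) bool) (all, filtered map[string]string) {
	all = make(map[string]string)
	filtered = make(map[string]string)
	for _, h := range hosts {
		for _, vm := range h.VMs {
			all[vm] = h.Name
			if match(h) {
				filtered[vm] = h.Name
			}
		}
	}
	return all, filtered
}

func main() {
	hosts := []host{
		{Name: "kvm-a", VMs: []string{"vm-1"}},
		{Name: "vmware-b", VMs: []string{"vm-2"}},
	}
	all, filtered := buildVMMaps(hosts, func(h host) bool { return h.Name == "kvm-a" })
	// vm-2 is excluded from the view but still exists, so a reservation
	// referencing it must not be treated as missing.
	_, exists := all["vm-2"]
	_, rendered := filtered["vm-2"]
	fmt.Println(exists, rendered)
}
```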
| connStr := fmt.Sprintf("host=%s port=%s user=%s password=%s dbname=%s sslmode=disable", | ||
| host, port, user, password, database) | ||
|
| db, err := sql.Open("postgres", connStr) |
🌐 Web query:
lib/pq postgres DSN format unescaped password spaces quotes
💡 Result:
With lib/pq you have two DSN syntaxes; unescaped spaces will break parsing in both.
1) key=value (conninfo) DSN
- If a value contains whitespace (e.g., in `password`), wrap it in single quotes.
- Inside a quoted value, escape single quotes and backslashes with a backslash: `\'` and `\\`.

Examples (from pq/libpq rules): `user=pqgo password='with spaces'`, `user=space\ man password='it\'s valid'` (note: the `\` escapes the space in the unquoted username) [1]

These quoting/escaping rules match libpq conninfo: values with spaces must be single-quoted; `\'` and `\\` are the escapes. [2]
2) URL DSN (postgres://user:pass@host/db?...)
In the URL form, use percent-encoding for characters with special meaning in URIs (including spaces and quotes). libpq explicitly allows percent-encoding in any URI part. [2]
Examples:
- space → `%20`
- `'` → `%27`
- `"` → `%22` [2]

Practical tip: if your password contains spaces/quotes, the simplest is usually the key=value form with `password='...'` and `\'` / `\\` as needed. [1][2]
Sources:
[1] lib/pq DSN docs (pkg.go.dev)
[2] PostgreSQL libpq connection string / URI docs (postgresql.org)
🌐 Web query:
golang pq postgres connection string safe construction url.URL
💡 Result:
To safely construct a lib/pq Postgres DSN with Go’s url.URL, build the URL from structured fields (so reserved characters in usernames/passwords/query values get percent-encoded), then pass u.String() to sql.Open.
import (
"net"
"net/url"
)
func PQURL(host string, port string, dbname string, user string, pass string, sslmode string) (string, error) {
u := &url.URL{
Scheme: "postgres", // or "postgresql" (pq supports both)
User: url.UserPassword(user, pass),
Host: net.JoinHostPort(host, port),
Path: "/" + dbname,
}
q := url.Values{}
if sslmode != "" {
q.Set("sslmode", sslmode)
}
u.RawQuery = q.Encode()
return u.String(), nil
}

Key points:
- `pq` accepts URL-style DSNs: `postgres[ql]://[user[:pwd]@][net-location][:port][/dbname][?param1=value1&...]`. [1]
- Use `url.Values` + `Encode()` to construct `RawQuery` safely (it escapes and sorts parameters). [2]
- Prefer `net.JoinHostPort` for `Host` so IPv6 hosts are formatted correctly.
- `net/url` is stricter than PostgreSQL/libpq for Unix-socket URLs: `postgresql://%2Ftmp%2Fpostgres/db` won't work with pq via Go URL parsing; use the keyword DSN form instead: `host=/tmp/postgres dbname=db`. [1]
- Be aware `url.UserPassword(...)` comes with a security warning (userinfo-in-URLs is "NOT RECOMMENDED"); consider alternative secret handling where possible. [3]
Sources:
[1] lib/pq “URL connection strings” docs. (pkg.go.dev)
[2] Go net/url Values.Encode docs. (pkg.go.dev)
[3] Go net/url UserPassword docs (security warning). (pkg.go.dev)
Citations:
- 1: https://pkg.go.dev/github.com/lib/pq
- 2: https://pkg.go.dev/net/url%40go1.25.3
- 3: https://pkg.go.dev/net/url%40go1.25.3
Escape secret values before building the lib/pq DSN.
lib/pq's conninfo format (key=value) treats unescaped whitespace, quotes, and backslashes as special characters. Passwords or usernames containing these characters can fail to parse or be split into unexpected options, breaking database connectivity.
Use url.URL with url.UserPassword() and net.JoinHostPort() to properly percent-encode all credential components:
Safer DSN construction
- connStr := fmt.Sprintf("host=%s port=%s user=%s password=%s dbname=%s sslmode=disable",
- host, port, user, password, database)
-
- db, err := sql.Open("postgres", connStr)
+ dsn := (&url.URL{
+ Scheme: "postgres",
+ User: url.UserPassword(user, password),
+ Host: net.JoinHostPort(host, port),
+ Path: database,
+ }).String() + "?sslmode=disable"
+
+ db, err := sql.Open("postgres", dsn)

Add "net" and "net/url" to the imports.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/visualize-reservations/main.go` around lines 1311 - 1314, The current
connStr built with fmt.Sprintf can break if user/password contain spaces,
quotes, or backslashes; replace the manual key=value string with a
percent-encoded URL using net/url and net. Construct a url.URL with Scheme
"postgres", User set via url.UserPassword(username, password), Host from
net.JoinHostPort(host, port), Path set to the database name, and RawQuery
"sslmode=disable", then pass u.String() to sql.Open("postgres", connStr). Add
imports "net" and "net/url" and remove the unsafe fmt.Sprintf-based connStr
construction.
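The same idea as a runnable sketch; `buildDSN` is a hypothetical helper, relying on the fact that lib/pq accepts URL-style DSNs, so percent-encoded credentials survive intact:

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// buildDSN percent-encodes each credential component so passwords with
// spaces, quotes, or backslashes cannot break DSN parsing.
func buildDSN(user, password, host, port, database string) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(user, password),
		Host:     net.JoinHostPort(host, port),
		Path:     "/" + database,
		RawQuery: url.Values{"sslmode": {"disable"}}.Encode(),
	}
	return u.String()
}

func main() {
	// A password with a space is encoded as %20 instead of splitting the DSN.
	fmt.Println(buildDSN("cortex", "p w'd", "localhost", "5432", "reservations"))
}
```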
PhilippMatthes
left a comment
There was a problem hiding this comment.
Super cool! Sorry if my comments are mainly surface-level, but I'm struggling to do a deep-dive right now. Probably just because its so much code.
| // NewPostgresReaderFromSecretRef creates a new PostgresReader with a direct secret reference. | ||
| func NewPostgresReaderFromSecretRef(c client.Client, secretRef corev1.SecretReference) *PostgresReader { | ||
| return &PostgresReader{ | ||
| Client: c, | ||
| DatabaseSecretRef: secretRef, | ||
| } | ||
| } |
Seems to not be used anywhere.
| func NewPostgresReader(ctx context.Context, c client.Client, datasourceName string) (*PostgresReader, error) { | ||
| // Look up the Datasource CRD to get the database secret reference | ||
| datasource := &v1alpha1.Datasource{} | ||
| if err := c.Get(ctx, client.ObjectKey{Name: datasourceName}, datasource); err != nil { | ||
| return nil, fmt.Errorf("failed to get datasource %s: %w", datasourceName, err) | ||
| } | ||
|
| return &PostgresReader{ | ||
| Client: c, | ||
| DatabaseSecretRef: datasource.Spec.DatabaseSecretRef, | ||
| }, nil | ||
| } |
We need to really make sure this is a soft dependency in the scheduling controller manager, so a failing database connection doesn't lead to the scheduling controller manager failing to handle nova external scheduler requests. I have checked the main function, and it seems like the connection is lazily initialized, meaning there shouldn't be an issue during manager initialization. What we could do though, to really make sure the nova scheduling manager is not impacted, is to separate the reservation controllers out into a third deployment. Wdyt?
| } | ||
|
| // Select executes a SELECT query and returns the results. | ||
| func (r *PostgresReader) Select(ctx context.Context, dest interface{}, query string, args ...interface{}) error { |
You can use `any` instead of `interface{}`. Since Go 1.18, `any` is a built-in alias for `interface{}`, so `interface{}` reads as outdated Go.
| // Skip hypervisors with nil or empty capacity. | ||
| if len(hv.Status.Capacity) == 0 { | ||
| traceLog.Debug("skipping hypervisor with nil or empty capacity", "host", hv.Name) | ||
| continue | ||
| } |
Nicely defensive. We should bubble this up to Info.
| func getFailoverAllocations(res *v1alpha1.Reservation) map[string]string { | ||
| if res.Status.FailoverReservation == nil || res.Status.FailoverReservation.Allocations == nil { | ||
| return map[string]string{} | ||
| } | ||
| return res.Status.FailoverReservation.Allocations | ||
| } |
var mymap map[string]string // this is nil
_, ok := mymap["bla"] // access works
if !ok {
fmt.Println("not there, but access worked")
}
for range mymap {
// iterating works
}
mymap["bla"] = "bla" // assignment doesn't workFrom all the usages of this function I see none that would need initializing the map, i.e. that uses an assignment to the returned allocations map. So this helper probably isn't needed.
var mymap map[string]string // this is nil
_, ok := mymap["bla"] // access works
if !ok {
fmt.Println("not there, but access worked")
}
🤯
| reservationToVMs map[string]map[string]bool, | ||
| reservationToHost map[string]string, | ||
| ) { | ||
|
You could make a
type DependencyGraph struct {
// ...
}

to nicely encapsulate your data structures.
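A minimal sketch of such an encapsulating type, with illustrative field and method names (not the controller's actual API):

```go
package main

import "fmt"

// DependencyGraph bundles the reservation/VM lookup maps behind one type,
// so callers pass a single value instead of three parallel maps.
type DependencyGraph struct {
	vmToReservations  map[string]map[string]bool
	reservationToVMs  map[string]map[string]bool
	reservationToHost map[string]string
}

func NewDependencyGraph() *DependencyGraph {
	return &DependencyGraph{
		vmToReservations:  map[string]map[string]bool{},
		reservationToVMs:  map[string]map[string]bool{},
		reservationToHost: map[string]string{},
	}
}

// Link records that vmUUID is allocated in reservation on host,
// initializing nested maps on first use.
func (g *DependencyGraph) Link(vmUUID, reservation, host string) {
	if g.vmToReservations[vmUUID] == nil {
		g.vmToReservations[vmUUID] = map[string]bool{}
	}
	if g.reservationToVMs[reservation] == nil {
		g.reservationToVMs[reservation] = map[string]bool{}
	}
	g.vmToReservations[vmUUID][reservation] = true
	g.reservationToVMs[reservation][vmUUID] = true
	g.reservationToHost[reservation] = host
}

func (g *DependencyGraph) HostOf(reservation string) string {
	return g.reservationToHost[reservation]
}

func main() {
	g := NewDependencyGraph()
	g.Link("vm-1", "res-a", "node001")
	fmt.Println(g.HostOf("res-a"), len(g.reservationToVMs["res-a"]))
}
```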
| ensureVMInMaps := func(vmUUID string) { | ||
| if vmToReservations[vmUUID] == nil { | ||
| vmToReservations[vmUUID] = make(map[string]bool) | ||
| } | ||
| if vmToReservationHosts[vmUUID] == nil { | ||
| vmToReservationHosts[vmUUID] = make(map[string]bool) | ||
| } | ||
| } |
Indeed, there is no defaultdict in golang. Nonetheless, I would still tend towards just writing the if statement multiple times. Otherwise your code becomes really hard to understand because you need to jump back and forth.
|
| // DoesVMFitInReservation checks if a VM's resources fit within a reservation's resources. | ||
| // Returns true if all VM resources are less than or equal to the reservation's resources. | ||
| func DoesVMFitInReservation(vm VM, reservation v1alpha1.Reservation) bool { |
This could be doesVMFitInReservation
| *.so | ||
| *.dylib | ||
| *.tgz | ||
|
This is a good addition I'm always thinking of but immediately forgetting to commit
| kind: Pipeline | ||
| metadata: | ||
| name: kvm-valid-host-new-failover-reservation | ||
| spec: | ||
| schedulingDomain: nova | ||
| description: | | ||
| This pipeline is used by the failover reservation controller to find a suitable | ||
| host for creating a new failover reservation. It validates host compatibility | ||
| and checks capacity to ensure the host can accommodate the reservation. | ||
|
| Use case: When a VM needs failover protection and no existing reservation can | ||
| be reused, this pipeline finds the best host for a new reservation. |
Why do we need this additional pipeline and can't just push it through the regular one(s)? We want to do weighing on those as well, no?
Good point this one is probably not needed
I switched to kvm-general-purpose-load-balancing-all-filters-enabled for now. But now I remember why I added a pipeline here: the code so far has no logic for switching between the HANA / general-purpose pipelines.
Test Coverage Report
Test Coverage 📊: 66.9%
🧹 Nitpick comments (7)
internal/scheduling/reservations/failover/reservation_eligibility_test.go (2)
1263-1264: Unused field `vmHostMap` in TestFindEligibleReservations.

Similar to TestIsVMEligibleForReservation, the `vmHostMap` field is declared but never used in the test execution at Line 1350.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/reservation_eligibility_test.go` around lines 1263 - 1264, The struct field vmHostMap declared in TestFindEligibleReservations is unused; remove the vmHostMap field from the test case struct (or populate and use it like in TestIsVMEligibleForReservation if intended) to eliminate the dead field. Update the TestFindEligibleReservations test cases and any references inside the TestFindEligibleReservations function so only the utilized fields (e.g., expectedCount) remain, and run tests to ensure no references to vmHostMap remain.
119-120: Unused field `vmHostMap` in test cases.

The `vmHostMap` field is declared in test case structs but never used in the test execution (Line 603). This appears to be dead code from an earlier API design.

Remove unused field

 testCases := []struct {
 	name            string
 	vm              VM
 	reservation     v1alpha1.Reservation
-	vmHostMap       map[string]string
 	allReservations []v1alpha1.Reservation
 	expected        bool
 }{

Then remove all `vmHostMap` entries from test cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/reservation_eligibility_test.go` around lines 119 - 120, Remove the dead test field vmHostMap from the test case struct and all test case entries: locate the test cases in reservation_eligibility_test.go where vmHostMap is declared alongside allReservations, delete the vmHostMap field from the struct definition and remove any vmHostMap: ... entries in each test case literal so only used fields (e.g., allReservations) remain; run the tests to ensure no other references to vmHostMap exist and clean up any lingering unused imports or variables.

docs/failover-reservations.md (1)
126-133: Consider clarifying constraint 4 and 5 descriptions.

The constraint descriptions in the table are dense. Constraint 4 ("VMs sharing slots can't run on each other's hypervisors or slot hosts") and Constraint 5 could benefit from brief examples in a follow-up documentation update.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/failover-reservations.md` around lines 126 - 133, Update the table descriptions for "Constraint 4" and "Constraint 5" to be clearer and add a short concrete example for each: for Constraint 4 (currently "VMs sharing slots can't run on each other's hypervisors or slot hosts") replace with a concise explanation plus an example showing two VMs A and B that share slots and why A cannot failover to B's hypervisor or any hypervisor hosting B's reserved slot; for Constraint 5 (currently "No two VMs using a VM's slots can be on the same hypervisor") provide a clearer phrasing and an example illustrating two dependent VMs C and D that both rely on E's slots and therefore must be placed on distinct hypervisors to preserve evacuation capacity. Refer to the table entries labeled "Constraint 4" and "Constraint 5" when applying the edits.

internal/scheduling/reservations/failover/controller.go (3)
602-614: Unconventional error return order.

The function signature returns `(error, bool)` instead of the idiomatic Go pattern `(bool, error)`. This can confuse callers and doesn't match standard library conventions.

Proposed fix
-func (c *FailoverReservationController) reconcileCreateAndAssignReservations(
-    ...
-) (error, bool) {
+func (c *FailoverReservationController) reconcileCreateAndAssignReservations(
+    ...
+) (bool, error) {
     ...
-    return nil, hitMaxVMsLimit
+    return hitMaxVMsLimit, nil
 }

And update the caller at Line 350:
- err, hitMaxVMsLimit := c.reconcileCreateAndAssignReservations(...)
+ hitMaxVMsLimit, err := c.reconcileCreateAndAssignReservations(...)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/controller.go` around lines 602 - 614, The function reconcileCreateAndAssignReservations has its return values ordered as (error, bool), which is non-idiomatic in Go; change its signature to return (bool, error) and update its implementation to return hitMaxVMsLimit first and error (nil) second, and then update all call sites (where reconcileCreateAndAssignReservations is invoked) to handle the new return order so the boolean is captured first and the error second; keep the nolint and comment note about the always-nil error but adjust any variable names (e.g., hitMaxVMsLimit) used at call sites to match the new tuple ordering.
848-852: Single-concurrency may become a bottleneck at scale.
`MaxConcurrentReconciles: 1` ensures safety but means all failover reservation reconciliations are serialized. With many reservations, this could cause significant delays in acknowledgment.

Consider whether higher concurrency with more targeted locking (e.g., per-reservation) would be beneficial as the system scales.

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/controller.go` around lines 848 - 852, The controller currently forces single-threaded reconciliation via controller.Options{MaxConcurrentReconciles: 1} which serializes all failover reservations; change this by making MaxConcurrentReconciles a higher, configurable value (e.g., from a config flag) when calling controller.Options and Complete(c), and implement targeted per-reservation locking inside your reconciler (e.g., add a reservation lock registry on the controller/reconciler using a sync.Map or map[string]*sync.Mutex guarded by a global mutex, or use singleflight) so each reservation ID is reconciled exclusively while allowing different reservations to run concurrently; ensure locks are acquired early in the reconcile loop and always released (defer Unlock) to avoid deadlocks.
236-246: TODO: Invalidating entire reservation on single VM failure may be too aggressive.

The code acknowledges (Lines 236-238) that invalidating the entire reservation when one VM fails validation might be overly aggressive, especially for anti-affinity rule violations that may only affect specific VMs.
Would you like me to help design a more granular approach that removes only the failing VM from the reservation instead of deleting the entire reservation?
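A sketch of what the granular path could look like; `removeFailedVM` and the plain `map[string]string` allocations are illustrative stand-ins for the reservation status types, not the controller's actual API:

```go
package main

import "fmt"

// removeFailedVM drops only the failing VM's allocation instead of
// invalidating the whole reservation. It returns the remaining
// allocations (vm UUID -> host) and whether a removal happened.
func removeFailedVM(allocations map[string]string, vmUUID string) (map[string]string, bool) {
	if _, ok := allocations[vmUUID]; !ok {
		return allocations, false
	}
	out := make(map[string]string, len(allocations)-1)
	for uuid, host := range allocations {
		if uuid != vmUUID {
			out[uuid] = host
		}
	}
	return out, true
}

func main() {
	allocs := map[string]string{"vm-1": "node001", "vm-2": "node001"}
	remaining, removed := removeFailedVM(allocs, "vm-1")
	// The reservation stays valid for vm-2; only if remaining were empty
	// would the controller delete the reservation itself.
	fmt.Println(removed, len(remaining))
}
```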
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/reservations/failover/controller.go` around lines 236 - 246, The current code treats any VM validation failure (valid == false) as grounds to invalidate the entire reservation (res.Name); instead, modify the failure-handling branch so that when a single VM (vmUUID) fails validation you remove only that VM from the reservation and persist the change rather than returning false and dropping the whole reservation: locate the block that checks valid and replace the early return with logic that (1) removes the failing vmUUID from the reservation's VM list (e.g., res.Spec.VMs or res.VMs), (2) update the reservation via the controller's persistence call (use the same client/update helper the controller uses elsewhere), (3) emit a more specific log via log.Info including res.Name, vmUUID, vmCurrentHost, reservationHost, and the validation reason, and (4) continue processing (or return a success indication) so only the failing VM is excluded instead of the entire reservation.internal/scheduling/external/postgres.go (1)
74-83: Consider propagating context to database query execution.

The `Select` method receives a context but only uses it for `DB()` initialization; the underlying `database.Select()` call does not accept context. This means long-running queries cannot be cancelled. This limitation stems from `gorp.DbMap` (v2.2.0), which predates context support. To enable context-aware query cancellation, either refactor to use `db.Db.QueryContext()` directly on the underlying `sql.DB`, or migrate to a context-aware ORM. Note: this pattern is consistent across the codebase in multiple locations where `db.Select()` is called.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/scheduling/external/postgres.go` around lines 74 - 83, PostgresReader.Select currently ignores the passed ctx when calling database.Select, preventing cancellation; replace the call to database.Select with a context-aware execution on the underlying DB (e.g. get the gorp.DbMap from r.DB(ctx) and use its underlying *sql.DB to run QueryContext/QueryRowContext or use database.Db.QueryContext and then scan/scan into dest), or alternatively switch this method to use a context-aware ORM method if available; update PostgresReader.Select to call r.DB(ctx) then perform the query with QueryContext (or equivalent) and populate dest so the ctx is propagated to the DB call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6827bc99-3695-470b-a338-99cf21ae1bd1
📒 Files selected for processing (6)
- docs/failover-reservations.md
- internal/scheduling/external/postgres.go
- internal/scheduling/nova/plugins/filters/filter_has_enough_capacity.go
- internal/scheduling/reservations/failover/controller.go
- internal/scheduling/reservations/failover/reservation_eligibility.go
- internal/scheduling/reservations/failover/reservation_eligibility_test.go
✅ Files skipped from review due to trivial changes (1)
- internal/scheduling/reservations/failover/reservation_eligibility.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/scheduling/nova/plugins/filters/filter_has_enough_capacity.go
Summary by CodeRabbit
Release Notes
New Features
Documentation
Improvements