OCPBUGS-78191: Exclude disruption during NoExecuteTaintManager serial tests#30855
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughAdds a filtering step to disruption analysis that removes disruption intervals overlapping with known disruptive e2e tests (tests whose names include "NoExecuteTaintManager") before generating JUnit test results. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go`:
- Around line 254-257: The current Filter in finalIntervals.Filter is dropping
any error Interval entirely when utility.IntervalsOverlap finds any overlap with
entries in knownDisruptiveTests (and monitorapi.IsErrorEvent), which hides
partial real outages; modify the logic in disruption_invariant_adapter.go to,
for each overlapping disruptiveTest interval, trim or split the error Interval
rather than returning false: when disruptiveTest fully covers an error Interval
drop it, when it partially overlaps replace the error Interval with the
non-overlapping left and/or right sub-interval(s) (preserving start/end
timestamps and metadata), and ensure these trimmed intervals are kept in
finalIntervals instead of being filtered out; do this inside the same loop that
references knownDisruptiveTests/utility.IntervalsOverlap so subsequent
disruptive windows are applied to resulting pieces.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dcf0377b-8a41-42ea-a5b2-80814e82660c
📒 Files selected for processing (1)
pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go
pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go
Outdated
Show resolved
Hide resolved
|
@stbenjam: This pull request references Jira Issue OCPBUGS-78191, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira backport release-4.21,release-4.20,release-4.19 |
|
@stbenjam: The following backport issues have been created:
Queuing cherrypicks to the requested branches to be created after this PR merges: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Scheduling required tests: |
pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go
Show resolved
Hide resolved
The upstream Kubernetes NoExecuteTaintManager serial test applies NoExecute taints to worker nodes where its test pods are scheduled. This evicts all pods on those nodes that lack a matching toleration, including metrics-server replicas. When both metrics-server pods are evicted simultaneously, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 for ~25-30s until replacement pods become ready. With 3 worker nodes and metrics-server running 2 replicas on 2 of them, there is a ~22% chance (2/9) that both test pods land on the metrics-server nodes, causing both replicas to be evicted at once. The existing P99-based disruption threshold does not account for this because serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter this test. The result is a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage. This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations. Filter out disruption intervals that overlap with the NoExecuteTaintManager test window so this expected disruption is not counted against the threshold. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
e065a14 to
6dfa76e
Compare
|
@stbenjam: This pull request references Jira Issue OCPBUGS-78191, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: stbenjam, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by CI |
|
@stbenjam: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go (1)
237-262: Add focused tests for the suppression helper.This matching/removal rule is subtle enough that a small table-driven test would be worth it: matching vs non-matching
SourceE2ETest, overlap vs non-overlap, and a missingLocatorE2ETestKey.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go` around lines 237 - 262, Add a small table-driven unit test for filterOutKnownDisruptiveTestIntervals that exercises the key matching/removal rules: create cases for (1) Interval with Source == monitorapi.SourceE2ETest and Locator[monitorapi.LocatorE2ETestKey] containing "NoExecuteTaintManager" that overlaps a disruptive test interval and is an error (should be removed), (2) same but non-overlapping (should be kept), (3) Source != SourceE2ETest (should be kept), (4) missing LocatorE2ETestKey (should be kept), and (5) overlapping but not an error event (monitorapi.IsErrorEvent false, should be kept); use utility.IntervalsOverlap-compatible intervals and assert the returned monitorapi.Intervals matches expected kept/removed items for each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go`:
- Around line 237-262: Add a small table-driven unit test for
filterOutKnownDisruptiveTestIntervals that exercises the key matching/removal
rules: create cases for (1) Interval with Source == monitorapi.SourceE2ETest and
Locator[monitorapi.LocatorE2ETestKey] containing "NoExecuteTaintManager" that
overlaps a disruptive test interval and is an error (should be removed), (2)
same but non-overlapping (should be kept), (3) Source != SourceE2ETest (should
be kept), (4) missing LocatorE2ETestKey (should be kept), and (5) overlapping
but not an error event (monitorapi.IsErrorEvent false, should be kept); use
utility.IntervalsOverlap-compatible intervals and assert the returned
monitorapi.Intervals matches expected kept/removed items for each case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fc38f007-4530-4cea-8921-469aef1452b2
📒 Files selected for processing (1)
pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go
|
Scheduling required tests: |
|
/override ci/prow/e2e-aws-ovn-microshift |
|
@stbenjam: Overrode contexts on behalf of stbenjam: ci/prow/e2e-aws-ovn-microshift DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@stbenjam: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@stbenjam: Jira Issue Verification Checks: Jira Issue OCPBUGS-78191 Jira Issue OCPBUGS-78191 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@openshift-ci-robot: new pull request created: #30857 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@openshift-ci-robot: new pull request created: #30858 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@openshift-ci-robot: new pull request created: #30859 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
The upstream Kubernetes NoExecuteTaintManager serial test applies NoExecute taints to worker nodes where its test pods are scheduled. This evicts all pods on those nodes that lack a matching toleration, including metrics-server replicas. When both metrics-server pods are evicted simultaneously, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 for ~25-30s until replacement pods become ready.
With 3 worker nodes and metrics-server running 2 replicas on 2 of them, there is a ~22% chance (2/9) that both test pods land on the metrics-server nodes, causing both replicas to be evicted at once. The existing P99-based disruption threshold does not account for this because serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter this test. The result is a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage.
This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations. Filter out disruption intervals that overlap with the NoExecuteTaintManager test window so this expected disruption is not counted against the threshold.
Summary by CodeRabbit