Skip to content

OCPBUGS-78191: Exclude disruption during NoExecuteTaintManager serial tests#30855

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
stbenjam:NoExecuteTaintManager
Mar 11, 2026
Merged

OCPBUGS-78191: Exclude disruption during NoExecuteTaintManager serial tests#30855
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
stbenjam:NoExecuteTaintManager

Conversation

@stbenjam
Copy link
Member

@stbenjam stbenjam commented Mar 10, 2026

The upstream Kubernetes NoExecuteTaintManager serial test applies NoExecute taints to worker nodes where its test pods are scheduled. This evicts all pods on those nodes that lack a matching toleration, including metrics-server replicas. When both metrics-server pods are evicted simultaneously, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 for ~25-30s until replacement pods become ready.

With 3 worker nodes and metrics-server running 2 replicas on 2 of them, there is a ~22% chance (2/9) that both test pods land on the metrics-server nodes, causing both replicas to be evicted at once. The existing P99-based disruption threshold does not account for this because serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter this test. The result is a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage.

This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations. Filter out disruption intervals that overlap with the NoExecuteTaintManager test window so this expected disruption is not counted against the threshold.

Summary by CodeRabbit

  • Bug Fixes
    • Improved disruption analysis by excluding intervals that overlap known disruptive serial tests, reducing false positives in disruption detection.
    • Ensures subsequent test reporting uses the filtered intervals so generated test results reflect the adjusted analysis.

@openshift-ci-robot
Copy link

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 10, 2026

Walkthrough

Adds a filtering step to disruption analysis that removes disruption intervals overlapping with known disruptive e2e tests (tests whose names include "NoExecuteTaintManager") before generating JUnit test results.

Changes

Cohort / File(s) Summary
Disruption Interval Filtering
pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go
Adds filterOutKnownDisruptiveTestIntervals to identify disruptive test intervals by SourceE2ETest and test name, uses utility.IntervalsOverlap to remove overlapping error events, and updates EvaluateTestsFromConstructedIntervals to apply this filter before JUnit generation; imports utility.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: filtering out disruption during NoExecuteTaintManager serial tests, which matches the core functionality added to the codebase.
Stable And Deterministic Test Names ✅ Passed Pull request does not introduce any Ginkgo test definitions or dynamic test names.
Test Structure And Quality ✅ Passed PR modifies production code in disruption analysis library, not Ginkgo test code, making the test quality check inapplicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from deads2k and sjenning March 10, 2026 20:12
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go`:
- Around line 254-257: The current Filter in finalIntervals.Filter is dropping
any error Interval entirely when utility.IntervalsOverlap finds any overlap with
entries in knownDisruptiveTests (and monitorapi.IsErrorEvent), which hides
partial real outages; modify the logic in disruption_invariant_adapter.go to,
for each overlapping disruptiveTest interval, trim or split the error Interval
rather than returning false: when disruptiveTest fully covers an error Interval
drop it, when it partially overlaps replace the error Interval with the
non-overlapping left and/or right sub-interval(s) (preserving start/end
timestamps and metadata), and ensure these trimmed intervals are kept in
finalIntervals instead of being filtered out; do this inside the same loop that
references knownDisruptiveTests/utility.IntervalsOverlap so subsequent
disruptive windows are applied to resulting pieces.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dcf0377b-8a41-42ea-a5b2-80814e82660c

📥 Commits

Reviewing files that changed from the base of the PR and between fbba1c2 and e065a14.

📒 Files selected for processing (1)
  • pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go

@stbenjam stbenjam changed the title Exclude disruption during NoExecuteTaintManager serial tests OCPBUGS-78191: Exclude disruption during NoExecuteTaintManager serial tests Mar 10, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 10, 2026
@openshift-ci-robot
Copy link

@stbenjam: This pull request references Jira Issue OCPBUGS-78191, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The upstream Kubernetes NoExecuteTaintManager serial test applies NoExecute taints to worker nodes where its test pods are scheduled. This evicts all pods on those nodes that lack a matching toleration, including metrics-server replicas. When both metrics-server pods are evicted simultaneously, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 for ~25-30s until replacement pods become ready.

With 3 worker nodes and metrics-server running 2 replicas on 2 of them, there is a ~22% chance (2/9) that both test pods land on the metrics-server nodes, causing both replicas to be evicted at once. The existing P99-based disruption threshold does not account for this because serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter this test. The result is a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage.

This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations. Filter out disruption intervals that overlap with the NoExecuteTaintManager test window so this expected disruption is not counted against the threshold.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
  • Improved disruption analysis accuracy by filtering out known disruptive test intervals from disruption calculations, preventing false positives in disruption detection.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@stbenjam
Copy link
Member Author

/jira backport release-4.21,release-4.20,release-4.19

@openshift-ci-robot
Copy link

@stbenjam: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.21
/cherrypick release-4.20
/cherrypick release-4.19

Details

In response to this:

/jira backport release-4.21,release-4.20,release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@openshift-ci-robot: once the present PR merges, I will cherry-pick it on top of release-4.19, release-4.20, release-4.21 in new PRs and assign them to you.

Details

In response to this:

@stbenjam: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.21
/cherrypick release-4.20
/cherrypick release-4.19

In response to this:

/jira backport release-4.21,release-4.20,release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Copy link

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

The upstream Kubernetes NoExecuteTaintManager serial test applies
NoExecute taints to worker nodes where its test pods are scheduled.
This evicts all pods on those nodes that lack a matching toleration,
including metrics-server replicas. When both metrics-server pods are
evicted simultaneously, the metrics API (an aggregated API proxied
through kube-apiserver) returns 503 for ~25-30s until replacement
pods become ready.

With 3 worker nodes and metrics-server running 2 replicas on 2 of
them, there is a ~22% chance (2/9) that both test pods land on the
metrics-server nodes, causing both replicas to be evicted at once.
The existing P99-based disruption threshold does not account for
this because serial jobs are a small fraction of total runs in the
historical data, so the P99 is dominated by non-serial jobs that
never encounter this test. The result is a very low baseline (~0-1s)
with 5s of grace, which cannot absorb a 25-30s deterministic outage.

This is not a product bug — the NoExecuteTaintManager test is
intentionally designed to evict pods without tolerations. Filter out
disruption intervals that overlap with the NoExecuteTaintManager test
window so this expected disruption is not counted against the
threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@stbenjam stbenjam force-pushed the NoExecuteTaintManager branch from e065a14 to 6dfa76e Compare March 10, 2026 22:29
@openshift-ci-robot
Copy link

@stbenjam: This pull request references Jira Issue OCPBUGS-78191, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

The upstream Kubernetes NoExecuteTaintManager serial test applies NoExecute taints to worker nodes where its test pods are scheduled. This evicts all pods on those nodes that lack a matching toleration, including metrics-server replicas. When both metrics-server pods are evicted simultaneously, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 for ~25-30s until replacement pods become ready.

With 3 worker nodes and metrics-server running 2 replicas on 2 of them, there is a ~22% chance (2/9) that both test pods land on the metrics-server nodes, causing both replicas to be evicted at once. The existing P99-based disruption threshold does not account for this because serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter this test. The result is a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage.

This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations. Filter out disruption intervals that overlap with the NoExecuteTaintManager test window so this expected disruption is not counted against the threshold.

Summary by CodeRabbit

  • Bug Fixes
  • Improved disruption analysis by excluding intervals that overlap known disruptive serial tests, reducing false positives in disruption detection.
  • Ensures subsequent test reporting uses the filtered intervals so generated test results reflect the adjusted analysis.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Member

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 10, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stbenjam, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam
Copy link
Member Author

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 10, 2026
@openshift-ci-robot
Copy link

@stbenjam: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go (1)

237-262: Add focused tests for the suppression helper.

This matching/removal rule is subtle enough that a small table-driven test would be worth it: matching vs non-matching SourceE2ETest, overlap vs non-overlap, and a missing LocatorE2ETestKey.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go`
around lines 237 - 262, Add a small table-driven unit test for
filterOutKnownDisruptiveTestIntervals that exercises the key matching/removal
rules: create cases for (1) Interval with Source == monitorapi.SourceE2ETest and
Locator[monitorapi.LocatorE2ETestKey] containing "NoExecuteTaintManager" that
overlaps a disruptive test interval and is an error (should be removed), (2)
same but non-overlapping (should be kept), (3) Source != SourceE2ETest (should
be kept), (4) missing LocatorE2ETestKey (should be kept), and (5) overlapping
but not an error event (monitorapi.IsErrorEvent false, should be kept); use
utility.IntervalsOverlap-compatible intervals and assert the returned
monitorapi.Intervals matches expected kept/removed items for each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go`:
- Around line 237-262: Add a small table-driven unit test for
filterOutKnownDisruptiveTestIntervals that exercises the key matching/removal
rules: create cases for (1) Interval with Source == monitorapi.SourceE2ETest and
Locator[monitorapi.LocatorE2ETestKey] containing "NoExecuteTaintManager" that
overlaps a disruptive test interval and is an error (should be removed), (2)
same but non-overlapping (should be kept), (3) Source != SourceE2ETest (should
be kept), (4) missing LocatorE2ETestKey (should be kept), and (5) overlapping
but not an error event (monitorapi.IsErrorEvent false, should be kept); use
utility.IntervalsOverlap-compatible intervals and assert the returned
monitorapi.Intervals matches expected kept/removed items for each case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fc38f007-4530-4cea-8921-469aef1452b2

📥 Commits

Reviewing files that changed from the base of the PR and between e065a14 and 6dfa76e.

📒 Files selected for processing (1)
  • pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go

@openshift-ci-robot
Copy link

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@stbenjam
Copy link
Member Author

/override ci/prow/e2e-aws-ovn-microshift

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 11, 2026

@stbenjam: Overrode contexts on behalf of stbenjam: ci/prow/e2e-aws-ovn-microshift

Details

In response to this:

/override ci/prow/e2e-aws-ovn-microshift

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 11, 2026

@stbenjam: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 02b4172 into openshift:main Mar 11, 2026
21 checks passed
@openshift-ci-robot
Copy link

@stbenjam: Jira Issue Verification Checks: Jira Issue OCPBUGS-78191
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-78191 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

The upstream Kubernetes NoExecuteTaintManager serial test applies NoExecute taints to worker nodes where its test pods are scheduled. This evicts all pods on those nodes that lack a matching toleration, including metrics-server replicas. When both metrics-server pods are evicted simultaneously, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 for ~25-30s until replacement pods become ready.

With 3 worker nodes and metrics-server running 2 replicas on 2 of them, there is a ~22% chance (2/9) that both test pods land on the metrics-server nodes, causing both replicas to be evicted at once. The existing P99-based disruption threshold does not account for this because serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter this test. The result is a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage.

This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations. Filter out disruption intervals that overlap with the NoExecuteTaintManager test window so this expected disruption is not counted against the threshold.

Summary by CodeRabbit

  • Bug Fixes
  • Improved disruption analysis by excluding intervals that overlap known disruptive serial tests, reducing false positives in disruption detection.
  • Ensures subsequent test reporting uses the filtered intervals so generated test results reflect the adjusted analysis.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #30857

Details

In response to this:

@stbenjam: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.21
/cherrypick release-4.20
/cherrypick release-4.19

In response to this:

/jira backport release-4.21,release-4.20,release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #30858

Details

In response to this:

@stbenjam: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.21
/cherrypick release-4.20
/cherrypick release-4.19

In response to this:

/jira backport release-4.21,release-4.20,release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-cherrypick-robot

@openshift-ci-robot: new pull request created: #30859

Details

In response to this:

@stbenjam: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.21
/cherrypick release-4.20
/cherrypick release-4.19

In response to this:

/jira backport release-4.21,release-4.20,release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants