OCPBUGS-78016: Fix double counting of pod restart events #30862
xueqzhan wants to merge 1 commit into openshift:main
Pipeline controller notification: This repository is configured in automatic mode.
@xueqzhan: This pull request references Jira Issue OCPBUGS-78016, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
WalkthroughIntroduced an Changes
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 3 passed, 2 failed (1 warning, 1 inconclusive)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/monitortests/node/watchpods/collection.go`:
- Around line 221-253: The two code paths for recording container exit in the
LastTerminationState branch are inconsistent: in the !alreadyRecorded branch the
ExitCode == 0 path uses monitorapi.Info while the other (existing) path uses
monitorapi.Error; change the new branch to use monitorapi.Error instead of
monitorapi.Info so both paths use the same severity. Locate the !alreadyRecorded
block that constructs intervals with
containerStatus.LastTerminationState.Terminated and replace the monitorapi.Info
call in the ExitCode == 0 branch with monitorapi.Error so the interval level
matches the existing behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 45e82f29-a3bb-4691-ae5b-2d62d2989052
📒 Files selected for processing (1)
pkg/monitortests/node/watchpods/collection.go
// if we are transitioning to a terminated state in LastTerminationState
// Check if we already recorded this exit when it was in State.Terminated
// If oldContainerStatus.State.Terminated matches the current LastTerminationState.Terminated,
// then we already recorded this exit and should skip to avoid double-counting
alreadyRecorded := oldContainerStatus.State.Terminated != nil &&
    oldContainerStatus.State.Terminated.FinishedAt.Equal(&containerStatus.LastTerminationState.Terminated.FinishedAt)

if !alreadyRecorded {
    // We missed the original exit event, record it now as a safety net
    if containerStatus.LastTerminationState.Terminated.ExitCode != 0 {
        intervals = append(intervals,
            monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Error).
                Locator(monitorapi.NewLocator().ContainerFromPod(pod, containerName)).
                Message(monitorapi.NewMessage().
                    Reason(monitorapi.ContainerReasonContainerExit).
                    WithAnnotation(monitorapi.AnnotationContainerExitCode, fmt.Sprintf("%d", containerStatus.LastTerminationState.Terminated.ExitCode)).
                    Cause(containerStatus.LastTerminationState.Terminated.Reason).
                    HumanMessage(containerStatus.LastTerminationState.Terminated.Message),
                ).BuildNow(),
        )
    } else {
        intervals = append(intervals,
            monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Info).
                Locator(monitorapi.NewLocator().ContainerFromPod(pod, containerName)).
                Message(monitorapi.NewMessage().
                    Reason(monitorapi.ContainerReasonContainerExit).
                    WithAnnotation(monitorapi.AnnotationContainerExitCode, "0").
                    Cause(containerStatus.LastTerminationState.Terminated.Reason).
                    HumanMessage(containerStatus.LastTerminationState.Terminated.Message)).
                BuildNow(),
        )
    }
}
Inconsistent interval level for exit code 0 between new and existing code paths.
The alreadyRecorded guard logic correctly prevents double-counting by comparing FinishedAt timestamps.
However, there's an inconsistency in interval levels:
- New code (line 243): Uses monitorapi.Info when ExitCode == 0
- Existing code (line 271): Uses monitorapi.Error when ExitCode == 0
This means a successful container exit (code 0) could be logged as Info or Error depending on which code path fires first, leading to inconsistent monitoring data.
Please align the interval levels—either both should use Info for successful exits, or both should use Error.
🔧 Proposed fix to align with existing behavior (use Error)
  } else {
      intervals = append(intervals,
-         monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Info).
+         monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Error).
              Locator(monitorapi.NewLocator().ContainerFromPod(pod, containerName)).
              Message(monitorapi.NewMessage().
                  Reason(monitorapi.ContainerReasonContainerExit).
🔧 Alternative fix: Update existing code to use Info for exit code 0
If Info is semantically correct for successful exits, update line 271 as well:
  } else {
      intervals = append(intervals,
-         monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Error).
+         monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Info).
              Locator(monitorapi.NewLocator().ContainerFromPod(pod, containerName)).
              Message(monitorapi.NewMessage().
                  Reason(monitorapi.ContainerReasonContainerExit).
📝 Committable suggestion
// if we are transitioning to a terminated state in LastTerminationState
// Check if we already recorded this exit when it was in State.Terminated
// If oldContainerStatus.State.Terminated matches the current LastTerminationState.Terminated,
// then we already recorded this exit and should skip to avoid double-counting
alreadyRecorded := oldContainerStatus.State.Terminated != nil &&
    oldContainerStatus.State.Terminated.FinishedAt.Equal(&containerStatus.LastTerminationState.Terminated.FinishedAt)
if !alreadyRecorded {
    // We missed the original exit event, record it now as a safety net
    if containerStatus.LastTerminationState.Terminated.ExitCode != 0 {
        intervals = append(intervals,
            monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Error).
                Locator(monitorapi.NewLocator().ContainerFromPod(pod, containerName)).
                Message(monitorapi.NewMessage().
                    Reason(monitorapi.ContainerReasonContainerExit).
                    WithAnnotation(monitorapi.AnnotationContainerExitCode, fmt.Sprintf("%d", containerStatus.LastTerminationState.Terminated.ExitCode)).
                    Cause(containerStatus.LastTerminationState.Terminated.Reason).
                    HumanMessage(containerStatus.LastTerminationState.Terminated.Message),
                ).BuildNow(),
        )
    } else {
        intervals = append(intervals,
            monitorapi.NewInterval(monitorapi.SourcePodMonitor, monitorapi.Error).
                Locator(monitorapi.NewLocator().ContainerFromPod(pod, containerName)).
                Message(monitorapi.NewMessage().
                    Reason(monitorapi.ContainerReasonContainerExit).
                    WithAnnotation(monitorapi.AnnotationContainerExitCode, "0").
                    Cause(containerStatus.LastTerminationState.Terminated.Reason).
                    HumanMessage(containerStatus.LastTerminationState.Terminated.Message)).
                BuildNow(),
        )
    }
}
This wasn't really new code. It just puts the old code in a new scope.
@xueqzhan, you're right — looking at the diff more carefully, the monitorapi.Info for exit code 0 was already present in the original code before this PR. The change only wrapped the existing logic in the if !alreadyRecorded guard without altering the interval levels. My comment was incorrect. Sorry for the noise!
(ᵕ—ᴗ—)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: xueqzhan
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
@xueqzhan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6c7edd10-1d62-11f1-90ca-56d2448f888b-0
Scheduling required tests: