Fix exception masking after two getVersion() calls#2816
Fix exception masking after two getVersion() calls#2816
getVersion() calls#2816Conversation
getVersion() calls
When a workflow called getVersion() twice for the same change ID and
then threw, the second invocation went through the skipped-marker path.
That path could invoke
the internal callback twice: once to deliver the version and resume the
workflow, and again from exception handling after the workflow had
already thrown.
WorkflowStateMachines always ran eventLoop() after that callback. On the
second callback, that re-entered the scheduler after the runner was
already closing, which
masked the original workflow exception with Error("closed").
Fix this by making the internal getVersion callback report whether it
actually scheduled a workflow-thread resume. SyncWorkflowContext now
treats the callback as
one-shot per getVersion() call, and WorkflowStateMachines only runs
eventLoop() when that resume was really scheduled. If a duplicate
callback arrives with an exception, rethrow the original exception
instead of swallowing it.
This keeps the legacy same-WFT progress behavior for successful
getVersion() calls, but preserves the underlying workflow failure
instead of replacing it with a
worker-closed error.
test harness visibility
863501b to
cdba487
Compare
|
@claude review |
There was a problem hiding this comment.
The fix is well-motivated and the one-shot AtomicBoolean pattern is clean, but the throw e path in WorkflowStateMachines when a duplicate error callback arrives propagates through state machine transition code in ways that are worth a human expert verifying.
Extended reasoning...
Overview
This PR modifies the getVersion() callback flow across 6 files: ReplayWorkflowContext (interface), ReplayWorkflowContextImpl (thin delegation), WorkflowStateMachines (core state machine logic), SyncWorkflowContext (sync layer), DummySyncWorkflowContext (test stub), and a new regression test. The change upgrades the callback type from Proc2<Integer, RuntimeException> (fire-and-forget) to Func2<Integer, RuntimeException, Boolean> where true requests an extra eventLoop() turn, with backward-compatible default methods for old callers.
Security Risks
No security-sensitive code is touched. There are no auth, crypto, or permissions changes.
Level of Scrutiny
High. This touches WorkflowStateMachines — the central coordinator of workflow replay and execution. The one-shot AtomicBoolean guard in SyncWorkflowContext is straightforward and clearly prevents duplicate scheduler re-entries. However, the else if (e != null) { throw e; } branch in WorkflowStateMachines.getVersion() needs deeper verification: when a duplicate error callback fires (e.g., from notifySkippedExecuting via notifyFromException), this exception propagates out of the lambda, through InvocationStateMachine.notifyFromException(), through state machine transitions, and into prepareImpl()/handleCommand(). Whether this exception is caught at the right level or could itself cause unintended workflow task failures in edge cases requires familiarity with the full execution stack.
Other Factors
The PR includes a targeted regression test and mentions passing GetVersionMultipleCallsTest and VersionNotSupportedWithConflictingRangesExecutionTest. The interface change is backward compatible. The fix correctly handles the null-error duplicate case (does nothing) vs. the non-null-error duplicate case (rethrows). The logic is sound for the described scenario, but the propagation semantics of throw e through the state machine layer warrant a domain-expert review.
| return getVersion( | ||
| changeId, | ||
| minSupported, | ||
| maxSupported, | ||
| (version, exception) -> { | ||
| callback.apply(version, exception); | ||
| return true; | ||
| }); |
There was a problem hiding this comment.
Does this overload need to exist or can the new one just always be used now?
There was a problem hiding this comment.
It's a shim for tests that expect to pass callbacks matching the old prototype. I'm digging into it more but it looks like we could just change the callbacks the tests pass at each point.
There was a problem hiding this comment.
Replaced the shim with a local helper just for those tests.
| String changeId = "change-id"; | ||
| Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1); | ||
| Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1); | ||
| throw new RuntimeException("Any error"); |
There was a problem hiding this comment.
I'd like to also verify that we haven't changed the behavior of the case of two identical getVersion calls followed by not an error.
My concern is this: Now we won't run the event loop again for the second invocation (not obvious to me if we only won't if hitting an exception, though), that could mean we end up yielding when we should not.
ex:
Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1);
Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1); <-- I think we yield here now on your change
Workflow.sleep( 1 second );
return;
Which would potentially hang the workflow if the workflow task completes there and nothing ever wakes it up again. The thing to confirm is that this code doesn't cause a history that looks like:
WF Start
WFT Scheduled
WFT Started
WFT Completed
Signal Received (you would need to send a signal to "un-stuck" it)
WFT Scheduled
WFT Started
WFT Completed
Timer started
Timer Completed
Workflow completed
Where the whole second WFT should not exist
There was a problem hiding this comment.
I'll add test coverage for this.
There was a problem hiding this comment.
Test added for this case. I confirmed it doesn't yield after the second getVersion().
It re-uses the workflow in GetMultipleVersionCallsTest which makes additional getVersion() calls after the sleep but the pattern is there and the assertions rule out the history pattern above.
Remove the internal WorkflowStateMachines getVersion shim that adapted Proc2 callbacks to the new boolean-returning callback contract. Update VersionStateMachineTest to pass explicit boolean-returning callbacks, using a small test-local helper to preserve the existing event-loop continuation behavior. Add a WorkflowStateMachinesTest that verifies returning false does not trigger an extra event-loop turn.
Add a history-shape assertion to GetVersionMultipleCallsTest covering two identical Workflow.getVersion(...) calls followed by Workflow.sleep(...). The new test verifies that the workflow completes without an external wake-up and that history contains no extra workflow task or signal event before the timer is started. This guards against regressing into a state where the second identical getVersion causes an unintended yield before sleep.
What was changed
Updated the internal
getVersion()callback flow soWorkflowStateMachinesonly runs an extraeventLoop()when the callback actually schedules a workflow-thread resume.
In the failing case, a second
getVersion()call for the same change ID could trigger a duplicate callback from the skipped-marker path after the workflow hadalready thrown. That duplicate callback used to re-enter the scheduler and mask the original workflow exception with a
closederror. The fix makes the sync-layercallback one-shot, skips the extra
eventLoop()for duplicate callbacks, and rethrows the original exception when a duplicate error callback arrives.Also renamed the repro test to
WorkflowFailureGetVersionTestso it follows the module's*Testnaming convention and is picked up by the standard JUnit testtask.
Why?
A workflow that calls
getVersion()twice and then throws should surface the underlying workflow exception. Instead, the secondgetVersion()path couldoverwrite that failure with a secondary worker-closed error, which made the real problem hard to diagnose.
The change preserves the legacy same-WFT behavior for normal
getVersion()progress, while keeping the original workflow failure intact on the duplicate-callbackerror path.
Checklist
Closes Workflow task failed with an internal Java SDK runner-closed error, masking the original workflow exception. #2810 and integrates the minimal repro there into a new regression test
io.temporal.workflow.failure.WorkflowFailureGetVersionTestHow was this tested:
./gradlew :temporal-sdk:test --tests "io.temporal.workflow.failure.WorkflowFailureGetVersionTest"getVersion()call and with twogetVersion()calls./gradlew :temporal-sdk:test --tests "io.temporal.workflow.versionTests.VersionNotSupportedWithConflictingRangesExecutionTest"./gradlew :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest"