Skip to content

Fix exception masking after two getVersion() calls#2816

Open
eamsden wants to merge 6 commits intomasterfrom
eamsden/runner-closed-error
Open

Fix exception masking after two getVersion() calls#2816
eamsden wants to merge 6 commits intomasterfrom
eamsden/runner-closed-error

Conversation

@eamsden
Copy link
Contributor

@eamsden eamsden commented Mar 25, 2026

What was changed

Updated the internal getVersion() callback flow so WorkflowStateMachines only runs an extra eventLoop() when the callback actually schedules a workflow-
thread resume.

In the failing case, a second getVersion() call for the same change ID could trigger a duplicate callback from the skipped-marker path after the workflow had
already thrown. That duplicate callback used to re-enter the scheduler and mask the original workflow exception with a closed error. The fix makes the sync-layer
callback one-shot, skips the extra eventLoop() for duplicate callbacks, and rethrows the original exception when a duplicate error callback arrives.

Also renamed the repro test to WorkflowFailureGetVersionTest so it follows the module's *Test naming convention and is picked up by the standard JUnit test
task.

Why?

A workflow that calls getVersion() twice and then throws should surface the underlying workflow exception. Instead, the second getVersion() path could
overwrite that failure with a secondary worker-closed error, which made the real problem hard to diagnose.

The change preserves the legacy same-WFT behavior for normal getVersion() progress, while keeping the original workflow failure intact on the duplicate-callback
error path.

Checklist

  1. Closes Workflow task failed with an internal Java SDK runner-closed error, masking the original workflow exception. #2810 and integrates the minimal repro there into a new regression test io.temporal.workflow.failure.WorkflowFailureGetVersionTest

  2. How was this tested:

  • ./gradlew :temporal-sdk:test --tests "io.temporal.workflow.failure.WorkflowFailureGetVersionTest"
  • Verified the repro both ways by running it with one getVersion() call and with two getVersion() calls
  • ./gradlew :temporal-sdk:test --tests "io.temporal.workflow.versionTests.VersionNotSupportedWithConflictingRangesExecutionTest"
  • ./gradlew :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest"
  1. No docs update needed, this is strictly a bugfix.

@eamsden eamsden requested a review from a team as a code owner March 25, 2026 17:42
@eamsden eamsden requested a review from mjameswh March 25, 2026 17:43
@eamsden eamsden changed the title Eamsden/runner closed error Fix exception masking after two getVersion() calls Mar 25, 2026
eamsden added 4 commits March 25, 2026 12:58
When a workflow called getVersion() twice for the same change ID and
then threw, the second invocation went through the skipped-marker path.
That path could invoke
the internal callback twice: once to deliver the version and resume the
workflow, and again from exception handling after the workflow had
already thrown.

WorkflowStateMachines always ran eventLoop() after that callback. On the
second callback, that re-entered the scheduler after the runner was
already closing, which
masked the original workflow exception with Error("closed").

Fix this by making the internal getVersion callback report whether it
actually scheduled a workflow-thread resume. SyncWorkflowContext now
treats the callback as
one-shot per getVersion() call, and WorkflowStateMachines only runs
eventLoop() when that resume was really scheduled. If a duplicate
callback arrives with an exception, rethrow the original exception
instead of swallowing it.

  This keeps the legacy same-WFT progress behavior for successful
  getVersion() calls, but preserves the underlying workflow failure
  instead of replacing it with a
  worker-closed error.
@eamsden eamsden force-pushed the eamsden/runner-closed-error branch from 863501b to cdba487 Compare March 25, 2026 17:59
@eamsden
Copy link
Contributor Author

eamsden commented Mar 25, 2026

@claude review

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is well-motivated and the one-shot AtomicBoolean pattern is clean, but the throw e path in WorkflowStateMachines when a duplicate error callback arrives propagates through state machine transition code in ways that are worth a human expert verifying.

Extended reasoning...

Overview

This PR modifies the getVersion() callback flow across 6 files: ReplayWorkflowContext (interface), ReplayWorkflowContextImpl (thin delegation), WorkflowStateMachines (core state machine logic), SyncWorkflowContext (sync layer), DummySyncWorkflowContext (test stub), and a new regression test. The change upgrades the callback type from Proc2<Integer, RuntimeException> (fire-and-forget) to Func2<Integer, RuntimeException, Boolean> where true requests an extra eventLoop() turn, with backward-compatible default methods for old callers.

Security Risks

No security-sensitive code is touched. There are no auth, crypto, or permissions changes.

Level of Scrutiny

High. This touches WorkflowStateMachines — the central coordinator of workflow replay and execution. The one-shot AtomicBoolean guard in SyncWorkflowContext is straightforward and clearly prevents duplicate scheduler re-entries. However, the else if (e != null) { throw e; } branch in WorkflowStateMachines.getVersion() needs deeper verification: when a duplicate error callback fires (e.g., from notifySkippedExecuting via notifyFromException), this exception propagates out of the lambda, through InvocationStateMachine.notifyFromException(), through state machine transitions, and into prepareImpl()/handleCommand(). Whether this exception is caught at the right level or could itself cause unintended workflow task failures in edge cases requires familiarity with the full execution stack.

Other Factors

The PR includes a targeted regression test and mentions passing GetVersionMultipleCallsTest and VersionNotSupportedWithConflictingRangesExecutionTest. The interface change is backward compatible. The fix correctly handles the null-error duplicate case (does nothing) vs. the non-null-error duplicate case (rethrows). The logic is sound for the described scenario, but the propagation semantics of throw e through the state machine layer warrant a domain-expert review.

Comment on lines +1233 to +1240
return getVersion(
changeId,
minSupported,
maxSupported,
(version, exception) -> {
callback.apply(version, exception);
return true;
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this overload need to exist or can the new one just always be used now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shim for tests that expect to pass callbacks matching the old prototype. I'm digging into it more but it looks like we could just change the callbacks the tests pass at each point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced the shim with a local helper just for those tests.

Comment on lines +75 to +78
String changeId = "change-id";
Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1);
Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1);
throw new RuntimeException("Any error");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to also verify that we haven't changed the behavior of the case of two identical getVersion calls followed by not an error.

My concern is this: Now we won't run the event loop again for the second invocation (not obvious to me if we only won't if hitting an exception, though), that could mean we end up yielding when we should not.

ex:

Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1);
Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1); <-- I think we yield here now on your change
Workflow.sleep( 1 second );
return;

Which would potentially hang the workflow if the workflow task completes there and nothing ever wakes it up again. The thing to confirm is that this code doesn't cause a history that looks like:

WF Start
WFT Scheduled
WFT Started
WFT Completed
Signal Received (you would need to send a signal to "un-stuck" it)
WFT Scheduled
WFT Started
WFT Completed
Timer started
Timer Completed
Workflow completed

Where the whole second WFT should not exist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add test coverage for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/temporalio/sdk-java/pull/2816/changes#diff-bafdd80f63118cbeef6a19013cd33859580509c829cdfbc4ff8677862caf1740R51-R99

Test added for this case. I confirmed it doesn't yield after the second getVersion().

It re-uses the workflow in GetMultipleVersionCallsTest which makes additional getVersion() calls after the sleep but the pattern is there and the assertions rule out the history pattern above.

eamsden added 2 commits March 26, 2026 09:37
Remove the internal WorkflowStateMachines getVersion shim that adapted
Proc2 callbacks to the new boolean-returning callback contract.

Update VersionStateMachineTest to pass explicit boolean-returning
callbacks, using a small test-local helper to preserve the existing
event-loop continuation behavior. Add a WorkflowStateMachinesTest
that verifies returning false does not trigger an extra event-loop
turn.
Add a history-shape assertion to GetVersionMultipleCallsTest covering
two identical Workflow.getVersion(...) calls followed by
Workflow.sleep(...).

The new test verifies that the workflow completes without an external
wake-up and that history contains no extra workflow task or signal event
before the timer is started. This guards against regressing into a state
where the second identical getVersion causes an unintended yield
before sleep.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Workflow task failed with an internal Java SDK runner-closed error, masking the original workflow exception.

2 participants