Fix exception masking after two `getVersion()` calls by eamsden · Pull Request #2816 · temporalio/sdk-java

eamsden · 2026-03-25T17:42:49Z

What was changed

Updated the internal getVersion() callback flow so WorkflowStateMachines only runs an extra eventLoop() when the callback actually schedules a workflow-
thread resume.

In the failing case, a second getVersion() call for the same change ID could trigger a duplicate callback from the skipped-marker path after the workflow had
already thrown. That duplicate callback used to re-enter the scheduler and mask the original workflow exception with a closed error. The fix makes the sync-layer
callback one-shot, skips the extra eventLoop() for duplicate callbacks, and rethrows the original exception when a duplicate error callback arrives.

Also renamed the repro test to WorkflowFailureGetVersionTest so it follows the module's *Test naming convention and is picked up by the standard JUnit test
task.

Why?

A workflow that calls getVersion() twice and then throws should surface the underlying workflow exception. Instead, the second getVersion() path could
overwrite that failure with a secondary worker-closed error, which made the real problem hard to diagnose.

The change preserves the legacy same-WFT behavior for normal getVersion() progress, while keeping the original workflow failure intact on the duplicate-callback
error path.

Checklist

Closes Workflow task failed with an internal Java SDK runner-closed error, masking the original workflow exception. #2810 and integrates the minimal repro there into a new regression test io.temporal.workflow.failure.WorkflowFailureGetVersionTest
How was this tested:

./gradlew :temporal-sdk:test --tests "io.temporal.workflow.failure.WorkflowFailureGetVersionTest"
Verified the repro both ways by running it with one getVersion() call and with two getVersion() calls
./gradlew :temporal-sdk:test --tests "io.temporal.workflow.versionTests.VersionNotSupportedWithConflictingRangesExecutionTest"
./gradlew :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest"

No docs update needed, this is strictly a bugfix.

When a workflow called getVersion() twice for the same change ID and then threw, the second invocation went through the skipped-marker path. That path could invoke the internal callback twice: once to deliver the version and resume the workflow, and again from exception handling after the workflow had already thrown. WorkflowStateMachines always ran eventLoop() after that callback. On the second callback, that re-entered the scheduler after the runner was already closing, which masked the original workflow exception with Error("closed"). Fix this by making the internal getVersion callback report whether it actually scheduled a workflow-thread resume. SyncWorkflowContext now treats the callback as one-shot per getVersion() call, and WorkflowStateMachines only runs eventLoop() when that resume was really scheduled. If a duplicate callback arrives with an exception, rethrow the original exception instead of swallowing it. This keeps the legacy same-WFT progress behavior for successful getVersion() calls, but preserves the underlying workflow failure instead of replacing it with a worker-closed error.

test harness visibility

eamsden · 2026-03-25T18:18:01Z

@claude review

claude

The fix is well-motivated and the one-shot AtomicBoolean pattern is clean, but the throw e path in WorkflowStateMachines when a duplicate error callback arrives propagates through state machine transition code in ways that are worth a human expert verifying.

Extended reasoning...

Overview

This PR modifies the getVersion() callback flow across 6 files: ReplayWorkflowContext (interface), ReplayWorkflowContextImpl (thin delegation), WorkflowStateMachines (core state machine logic), SyncWorkflowContext (sync layer), DummySyncWorkflowContext (test stub), and a new regression test. The change upgrades the callback type from Proc2<Integer, RuntimeException> (fire-and-forget) to Func2<Integer, RuntimeException, Boolean> where true requests an extra eventLoop() turn, with backward-compatible default methods for old callers.

Security Risks

No security-sensitive code is touched. There are no auth, crypto, or permissions changes.

Level of Scrutiny

High. This touches WorkflowStateMachines — the central coordinator of workflow replay and execution. The one-shot AtomicBoolean guard in SyncWorkflowContext is straightforward and clearly prevents duplicate scheduler re-entries. However, the else if (e != null) { throw e; } branch in WorkflowStateMachines.getVersion() needs deeper verification: when a duplicate error callback fires (e.g., from notifySkippedExecuting via notifyFromException), this exception propagates out of the lambda, through InvocationStateMachine.notifyFromException(), through state machine transitions, and into prepareImpl()/handleCommand(). Whether this exception is caught at the right level or could itself cause unintended workflow task failures in edge cases requires familiarity with the full execution stack.

Other Factors

The PR includes a targeted regression test and mentions passing GetVersionMultipleCallsTest and VersionNotSupportedWithConflictingRangesExecutionTest. The interface change is backward compatible. The fix correctly handles the null-error duplicate case (does nothing) vs. the non-null-error duplicate case (rethrows). The logic is sound for the described scenario, but the propagation semantics of throw e through the state machine layer warrant a domain-expert review.

Sushisource · 2026-03-25T20:54:12Z

temporal-sdk/src/main/java/io/temporal/internal/statemachines/WorkflowStateMachines.java

+    return getVersion(
+        changeId,
+        minSupported,
+        maxSupported,
+        (version, exception) -> {
+          callback.apply(version, exception);
+          return true;
+        });


Does this overload need to exist or can the new one just always be used now?

It's a shim for tests that expect to pass callbacks matching the old prototype. I'm digging into it more but it looks like we could just change the callbacks the tests pass at each point.

Replaced the shim with a local helper just for those tests.

Sushisource · 2026-03-25T21:22:58Z

temporal-sdk/src/test/java/io/temporal/workflow/failure/WorkflowFailureGetVersionTest.java

+      String changeId = "change-id";
+      Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1);
+      Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1);
+      throw new RuntimeException("Any error");


I'd like to also verify that we haven't changed the behavior of the case of two identical getVersion calls followed by not an error.

My concern is this: Now we won't run the event loop again for the second invocation (not obvious to me if we only won't if hitting an exception, though), that could mean we end up yielding when we should not.

ex:

Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1); Workflow.getVersion(changeId, Workflow.DEFAULT_VERSION, 1); <-- I think we yield here now on your change Workflow.sleep( 1 second ); return;

Which would potentially hang the workflow if the workflow task completes there and nothing ever wakes it up again. The thing to confirm is that this code doesn't cause a history that looks like:

WF Start WFT Scheduled WFT Started WFT Completed Signal Received (you would need to send a signal to "un-stuck" it) WFT Scheduled WFT Started WFT Completed Timer started Timer Completed Workflow completed

Where the whole second WFT should not exist

I'll add test coverage for this.

https://github.com/temporalio/sdk-java/pull/2816/changes#diff-bafdd80f63118cbeef6a19013cd33859580509c829cdfbc4ff8677862caf1740R51-R99

Test added for this case. I confirmed it doesn't yield after the second getVersion().

It re-uses the workflow in GetMultipleVersionCallsTest which makes additional getVersion() calls after the sleep but the pattern is there and the assertions rule out the history pattern above.

Remove the internal WorkflowStateMachines getVersion shim that adapted Proc2 callbacks to the new boolean-returning callback contract. Update VersionStateMachineTest to pass explicit boolean-returning callbacks, using a small test-local helper to preserve the existing event-loop continuation behavior. Add a WorkflowStateMachinesTest that verifies returning false does not trigger an extra event-loop turn.

Add a history-shape assertion to GetVersionMultipleCallsTest covering two identical Workflow.getVersion(...) calls followed by Workflow.sleep(...). The new test verifies that the workflow completes without an external wake-up and that history contains no extra workflow task or signal event before the timer is started. This guards against regressing into a state where the second identical getVersion causes an unintended yield before sleep.

eamsden requested a review from a team as a code owner March 25, 2026 17:42

eamsden requested a review from mjameswh March 25, 2026 17:43

eamsden changed the title ~~Eamsden/runner closed error~~ Fix exception masking after two getVersion() calls Mar 25, 2026

eamsden added 4 commits March 25, 2026 12:58

Repro as test for #2810

b4e38bf

Test now succeeds when proper exception is returned.

fa825b2

Rename WorkflowFailureGetVersion -> WorkflowFailureGetVersionTest for

cdba487

test harness visibility

eamsden force-pushed the eamsden/runner-closed-error branch from 863501b to cdba487 Compare March 25, 2026 17:59

claude bot reviewed Mar 25, 2026

View reviewed changes

Sushisource reviewed Mar 25, 2026

View reviewed changes

eamsden added 2 commits March 26, 2026 09:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix exception masking after two `getVersion()` calls#2816

Fix exception masking after two `getVersion()` calls#2816
eamsden wants to merge 6 commits intomasterfrom
eamsden/runner-closed-error

eamsden commented Mar 25, 2026

Uh oh!

eamsden commented Mar 25, 2026

Uh oh!

claude bot left a comment

Uh oh!

Sushisource Mar 25, 2026

Uh oh!

eamsden Mar 26, 2026

Uh oh!

eamsden Mar 26, 2026

Uh oh!

Sushisource Mar 25, 2026

Uh oh!

eamsden Mar 26, 2026

Uh oh!

eamsden Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eamsden commented Mar 25, 2026

What was changed

Why?

Checklist

Uh oh!

eamsden commented Mar 25, 2026

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Overview

Security Risks

Level of Scrutiny

Other Factors

Uh oh!

Sushisource Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

eamsden Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

eamsden Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Sushisource Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

eamsden Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

eamsden Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants