Skip to content

Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562

Open
lmac-1 wants to merge 12 commits intomainfrom
version-stuck-provisioner-2
Open

Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562
lmac-1 wants to merge 12 commits intomainfrom
version-stuck-provisioner-2

Conversation

@lmac-1
Copy link
Copy Markdown
Collaborator

@lmac-1 lmac-1 commented Mar 24, 2026

Description

This PR fixes a class of bugs where the collaborative editor ends up stuck on a stale version of a workflow after an external write (sandbox merge or CLI deploy).

The bug

When the provisioner (or sandbox merge) writes a new workflow version to the database, live editor sessions are never notified. Two failure modes:

  1. Someone is online: the SharedDoc stays alive but is never updated. Users with the workflow open see the pre-merge state, and their next save silently overwrites the provisioner's work. The unsaved-changes indicator (red dot) stays on permanently because baseWorkflow is never refreshed.
  2. Nobody is online: the stale DocumentState (the persisted Y.Doc snapshot) is loaded the next time a user opens the workflow, giving them the pre-merge state. The correct DB content is ignored.

The fix

Both cases now go through the same code path.

After the provisioner's DB transaction commits, WorkflowReconciler.reconcile_workflow_from_db/2 is called for each updated workflow. The reconciler calls Collaborate.start/1 (joining an existing live SharedDoc if one is running, starting a new one if not), applies a state-based diff via apply_db_state_to_doc, then calls Session.stop. The shutdown chain flushes the corrected state to DocumentState before any real user opens the workflow.

After Session.stop, the reconciler broadcasts :workflow_updated_externally on PubSub. Each connected WorkflowChannel process receives this and pushes a workflow_saved event to its client — the existing frontend handler updates baseWorkflow and clears the unsaved-changes indicator. No frontend changes were needed.

Supporting changes:

  • Collaborate.start/1 now retries once (after a 1ms sleep) if Session.init returns :shared_doc_not_found. This handles a 0ms auto_exit race: when the reconciler's Session.stop triggers the last observer to leave, the SharedDoc dies via a 0ms timer. A subsequent Collaborate.start can find a dying process still registered in :pg before the timer fires. The retry absorbs this and all callers benefit.

  • Session now accepts a ProjectRepoConnection as the user option. Presence tracking is guarded by is_struct(user, User) — system actors skip it entirely.

  • The previous approach (invalidate_document_states inside the provisioner transaction, which deleted stale DocumentState rows) is removed. It only handled the nobody-online case and required the provisioner to know about the online/offline distinction.

Closes #4535

Validation steps

  1. Someone online during a sandbox merge:

    • Open a workflow in the editor (Tab A)
    • Make a sandbox for the project in Tab B
    • In Tab A, make some changes to the workflow but don't save.
    • In Tab B, make a different change to the workflow, save and then merge the sandbox.
    • Confirm Tab A's Y.Doc is updated in real time (new jobs/edges appear) unsaved changes will be deleted, this is expected
    • Confirm the unsaved-changes indicator (red dot) clears after the merge
    • Confirm saving from Tab A produces the merged state, not the pre-merge state
  2. Nobody online during a sandbox merge / CLI deploy:

    • Make sure the workflow has some unsaved changes and then close it
    • Ensure no sessions are open for the workflow (close all tabs)
    • Run a sandbox merge or Provisioner.import_document with changes
    • Open the workflow in a new tab
    • Confirm the new tab shows the merged state, not the pre-merge state
  3. GitHub sync (ProjectRepoConnection actor):

    • Trigger a GitHub sync that updates a workflow
    • Confirm the same behaviour as steps 1 and 2 depending on whether anyone
      was online

Additional notes for the reviewer

  1. @stuartc, if the reconciler part is not correct, I might need you to jump in to help a bit as it's a bit over my head and I fear we will be going in circles.
  2. At the moment, if a user is active with unsaved changes when an external change happens, their changes will be lost. This PR does not address this but can be discussed in a follow-up issue to improve the user experience.
  3. Commits 1–5 on this branch are the old approach (deleting DocumentState rows in the provisioner transaction). They are superseded by commits 6–7. The final state of the code is what matters; the history reflects the iteration. Happy to squash if you'd prefer a cleaner history.
  4. Concurrent sandbox merges (not fixed here): For CLI deployments (allow_stale: false), DB optimistic locking ensures at most one provisioner succeeds per workflow — so at most one reconciler ever runs. For sandbox merge (allow_stale: true), two concurrent merges to the same workflow could both succeed and both run a reconciler. Collaborate.start/1 has a pre-existing TOCTOU race (check-then-act on the SharedDoc lookup) that could cause one reconciler to fail silently; the other would still complete correctly. This is not introduced by this PR and is a realistic edge case only under concurrent sandbox merges. A follow-on improvement would be for the reconciler to re-load the workflow from DB by ID rather than using the struct passed in at transaction time, so the last writer always applies current state.
  5. No frontend changes. The existing workflowSavedHandler increateSessionContextStore.ts already handles workflow_saved by callingsetBaseWorkflow() and setLatestSnapshotLockVersion().

AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our
Responsible AI Policy

Pre-submission checklist

  • I have performed an AI review of my code (we recommend using /review
    with Claude Code)
  • I have implemented and tested all related authorization policies.
    (e.g., :owner, :admin, :editor, :viewer)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

lmac-1 added 4 commits March 24, 2026 12:50
…n write

Removes reconcile_or_reset from Persistence.bind and instead has the
provisioner delete stale DocumentState for any workflow it changes.
Adds DocumentState.delete_for_document/1 for this purpose.
@github-project-automation github-project-automation bot moved this to New Issues in Core Mar 24, 2026
lmac-1 added 3 commits March 24, 2026 20:38
Remove tests for reconcile_or_reset and reconcile_workflow_metadata, which
were deleted from Persistence.bind/3. Replace with tests that verify the
new behaviour: persisted state loads as-is, and the provisioner deletes
stale DocumentState before writing to the DB.

Adds three provisioner tests covering the invalidate_document_states
filtering logic (changed, unchanged, and soft-deleted workflows).
Previously, the provisioner tried to handle the "nobody online" case by
deleting stale DocumentState rows inside the DB transaction. This left the
"someone online" case unaddressed and required the provisioner to know about
the online/offline distinction.

Replace both paths with a single unified approach: after the DB transaction
commits, call WorkflowReconciler.reconcile_workflow_from_db/2 for each
updated workflow. The reconciler calls Collaborate.start (joining an existing
SharedDoc if one is alive, or starting a fresh one if not), applies a
state-driven diff via apply_db_state_to_doc, then calls Session.stop. The
shutdown chain flushes the corrected state to DocumentState before any real
user opens the workflow.

Key changes:
- workflow_reconciler.ex: new reconcile_workflow_from_db/2 takes the
  provisioner actor (User or ProjectRepoConnection) and applies the full
  DB→Y.Doc diff via Collaborate.start/Session.stop. Broadcasts
  :workflow_updated_externally on PubSub so connected clients update their
  baseWorkflow and clear the unsaved-changes indicator.
- collaborate.ex: start/1 retries once after a 1ms sleep if Session.init
  returns :shared_doc_not_found — handles the 0ms auto_exit race where a
  dying SharedDoc is still registered in :pg when the reconciler's fresh
  session starts.
- session.ex: presence tracking and untracking are now guarded by
  is_struct(user, User) — ProjectRepoConnection actors skip Presence entirely.
  SharedDoc.observe in init is wrapped in try/catch :exit so a dying SharedDoc
  returns a clean stop reason instead of crashing the session.
- provisioner.ex: passes user_or_repo_connection through to the reconciler;
  removes invalidate_document_states and the DocumentState alias.
- document_state.ex: removes delete_for_document/1 (no longer needed).
- workflow_channel.ex: handle_info for :workflow_updated_externally pushes
  workflow_saved to each connected client so baseWorkflow stays in sync.
- external_workflow_update_test.exs: all three "someone online" tests now
  start sessions via Collaborate.start instead of start_supervised! Session
  directly. DocumentSupervisor is still started via start_supervised! (ExUnit
  must own it for PersistenceWriter flush ordering), but now receives the
  Registry name so ensure_doc_supervisor_stopped can find it. Each test calls
  ensure_doc_supervisor_stopped in-body after the last Session.stop and also
  registers it as an on_exit safety net — prevents DB connection warnings from
  PersistenceWriter flushes racing the Ecto sandbox teardown.
  Adds a new "GitHub sync" test: ProjectRepoConnection actor flows through the
  provisioner and reconciler correctly, no presence tracking fires, and the
  next user sees the synced state.
- persistence_test.exs: removes the delete_for_document test — that function
  no longer exists.
- provisioner_test.exs: removes the three DocumentState invalidation tests
  (invalidates-on-change, preserves-on-no-change, skips-on-soft-delete) — the
  invalidation approach has been replaced by the reconciler.
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 92.53731% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.48%. Comparing base (cdb9b04) to head (8d42984).

Files with missing lines Patch % Lines
lib/lightning/collaboration/workflow_reconciler.ex 92.02% 13 Missing ⚠️
lib/lightning/collaboration/session.ex 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4562      +/-   ##
==========================================
- Coverage   89.52%   89.48%   -0.04%     
==========================================
  Files         441      441              
  Lines       21125    21264     +139     
==========================================
+ Hits        18912    19028     +116     
- Misses       2213     2236      +23     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lmac-1 lmac-1 changed the title [draft] version stuck Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import Mar 25, 2026
@lmac-1
Copy link
Copy Markdown
Collaborator Author

lmac-1 commented Mar 25, 2026

@stuartc would love to hear your thoughts on everything! I will be tying up test cases and doing some more testing tomorrow.

@theroinaochieng theroinaochieng requested a review from stuartc March 26, 2026 08:17
@lmac-1 lmac-1 marked this pull request as ready for review March 27, 2026 07:54
@lmac-1
Copy link
Copy Markdown
Collaborator Author

lmac-1 commented Mar 27, 2026

@stuartc I am doing my own review on this now and adding tests for provisioner. Feel free to shout if there's something alarming in the meantime.

@lmac-1
Copy link
Copy Markdown
Collaborator Author

lmac-1 commented Mar 27, 2026

I've completed testing of the CLI ✅

|> Enum.flat_map(fn {key, db_val} ->
case Yex.Map.fetch(ydoc_job, key) do
{:ok, ^db_val} -> []
_ -> [{:set_field, ydoc_job, key, db_val}]
Copy link
Copy Markdown
Collaborator Author

@lmac-1 lmac-1 Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What came up in my review:

The _ -> branch can't distinguish "provisioner changed this" from "user has unsaved edits here". DB is truth.

This is acknowledged in the PR (note 2). The correct long-term fix would require the provisioner to write through the Y.Doc, producing real CRDT operations so the CRDT can merge provisioner intent with user unsaved changes. That's a much bigger change and is tracked as a follow-up.

lmac-1 added 3 commits March 27, 2026 08:54
find_index_in_array returns nil when an ID is not found (standard
Enum.find_index behaviour). Without a nil filter, a phantom ID that
can't be located produces a {:delete, array, nil} tuple, and
Yex.Array.delete/2 with a nil index has undefined behaviour — in the
worst case it raises, aborting the entire reconciliation via the rescue
block.

Add Enum.reject(&is_nil/1) after the find_index step in all three
compute ops functions (jobs, edges, triggers) so an unresolvable
phantom is silently skipped rather than crashing the reconciler.
@lmac-1
Copy link
Copy Markdown
Collaborator Author

lmac-1 commented Mar 27, 2026

@elias-ba @stuartc I have not manually tested the validation steps for GitHub sync that updates a workflow still. Will work out how to test this properly later this morning.

@lmac-1
Copy link
Copy Markdown
Collaborator Author

lmac-1 commented Mar 27, 2026

@stuartc I synced with @elias-ba on this and he flagged about the issue of losing unsaved changes for a user actively editing (which we are already aware of). He's going to dive into the code more and share his thoughts.

I think something to bear in mind when reviewing this is to make sure that it won't be too difficult to add something on top of it to make unsaved changes work more nicely (like the rebase we discussed as a follow-on issue). At the very least we might want to notify a user via toast that their workflow has been updated due to external change.

He also suggested that we should maybe sync with Joe next week to talk through next steps / priority / if there are next steps as he might have some opinions or thoughts about this already.

@lmac-1 lmac-1 requested a review from elias-ba March 27, 2026 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

Merging sandbox into workflow with unsaved changes sometimes causes stuck version

1 participant