Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562
Fix version-stuck: reconcile SharedDocs via Collaborate.start after provisioner import#4562
Conversation
…n write Removes reconcile_or_reset from Persistence.bind and instead has the provisioner delete stale DocumentState for any workflow it changes. Adds DocumentState.delete_for_document/1 for this purpose.
Remove tests for reconcile_or_reset and reconcile_workflow_metadata, which were deleted from Persistence.bind/3. Replace with tests that verify the new behaviour: persisted state loads as-is, and the provisioner deletes stale DocumentState before writing to the DB. Adds three provisioner tests covering the invalidate_document_states filtering logic (changed, unchanged, and soft-deleted workflows).
Previously, the provisioner tried to handle the "nobody online" case by deleting stale DocumentState rows inside the DB transaction. This left the "someone online" case unaddressed and required the provisioner to know about the online/offline distinction. Replace both paths with a single unified approach: after the DB transaction commits, call WorkflowReconciler.reconcile_workflow_from_db/2 for each updated workflow. The reconciler calls Collaborate.start (joining an existing SharedDoc if one is alive, or starting a fresh one if not), applies a state-driven diff via apply_db_state_to_doc, then calls Session.stop. The shutdown chain flushes the corrected state to DocumentState before any real user opens the workflow. Key changes: - workflow_reconciler.ex: new reconcile_workflow_from_db/2 takes the provisioner actor (User or ProjectRepoConnection) and applies the full DB→Y.Doc diff via Collaborate.start/Session.stop. Broadcasts :workflow_updated_externally on PubSub so connected clients update their baseWorkflow and clear the unsaved-changes indicator. - collaborate.ex: start/1 retries once after a 1ms sleep if Session.init returns :shared_doc_not_found — handles the 0ms auto_exit race where a dying SharedDoc is still registered in :pg when the reconciler's fresh session starts. - session.ex: presence tracking and untracking are now guarded by is_struct(user, User) — ProjectRepoConnection actors skip Presence entirely. SharedDoc.observe in init is wrapped in try/catch :exit so a dying SharedDoc returns a clean stop reason instead of crashing the session. - provisioner.ex: passes user_or_repo_connection through to the reconciler; removes invalidate_document_states and the DocumentState alias. - document_state.ex: removes delete_for_document/1 (no longer needed). - workflow_channel.ex: handle_info for :workflow_updated_externally pushes workflow_saved to each connected client so baseWorkflow stays in sync.
- external_workflow_update_test.exs: all three "someone online" tests now start sessions via Collaborate.start instead of start_supervised! Session directly. DocumentSupervisor is still started via start_supervised! (ExUnit must own it for PersistenceWriter flush ordering), but now receives the Registry name so ensure_doc_supervisor_stopped can find it. Each test calls ensure_doc_supervisor_stopped in-body after the last Session.stop and also registers it as an on_exit safety net — prevents DB connection warnings from PersistenceWriter flushes racing the Ecto sandbox teardown. Adds a new "GitHub sync" test: ProjectRepoConnection actor flows through the provisioner and reconciler correctly, no presence tracking fires, and the next user sees the synced state. - persistence_test.exs: removes the delete_for_document test — that function no longer exists. - provisioner_test.exs: removes the three DocumentState invalidation tests (invalidates-on-change, preserves-on-no-change, skips-on-soft-delete) — the invalidation approach has been replaced by the reconciler.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4562 +/- ##
==========================================
- Coverage 89.52% 89.48% -0.04%
==========================================
Files 441 441
Lines 21125 21264 +139
==========================================
+ Hits 18912 19028 +116
- Misses 2213 2236 +23 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@stuartc would love to hear your thoughts on everything! I will be tying up test cases and doing some more testing tomorrow. |
|
@stuartc I am doing my own review on this now and adding tests for provisioner. Feel free to shout if there's something alarming in the meantime. |
|
I've completed testing of the CLI ✅ |
| |> Enum.flat_map(fn {key, db_val} -> | ||
| case Yex.Map.fetch(ydoc_job, key) do | ||
| {:ok, ^db_val} -> [] | ||
| _ -> [{:set_field, ydoc_job, key, db_val}] |
There was a problem hiding this comment.
What came up in my review:
The _ -> branch can't distinguish "provisioner changed this" from "user has unsaved edits here". DB is truth.
This is acknowledged in the PR (note 2). The correct long-term fix would require the provisioner to write through the Y.Doc, producing real CRDT operations so the CRDT can merge provisioner intent with user unsaved changes. That's a much bigger change and is tracked as a follow-up.
find_index_in_array returns nil when an ID is not found (standard
Enum.find_index behaviour). Without a nil filter, a phantom ID that
can't be located produces a {:delete, array, nil} tuple, and
Yex.Array.delete/2 with a nil index has undefined behaviour — in the
worst case it raises, aborting the entire reconciliation via the rescue
block.
Add Enum.reject(&is_nil/1) after the find_index step in all three
compute ops functions (jobs, edges, triggers) so an unresolvable
phantom is silently skipped rather than crashing the reconciler.
…ler.reconcile_workflow_from_db
|
@stuartc I synced with @elias-ba on this and he flagged about the issue of losing unsaved changes for a user actively editing (which we are already aware of). He's going to dive into the code more and share his thoughts. I think something to bear in mind when reviewing this is to make sure that it won't be too difficult to add something on top of it to make unsaved changes work more nicely (like the rebase we discussed as a follow-on issue). At the very least we might want to notify a user via toast that their workflow has been updated due to external change. He also suggested that we should maybe sync with Joe next week to talk through next steps / priority / if there are next steps as he might have some opinions or thoughts about this already. |
Description
This PR fixes a class of bugs where the collaborative editor ends up stuck on a stale version of a workflow after an external write (sandbox merge or CLI deploy).
The bug
When the provisioner (or sandbox merge) writes a new workflow version to the database, live editor sessions are never notified. Two failure modes:
baseWorkflowis never refreshed.DocumentState(the persisted Y.Doc snapshot) is loaded the next time a user opens the workflow, giving them the pre-merge state. The correct DB content is ignored.The fix
Both cases now go through the same code path.
After the provisioner's DB transaction commits,
WorkflowReconciler.reconcile_workflow_from_db/2is called for each updated workflow. The reconciler callsCollaborate.start/1(joining an existing live SharedDoc if one is running, starting a new one if not), applies a state-based diff viaapply_db_state_to_doc, then callsSession.stop. The shutdown chain flushes the corrected state toDocumentStatebefore any real user opens the workflow.After
Session.stop, the reconciler broadcasts:workflow_updated_externallyon PubSub. Each connectedWorkflowChannelprocess receives this and pushes aworkflow_savedevent to its client — the existing frontend handler updatesbaseWorkflowand clears the unsaved-changes indicator. No frontend changes were needed.Supporting changes:
Collaborate.start/1now retries once (after a 1ms sleep) ifSession.initreturns:shared_doc_not_found. This handles a 0ms auto_exit race: when the reconciler'sSession.stoptriggers the last observer to leave, the SharedDoc dies via a 0ms timer. A subsequentCollaborate.startcan find a dying process still registered in:pgbefore the timer fires. The retry absorbs this and all callers benefit.Sessionnow accepts aProjectRepoConnectionas theuseroption. Presence tracking is guarded byis_struct(user, User)— system actors skip it entirely.The previous approach (
invalidate_document_statesinside the provisioner transaction, which deleted staleDocumentStaterows) is removed. It only handled the nobody-online case and required the provisioner to know about the online/offline distinction.Closes #4535
Validation steps
Someone online during a sandbox merge:
Nobody online during a sandbox merge / CLI deploy:
Provisioner.import_documentwith changesGitHub sync (ProjectRepoConnection actor):
was online
Additional notes for the reviewer
DocumentStaterows in the provisioner transaction). They are superseded by commits 6–7. The final state of the code is what matters; the history reflects the iteration. Happy to squash if you'd prefer a cleaner history.allow_stale: false), DB optimistic locking ensures at most one provisioner succeeds per workflow — so at most one reconciler ever runs. For sandbox merge (allow_stale: true), two concurrent merges to the same workflow could both succeed and both run a reconciler.Collaborate.start/1has a pre-existing TOCTOU race (check-then-act on the SharedDoc lookup) that could cause one reconciler to fail silently; the other would still complete correctly. This is not introduced by this PR and is a realistic edge case only under concurrent sandbox merges. A follow-on improvement would be for the reconciler to re-load the workflow from DB by ID rather than using the struct passed in at transaction time, so the last writer always applies current state.workflowSavedHandlerincreateSessionContextStore.tsalready handlesworkflow_savedby callingsetBaseWorkflow()andsetLatestSnapshotLockVersion().AI Usage
Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):
You can read more details in our
Responsible AI Policy
Pre-submission checklist
/reviewwith Claude Code)
(e.g.,
:owner,:admin,:editor,:viewer)