Skip to content

fix: enforce changeset constraints in run/step handler pipeline#4563

Draft
stuartc wants to merge 2 commits intomainfrom
fix/handler-constraint-enforcement
Draft

fix: enforce changeset constraints in run/step handler pipeline#4563
stuartc wants to merge 2 commits intomainfrom
fix/handler-constraint-enforcement

Conversation

@stuartc
Copy link
Copy Markdown
Member

@stuartc stuartc commented Mar 25, 2026

⚠️ Do not merge

This PR changes update_run/1 from update_all (ignores changeset validity) to Repo.update (enforces it). This means state transition rejections that were always declared in validate_state_change/1 but never enforced are now enforced.

The critical case: :lost:success/:failed is now rejected. In production, this race happens when a worker completes a run after the janitor has already marked it lost. Previously the worker silently won the race. Now it gets an error.

Before merging, we need to decide how to handle this — options include:

  • Allow final → final transitions when the source is :lost (the janitor yields to the worker)
  • Return a graceful "too late" response to the worker instead of an error
  • Accept the new behavior (lost runs stay lost even if the worker finishes)

The full list of transitions that are now actually enforced (previously dead code):

Transition Rule Impact
final → final (e.g. :lost:success) "already in completed state" Production race condition — needs resolution
:available → anything except :claimed "not claimed" Low risk — workers always claim first
:claimed:started without started_at validate_required Low risk — handlers always set it
completion without finished_at validate_required Low risk — handlers always set it

Description

This PR fixes constraint enforcement in the run/step handler pipeline — the hottest code path in the system (worker ↔ channel communication).

Origin: During review of #4531 (Save final state for runs), Taylor left a comment asking whether the Repo.exists? pre-check on final_dataclip_id was necessary, or overkill:

@Stu - is this necessary? I'm worried that it's overkill to check first, but also don't want a situation where we crash the channel cause it's not there.

Investigation revealed the check is necessary — but only as a workaround. The foreign_key_constraint(:final_dataclip_id) declared in Run.complete/2 is dead code because Runs.update_run/1 uses Ecto.Multi.update_all, which silently discards all changeset constraint annotations. Without the Repo.exists? check, a non-existent dataclip ID would crash the channel process with an unhandled Postgrex.Error.

This turned out to be a broader issue affecting two of the four handler paths:

Handler Update method Constraints enforced?
StartRun update_runupdate_all NOvalid? not checked, update_all ignores errors
CompleteRun update_runupdate_all Partial — state machine checked, FK is dead code
StartStep Repo.insert Yes
CompleteStep Repo.update NOStep.finished/2 missing assoc_constraint

Root cause: update_runs/2 was built for Queue.claim's bulk update_all path with a CTE optimization fence. update_run/1 piggybacked on it, destructuring changesets into raw k/v pairs and losing all constraint metadata.

Fixes:

  • Rewrite update_run/1 to use Repo.update (honors all changeset constraints)
  • Restrict update_runs/2 to keyword-list only via guard (Queue bulk path unchanged)
  • Add valid? check in StartRun.call before calling update_run
  • Add assoc_constraint(:output_dataclip) to Step.finished/2
  • Extract broadcast_run_updates/1 as shared side-effect function
  • Keep Repo.exists? pre-check in CompleteRun for wire format compatibility; FK constraint is now a safety net

Related to #4531

Validation steps

  1. Run handler constraint tests: mix test test/lightning/runs/handlers_test.exs (8 pass, 3 skipped — skipped tests document a separate state machine catch-all issue)
  2. Run existing regression suites: mix test test/lightning/runs_test.exs test/lightning_web/channels/run_channel_test.exs (89/89 pass)
  3. mix format --check-formatted && mix credo --strict --all — clean

Additional notes for the reviewer

  1. The Repo.exists? check in CompleteRun.resolve_final_dataclip is intentionally kept. Removing it would change the wire format of the error response to the worker (%{errors: %{field: [msg]}} plain map vs %{field: [msg]} changeset format). The FK constraint now works as defense-in-depth.
  2. Three tests are @tag :skip — they document a pre-existing gap in Run.validate_state_change/1 where the catch-all {_from, _to} -> changeset silently accepts transitions from final states. This is a separate concern.
  3. The security review flagged two pre-existing risks unrelated to this change: get_step not scoped to run_id, and final_dataclip_id existence check not project-scoped. These should be tracked separately.
  4. Queue.claim is unaffected — it already passes keyword lists to update_runs/2.

AI Usage

  • I have used Claude Code
  • I have used another model
  • I have not used AI

Pre-submission checklist

  • I have performed an AI review of my code (we recommend using /review with Claude Code)
  • I have implemented and tested all related authorization policies.
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

update_run/1 was using Ecto.Multi.update_all which silently discards
changeset constraint annotations (foreign_key_constraint, assoc_constraint).
This made constraints declared in Run.complete/2 and Step.finished/2 dead
code. StartRun.call also didn't check changeset.valid? before persisting.

- Rewrite update_run/1 to use Repo.update (honors constraints)
- Restrict update_runs/2 to keyword-list only (Queue bulk path)
- Add valid? check in StartRun.call before calling update_run
- Add assoc_constraint(:output_dataclip) to Step.finished/2
- Extract broadcast_run_updates/1 as shared side-effect function
- Keep Repo.exists? pre-check in CompleteRun for wire format compat
@github-project-automation github-project-automation bot moved this to New Issues in Core Mar 25, 2026
Adds observability to the permissive same-state and catch-all clauses in
validate_state_change/1 before tightening the state machine.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: New Issues

Development

Successfully merging this pull request may close these issues.

1 participant