fix: handle silent schema sync failures in branch-scoped DB middleware #2999
amikofalvy wants to merge 1 commit into main from
Conversation
Schema sync failures during Dolt branch checkout were silently swallowed,
causing queries to run against branches with outdated schemas (missing
columns like is_work_app), producing 500 errors with unhelpful cause: {}
in logs.
- Capture and inspect checkoutBranch result for sync failures
- Retry schema sync up to 2 times on lock contention
- Log explicit errors when sync fails after retries
- Serialize pg Error cause chain properly (non-enumerable properties)
so actual DB errors (code, message, severity) appear in logs
Made-with: Cursor
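The serialization bullet above is the subtle part: standard `Error` fields and pg's extras are non-enumerable, so a naive `JSON.stringify()` of a cause yields `{}`. The following is a minimal sketch of the idea, not the PR's actual `serializeCause()` implementation — only the extracted field list (`code`, `message`, `severity`, `detail`, `hint`) and the recursion bound come from this PR's description:

```ts
// Sketch of the serializeCause() idea: walk err.cause and copy the
// non-enumerable fields explicitly. Depth-bounded to stop runaway chains.
function serializeCause(err: unknown, depth = 0): Record<string, unknown> | undefined {
  if (!(err instanceof Error) || depth >= 3) return undefined;
  const e = err as Error & {
    code?: string; severity?: string; detail?: string; hint?: string; cause?: unknown;
  };
  return {
    message: e.message, // non-enumerable on Error, invisible to JSON.stringify
    code: e.code,
    severity: e.severity,
    detail: e.detail,
    hint: e.hint,
    cause: serializeCause(e.cause, depth + 1),
  };
}

// Simulate a pg-style error: its extra fields are defined non-enumerable.
const dbErr = new Error('column "is_work_app" does not exist');
Object.defineProperty(dbErr, 'code', { value: '42703', enumerable: false });

console.log(JSON.stringify(dbErr));       // {} — the original symptom
console.log(serializeCause(dbErr)?.code); // 42703
```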
TL;DR — Fixes silent 500 errors caused by schema sync failures during Dolt branch checkout. The middleware now inspects the `checkoutBranch` result.

Key changes (2 files, 1 commit):
- Schema sync retry in branch-scoped DB middleware — the retry loop targets a narrow failure mode.
- Proper serialization of database error causes.
Solid fix for a real production pain point — the root cause analysis on non-enumerable pg Error properties is spot-on. Two actionable issues to address before merging: the retry loop fires on unretryable failures (merge conflicts), and dbError is a misleading field name for a generic cause serializer.
Claude Opus
```ts
while (
  checkoutResult.schemaSync.hadDifferences &&
  !checkoutResult.schemaSync.performed &&
  syncAttempts < MAX_SCHEMA_SYNC_RETRIES
) {
  syncAttempts++;
  logger.warn(
    {
      branch: resolvedRef.name,
      error: checkoutResult.schemaSync.error,
      attempt: syncAttempts,
    },
    'Schema sync not performed, retrying'
  );
  await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
  checkoutResult = await checkoutBranch(requestDb)({
    branchName: resolvedRef.name,
    syncSchema: true,
    autoCommitPending: true,
  });
}
```
The retry condition (hadDifferences && !performed) fires on all sync failure modes — lock contention, merge conflicts, uncommitted-changes errors, and generic failures. But the PR description says this targets lock contention specifically. Retrying merge conflicts or "uncommitted changes" errors will fail identically each time, burning 2 retries + backoff for no benefit.
Consider narrowing the retry condition to lock contention only. The SchemaSyncResult already has a skippedDueToLock field, but it's dropped during the mapping to CheckoutBranchResult. Either:
- Forward `skippedDueToLock` to `CheckoutBranchResult` and gate the retry on it, or
- At minimum, only retry when `error` is `undefined` (which currently signals lock contention, since the lock-skip path doesn't set an error string).
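The first option could look like the sketch below. The type and field names (`skippedDueToLock`, `hadDifferences`, `performed`) mirror the review discussion; the real types live in the `checkoutBranch` module, so treat this as illustrative:

```ts
// Hypothetical shape, assuming skippedDueToLock is forwarded from
// SchemaSyncResult to the checkout result.
interface SchemaSyncStatus {
  hadDifferences: boolean;
  performed: boolean;
  skippedDueToLock?: boolean;
  error?: string;
}

const MAX_SCHEMA_SYNC_RETRIES = 2;

// Retry only the transient case (lock contention). Merge conflicts and
// uncommitted-changes errors fail identically on every attempt.
function shouldRetrySync(sync: SchemaSyncStatus, attempts: number): boolean {
  return (
    sync.hadDifferences &&
    !sync.performed &&
    sync.skippedDueToLock === true &&
    attempts < MAX_SCHEMA_SYNC_RETRIES
  );
}

console.log(shouldRetrySync({ hadDifferences: true, performed: false, skippedDueToLock: true }, 0));  // true
console.log(shouldRetrySync({ hadDifferences: true, performed: false, error: 'merge conflict' }, 0)); // false
```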
```ts
logger.warn(
  {
    branch: resolvedRef.name,
    error: checkoutResult.schemaSync.error,
    attempt: syncAttempts,
  },
  'Schema sync not performed, retrying'
);
```
checkoutResult.schemaSync.error is undefined in the lock contention path — SchemaSyncResult.skippedDueToLock is set but not forwarded to CheckoutBranchResult.schemaSync.error. This warning log will show error: undefined, which makes it hard for operators to understand why the retry happened. Consider logging a meaningful string like 'skipped due to lock contention' when the error field is absent, or better yet, forward skippedDueToLock from the result.
```ts
error: err,
message: errorMessage,
stack: errorStack,
...(cause && { dbError: cause }),
```
dbError is a misleading field name — serializeCause handles any error's cause chain (HTTP errors, validation wrappers, etc.), not just database errors. Since this is a new field with no downstream consumers, renaming to cause would be more accurate and future-proof. If you want to avoid collision with Pino's own cause handling, errorCause (already used in AgentSession.ts:1825) would be consistent with existing patterns.
```diff
- ...(cause && { dbError: cause }),
+ ...(cause && { errorCause: cause }),
```
```ts
{
  error: err,
  message: errorMessage,
  ...(cause && { dbError: cause }),
```
Same field naming concern as above.
```diff
- ...(cause && { dbError: cause }),
+ ...(cause && { errorCause: cause }),
```
PR Review Summary
Total Issues: 3 | Risk: Medium

🟠 Major (1)
🟠 1) branchScopedDb.ts:119-128 Schema sync failure allows degraded operation without user feedback
Issue: After exhausting schema sync retries, the middleware logs an error but continues request execution without any indication to the API caller. The user receives a normal response even though subsequent queries may fail unpredictably due to outdated schema (e.g., missing columns like is_work_app).
Why: This creates a debugging nightmare where the symptom (column errors, 500s) is disconnected from the cause (schema sync failure). The log message helps operators but provides no feedback to API consumers. For write operations especially, proceeding with a stale schema is risky since data may be written to incomplete tables.
Fix: Consider one or more of the following:
- Set a response header for observability without breaking the request:

```ts
if (checkoutResult.schemaSync.hadDifferences && !checkoutResult.schemaSync.performed) {
  logger.error(...);
  c.header('X-Schema-Sync-Warning', 'Schema sync failed; queries may use outdated schema');
}
```

- Add span attributes so downstream failures can be correlated via tracing:

```ts
import { trace } from '@opentelemetry/api';
// ...
const span = trace.getActiveSpan();
span?.setAttribute('schema_sync.failed', true);
span?.setAttribute('schema_sync.attempts', syncAttempts + 1);
```

- Fail fast for write operations (stricter option):

```ts
const isWriteMethod = !['GET', 'HEAD', 'OPTIONS'].includes(method);
if (checkoutResult.schemaSync.hadDifferences && !checkoutResult.schemaSync.performed && isWriteMethod) {
  throw new HTTPException(503, { message: 'Schema sync unavailable; please retry' });
}
```
🟡 Minor (1)
Inline Comments:
- 🟡 Minor: `branchScopedDb.ts:111` — Retry backoff lacks jitter; add randomization to prevent a thundering herd
💭 Consider (1)
💭 1) errorHandler.ts:94-104 Original error object no longer logged directly
Issue: The previous implementation logged error: err directly. The new implementation extracts specific fields (message, stack, dbError) but no longer includes the full error object, which could lose properties that aren't explicitly extracted (e.g., custom properties from third-party libraries like HTTP client errors with response, config fields).
Why: While pg errors are now better handled, other error types may lose context. Low impact since the critical pg properties are now captured.
Fix: Consider keeping the original error object for non-pg errors, or accept this tradeoff since the extracted fields cover the most common debugging needs.
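One hedged middle ground, if keeping third-party context matters: spread the error (which preserves enumerable custom properties) and copy the non-enumerable standard fields explicitly. `toLoggable` is a hypothetical helper sketch, not code from this PR:

```ts
// Enumerable extras (e.g. an HTTP client error's status) survive the
// spread; message and stack are non-enumerable and must be copied by hand.
function toLoggable(err: Error): Record<string, unknown> {
  return {
    ...err,
    message: err.message,
    stack: err.stack,
  };
}

const httpErr = Object.assign(new Error('request failed'), { status: 502 });
const logged = toLoggable(httpErr);
console.log(logged.status, logged.message); // 502 'request failed'
```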
🧹 While You're Here (1)
🧹 1) branchScopedDb.ts:63-70 Silent fallback to manageDbClient masks infrastructure problems
Issue: When getPoolFromClient() returns null (line 65), the code logs an error but falls back to using manageDbClient directly and continues processing. This masks a fundamental infrastructure problem: the database client doesn't have the expected structure. Branch scoping will not work correctly.
Why: If pool extraction fails in production, every subsequent request will silently bypass branch scoping, potentially causing data isolation violations. This is pre-existing behavior, not introduced by this PR, but worth noting since you're already improving error handling in this file.
Fix: Consider throwing a 503 instead of silently proceeding:

```ts
if (!pool) {
  logger.error({}, 'Could not get connection pool from dbClient - branch scoping unavailable');
  throw new HTTPException(503, { message: 'Database connection pool unavailable' });
}
```
💡 APPROVE WITH SUGGESTIONS
Summary: This PR effectively addresses a real operational pain point — silent schema sync failures and invisible pg errors were causing 500s that were difficult to diagnose. The retry logic and serializeCause() helper are solid improvements. The main suggestion is to consider adding some form of feedback (header, span attribute, or controlled failure) when schema sync fails so that downstream errors can be correlated with the root cause. The jitter suggestion is a quick win for production reliability.
Discarded (1)

| Location | Issue | Reason Discarded |
|---|---|---|
| errorHandler.ts:54-78 | serializeCause has bounded recursion depth (positive observation) | Not an issue — this was a validation that the implementation correctly bounds recursion at depth 3 to prevent stack overflow |
Reviewers (3)

| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-errors | 3 | 1 | 1 | 1 | 0 | 0 | 0 |
| pr-review-sre | 4 | 0 | 0 | 0 | 1 | 0 | 1 |
| pr-review-standards | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 7 | 1 | 1 | 1 | 1 | 0 | 1 |
Note: The schema sync failure findings from pr-review-errors and pr-review-sre were merged, as they identified the same issue from different angles.
```ts
  },
  'Schema sync not performed, retrying'
);
await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
```
🟡 Minor: Retry backoff lacks jitter
Issue: The retry delay uses linear backoff (50 * syncAttempts) without jitter. When multiple concurrent requests hit lock contention simultaneously, they'll all retry at the same intervals (50ms, 100ms), creating synchronized retry waves that can perpetuate contention.
Why: The codebase already uses jitter in similar patterns (withRetry.ts uses Math.random() * expDelay). Lock contention scenarios are particularly susceptible to thundering herd problems when retries are synchronized.
Fix:
```diff
- await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts));
+ await new Promise((resolve) => setTimeout(resolve, 50 * syncAttempts * (1 + Math.random() * 0.5)));
```
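The suggested jitter math can also be factored into a tiny helper for testability. This is a sketch of the same linear-backoff-plus-jitter calculation; `backoffDelayMs` is a hypothetical name, not an existing function in the codebase:

```ts
// Attempt n waits uniformly in [50n, 75n) ms, so concurrent retries spread
// out instead of re-colliding on the schema-sync lock in synchronized waves.
function backoffDelayMs(attempt: number, baseMs = 50): number {
  return baseMs * attempt * (1 + Math.random() * 0.5);
}

// Usage: await new Promise((r) => setTimeout(r, backoffDelayMs(syncAttempts)));
```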
Ito Test Report ✅

16 test cases ran: 12 passed, 4 additional findings. The end-to-end/local API tests covered baseline routed reads (Agents/Triggers/Settings), branch list/create with duplicate/conflict handling, malformed-input rejection, deletion safety, stale deep-link handling, auth-boundary checks, forced-503 sanitization, and post-stress recovery; all behaved as expected, with no observed data leakage or broad stability regressions.

Key defects:
- High severity: Project Settings save blocker caused by UI edit-form ID regex validation on a disabled field (also invalidating two-tab overlap expectations).
- High severity: concurrent branch-scoped PATCH bursts return 500s when schema-sync lock retries are exhausted and requests proceed unsynced.
- Medium severity: mobile scheduled-trigger creation fails with a 500 when invalid browser timezone defaults are not validated or caught.

✅ Passed (12) | ℹ️ Additional Findings (4)
Summary

- `branchScopedDbMiddleware` never inspected the `checkoutBranch` result. When sync failed (lock contention, merge conflicts from a recent DROP TABLE migration), queries ran against branches with outdated schemas (missing columns like `is_work_app`), producing 500 errors.
- pg Error properties (`message`, `code`, `severity`) are non-enumerable, so `JSON.stringify()` produced `cause: {}` in logs, hiding the actual database error.
- Added `serializeCause()` to the error handler, which traverses the error cause chain and extracts the non-enumerable pg Error properties (`code`, `message`, `severity`, `detail`, `hint`).

Operational note

Existing branches that are already out of sync need a one-time manual fix.

Test plan

- Typecheck passes (`tsc --noEmit`)
- `dbError` field in error logs now contains actual pg error details instead of `{}`
- Ran `pnpm db:manage:sync-all-branches` against production DoltgreSQL

Made with Cursor