Defer ChainMonitor updates and persistence to flush()#4351
Defer ChainMonitor updates and persistence to flush()#4351joostjager wants to merge 3 commits intolightningdevkit:mainfrom
Conversation
|
👋 Thanks for assigning @TheBlueMatt as a reviewer! |
|
Closing this PR as #4345 seems to be the easiest way to go |
1f5cef4 to
30d05ca
Compare
|
The single commit was split into three: extracting internal methods, adding a deferred toggle, and implementing the deferral and flushing logic. flush() now delegates to the extracted internal methods rather than reimplementing persist/insert logic inline. Deferred mode is opt-in via a deferred bool rather than always-on. Test infrastructure was expanded with deferred-mode helpers and dedicated unit tests. |
f5d8c70 to
2815bf9
Compare
Pure refactor: move the bodies of Watch::watch_channel and Watch::update_channel into methods on ChainMonitor, and have the Watch trait methods delegate to them. This prepares for adding deferred mode where the Watch methods will conditionally queue operations instead of executing them immediately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a `deferred` parameter to `ChainMonitor::new` and `ChainMonitor::new_async_beta`. When set to true, the Watch trait methods (watch_channel and update_channel) will unimplemented!() for now. All existing callers pass false to preserve current behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2815bf9 to
3eb5644
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4351 +/- ##
==========================================
+ Coverage 86.06% 86.11% +0.05%
==========================================
Files 156 156
Lines 103188 103928 +740
Branches 103188 103928 +740
==========================================
+ Hits 88808 89497 +689
- Misses 11868 11907 +39
- Partials 2512 2524 +12
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
f964466 to
b140bf9
Compare
|
This PR is now ready for review. LDK-node counterpart: lightningdevkit/ldk-node#782 |
| Ok(ChannelMonitorUpdateStatus::Completed) | ||
| } | ||
|
|
||
| fn watch_channel_internal( |
There was a problem hiding this comment.
lol was it really faster to ask claude to do this than to just move the code? :p
There was a problem hiding this comment.
I don't like cursor keys anymore 😂
| pub fn new( | ||
| chain_source: Option<C>, broadcaster: T, logger: L, feeest: F, persister: P, | ||
| _entropy_source: ES, _our_peerstorage_encryption_key: PeerStorageKey, | ||
| _entropy_source: ES, _our_peerstorage_encryption_key: PeerStorageKey, deferred: bool, |
There was a problem hiding this comment.
This needs comprehensive documentation (copied on the new_async_beta constructor as well).
lightning/src/chain/chainmonitor.rs
Outdated
| deferred: bool, | ||
| /// Queued monitor operations awaiting flush. Unused when `deferred` is `false`. | ||
| /// | ||
| /// # Locking order with `monitors` |
There was a problem hiding this comment.
IMO lock order constraints should be written in code and checked, not in english :)
There was a problem hiding this comment.
You mean remove as we did with the giant lock tree doc? Or is there some more explicit way to write it in code other than just locking? Have to say that the lock debugger indeed helped with getting it right.
| } | ||
| } | ||
|
|
||
| /// Returns the number of pending monitor operations queued for later execution. |
There was a problem hiding this comment.
Needs description of how to use it.
lightning/src/chain/chainmonitor.rs
Outdated
| self.pending_ops.lock().unwrap().len() | ||
| } | ||
|
|
||
| /// Flushes up to `count` pending monitor operations. |
There was a problem hiding this comment.
Needs description of how to use it, including reference to pending_operation_count.
lightning/src/chain/chainmonitor.rs
Outdated
| /// remaining operations are left in the queue for the next call. | ||
| pub fn flush(&self, count: usize, logger: &L) -> Result<(), ()> { | ||
| if count > 0 { | ||
| log_trace!(logger, "Flushing up to {} monitor operations", count); |
There was a problem hiding this comment.
More than trace imo, maybe info even?
There was a problem hiding this comment.
Made it info, probably good to see that things are eventually persisted.
lightning/src/chain/chainmonitor.rs
Outdated
| }, | ||
| Err(()) => { | ||
| drop(queue); | ||
| // The monitor is consumed and cannot be retried. |
There was a problem hiding this comment.
should be unreachable, right?
There was a problem hiding this comment.
yes. want to panic?
in the original code, this did return an error to the caller of watch_channel
There was a problem hiding this comment.
This is indeed already checked before queueing. Made it unreachable. Could also remove the flush return value.
lightning/src/chain/chainmonitor.rs
Outdated
| }, | ||
| ChannelMonitorUpdateStatus::InProgress => {}, | ||
| ChannelMonitorUpdateStatus::UnrecoverableError => { | ||
| panic!("UnrecoverableError during monitor operation"); |
There was a problem hiding this comment.
This should be unreachable, right?
There was a problem hiding this comment.
This one is indeed truly unreachable. Made it unreachable.
|
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer. |
Replace the unimplemented!() stubs with a full deferred write implementation. When ChainMonitor has deferred=true, Watch trait operations queue PendingMonitorOp entries instead of executing immediately. A new flush() method drains the queue and forwards operations to the internal watch/update methods, calling channel_monitor_updated on Completed status. The BackgroundProcessor is updated to capture pending_operation_count before persisting the ChannelManager, then flush that many writes afterward - ensuring monitor writes happen in the correct order relative to manager persistence. Key changes: - Add PendingMonitorOp enum and pending_ops queue to ChainMonitor - Implement flush() and pending_operation_count() public methods - Integrate flush calls in BackgroundProcessor (both sync and async) - Add TestChainMonitor::new_deferred, flush helpers, and auto-flush in release_pending_monitor_events for test compatibility - Add create_node_cfgs_deferred for deferred-mode test networks - Add unit tests for queue/flush mechanics and full payment flow Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
b140bf9 to
04051fb
Compare
Summary
Modify
ChainMonitorinternally to queuewatch_channelandupdate_channeloperations, returningInProgressuntilflush()is called. This enables persistence of monitor updates afterChannelManagerpersistence, ensuring correct ordering where theChannelManagerstate is never ahead of the monitor state on restart. The new behavior is opt-in via adeferredswitch.Key changes:
ChainMonitorgains adeferredswitch to enable the new queuing behaviorInProgressflush()applies pending operations and persists monitorsChannelManagerpersistence, then flush after persistence completesPerformance Impact
Multi-channel, multi-node load testing (using ldk-server chaos branch) shows no measurable throughput difference between deferred and direct persistence modes.
This is likely because forwarding and payment processing are already effectively single-threaded: the background processor batches all forwards for the entire node in a single pass, so the deferral overhead doesn't add any meaningful bottleneck to an already serialized path.
For high-latency storage (e.g., remote databases), there is also currently no significant impact because channel manager persistence already blocks event handling in the background processor loop (test). If the loop were parallelized to process events concurrently with persistence, deferred writing would become comparatively slower since it moves the channel manager round trip into the critical path. However, deferred writing would also benefit from loop parallelization, and could be further optimized by batching the monitor and manager writes into a single round trip.
Alternative Designs Considered
Several approaches were explored to solve the monitor/manager persistence ordering problem:
1. Queue at KVStore level (#4310)
Introduces a
QueuedKVStoreSyncwrapper that queues all writes in memory, committing them in a single batch at chokepoints where data leaves the system (get_and_clear_pending_msg_events,get_and_clear_pending_events). This approach aims for true atomic multi-key writes but requires KVStore backends that support transactions (e.g., SQLite); filesystem backends cannot achieve full atomicity.Trade-offs: Most general solution but requires changes to persistence boundaries and cannot fully close the desync gap with filesystem storage.
2. Queue at Persister level (#4317)
Updates
MonitorUpdatingPersisterto queue persist operations in memory, with actual writes happening onflush(). Addsflush()to thePersisttrait andChainMonitor.Trade-offs: Only fixes the issue for
MonitorUpdatingPersister; customPersistimplementations remain vulnerable to the race condition.3. Queue at ChainMonitor wrapper level (#4345)
Introduces
DeferredChainMonitor, a wrapper aroundChainMonitorthat implements the queue in a separate wrapper layer. AllChainMonitortraits (Listen,Confirm,EventsProvider, etc.) are passed through, allowing drop-in replacement.Trade-offs: Requires re-implementing all trait pass-throughs on the wrapper. Keeps the core
ChainMonitorunchanged but adds an external layer of indirection.