switchrpc: add DeleteAttempts RPC #10602
ziggie1984 merged 4 commits into lightningnetwork:elle-base-branch-payment-service
Conversation
gemini-code-assist left a comment
This pull request introduces a valuable DeleteAttempts RPC to switchrpc for garbage collecting terminal HTLC attempts, which is a great feature for external routers. The accompanying overhaul of the SendOnion RPC to provide more robust, structured error handling is also a significant improvement. The code is well-structured and includes thorough testing for the new functionality.
My main concern is an inconsistency in the DeleteAttempts API regarding the DELETION_ALREADY_DELETED status. The API documentation and enums define this status, but the underlying implementation currently returns DELETION_NOT_FOUND for already-deleted attempts. This could lead to confusion for clients. I've provided detailed comments with suggestions on how to address this.
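One way to resolve the inconsistency would be for the store to report an explicit already-deleted status rather than collapsing that case into not-found. A minimal sketch of the idea, with hypothetical names (`classifyDelete`, tombstone tracking) that are not part of the PR:

```go
package main

import "fmt"

// DeletionStatus mirrors the kind of per-ID result discussed in the
// review. Exact values in the PR may differ; illustration only.
type DeletionStatus int

const (
	DeletionOK DeletionStatus = iota
	DeletionPending
	DeletionFailed
	DeletionNotFound
	DeletionAlreadyDeleted // hypothetical: ID was deleted earlier
)

// classifyDelete picks the status for one attempt ID. A store that
// keeps tombstones for deleted IDs can report DeletionAlreadyDeleted
// instead of collapsing that case into DeletionNotFound.
func classifyDelete(exists, terminal, tombstoned bool) DeletionStatus {
	switch {
	case exists && terminal:
		return DeletionOK
	case exists:
		return DeletionPending
	case tombstoned:
		return DeletionAlreadyDeleted
	default:
		return DeletionNotFound
	}
}

func main() {
	fmt.Println(classifyDelete(false, false, true))  // 4: already deleted
	fmt.Println(classifyDelete(false, false, false)) // 3: not found
}
```

The cost is that the store must remember something about deleted IDs (the tombstone), which the current implementation deliberately avoids.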
Force-pushed from d460d84 to 27c2b33
bitromortac left a comment
The concept looks good to me 👍
Force-pushed from 27c2b33 to 53fa0ca
I'll fix the lint job complaining about unnecessary use of
```go
// record was fully removed.
err = store.InitAttempt(id)
require.NoError(t, err)
})
```
nit: You could add a test case for an unreadable message
Added a test case which writes corrupt bytes directly to the store and verifies DeletionFailed is returned.
Force-pushed from 53fa0ca to b03978b
@ziggie1984: review reminder
The decision to skip deletion of an undeserializable entry deserves a second look. The deserialization here exists solely to check whether the result is terminal. A corrupt entry is not a safeguard: when the client retries `SendOnion` with that ID, `InitAttempt` will keep returning `ErrAmbiguousAttemptInit`. By contrast, the intended pending-state guard protects genuinely in-flight attempts. Deleting the corrupt entry and logging a warning would restore the ability for the client to make progress. The entry cannot deliver a result to any subscriber (deserialization would fail there too), so keeping it only permanently poisons the attempt ID.
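A toy model of the delete-and-warn alternative suggested here (an in-memory map stands in for the bbolt-backed store; names are hypothetical):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var errCorrupt = errors.New("unable to deserialize network result")

// store is a toy stand-in for the result store: attempt ID -> raw
// bytes, where empty bytes simulate a corrupt record.
type store map[uint64][]byte

func decode(b []byte) error {
	if len(b) == 0 {
		return errCorrupt
	}
	return nil
}

// deleteAttempt removes the entry even when it cannot be deserialized:
// a corrupt entry can never deliver a result to a subscriber, so
// keeping it only poisons the attempt ID. It logs a warning rather
// than reporting a failed deletion. (Sketch of the suggestion above,
// not the PR's actual behavior.)
func (s store) deleteAttempt(id uint64) (deleted bool) {
	raw, ok := s[id]
	if !ok {
		return false
	}
	if err := decode(raw); err != nil {
		log.Printf("warning: deleting corrupt result for "+
			"attempt %d: %v", id, err)
	}
	delete(s, id)
	return true
}

func main() {
	s := store{7: nil}              // corrupt entry
	fmt.Println(s.deleteAttempt(7)) // true: removed despite corruption
	fmt.Println(s.deleteAttempt(7)) // false: already gone
}
```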
ziggie1984 left a comment
Looking good Calvin 🙌
had some minor questions and comments.
```go
		"for attempt %d during delete: %v",
		id, err)

results[id] = DeletionFailed
```
The deserialization failure scenario you describe is real, but I don't think the "retry loop" applies under the expected API usage:
1. `SendOnion` → get ACK (idempotency record created)
2. `TrackOnion` → get SETTLE/FAIL (record final result in client DB)
3. `DeleteAttempts` → clean up (record + duplicate prevention removed)
A well-behaved client must never call DeleteAttempts for an ID it hasn't fully finalized. Such a client shouldn't need to call SendOnion again for that ID, so I don't think there'd be a perpetual loop with ErrAmbiguousAttemptInit.
CleanStore (which the ChannelRouter calls at startup) already deletes corrupt entries without checking their state since it operates purely on keys. So DeletionFailed is actually more cautious than normal lnd flow in this regard. I could go either way on this — happy to just delete and log a warning if you'd prefer, or keep DeletionFailed for now.
The concern about InitAttempt returning ErrAmbiguousAttemptInit indefinitely for a corrupt ID is real but is more of a SendOnion error classification problem (distinguishing transient DB failures from corruption) than a DeleteAttempts design question. Worth a follow-up that improves error classification across the store operations.
```go
// simulate real idempotency behavior. Unlike the basic mockAttemptStore, this
// mock maintains state across calls, making it suitable for testing multi-step
// interactions between SendOnion and DeleteAttempts.
type statefulAttemptStore struct {
```
The statefulAttemptStore mock diverges from real store behaviour in a meaningful way: it treats every initialized attempt as immediately deletable (DeletionOK), whereas the real networkResultStore returns DeletionPending for any attempt that is still in-flight (stored with pendingHtlcMsgType). This means TestSendOnionAfterDelete would pass even if the code incorrectly allowed deletion of live in-flight attempts.
Consider dropping statefulAttemptStore entirely and wiring the test against the real networkResultStore backed by an in-memory bbolt database — the same pattern already used in htlcswitch/payment_result_test.go (e.g. newTestNetworkResultStore). That would give the test genuine fidelity: a just-initialized (in-flight) attempt would correctly block deletion (DeletionPending), and only after a terminal result is stored would DeleteAttempts return DeletionOK — exactly the property this test intends to prove.
The real networkResultStore (and networkResult) are unexported from htlcswitch so we can't wire the test against the real store from the switchrpc package, at least at the moment. I think you're probably right that we're testing the mock a bit too much here, so I removed statefulAttemptStore and TestSendOnionAfterDelete and replaced them with a property-based test (TestDeleteAttemptsProperties) in htlcswitch/payment_result_test.go that runs against the real bbolt-backed store. Got the idea here.
```go
seen := make(map[uint64]struct{}, len(req.AttemptIds))
for _, id := range req.AttemptIds {
	if _, ok := seen[id]; ok {
		return nil, status.Errorf(codes.InvalidArgument,
```
Why do we want to be so strict though? Maybe just ignore the redundant ID?
So we are actually deduplicating now at two different call sites? Seems redundant?
Good point on both counts. The strictness can help alert a buggy caller, which should not be including duplicates. I changed the RPC layer to silently deduplicate instead of rejecting — duplicates are collapsed before reaching the store. I think the store-level duplicate check should stay as a safety net, since the per-ID mutex will deadlock if the same ID is locked twice on the same goroutine. RPC callers won't hit it, but if DeleteAttempts is ever called internally without the RPC layer, it'll be nice to have the check in there.
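The RPC-layer dedup described here could look like this order-preserving helper (illustrative only; the PR's actual code may be shaped differently):

```go
package main

import "fmt"

// dedupIDs collapses duplicate attempt IDs while preserving first-seen
// order, so the store only ever sees each ID once and its per-ID
// mutexes cannot be double-locked by one request.
func dedupIDs(ids []uint64) []uint64 {
	seen := make(map[uint64]struct{}, len(ids))
	out := make([]uint64, 0, len(ids))
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue // silently drop the redundant ID
		}
		seen[id] = struct{}{}
		out = append(out, id)
	}
	return out
}

func main() {
	fmt.Println(dedupIDs([]uint64{3, 1, 3, 2, 1})) // [3 1 2]
}
```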
```go
	seen[id] = struct{}{}
}

storeResults, err := s.cfg.AttemptStore.DeleteAttempts(req.AttemptIds)
```
maybe this function already takes a map ?
I think the map could help with duplicates, but the benefit may be somewhat limited because the store needs ordered iteration for lock acquisition (slices.Sort before locking) so concurrent callers don't cause potential deadlock. Since maps have random iteration order, we'd have to extract keys into a slice and sort anyway. The RPC layer now handles dedup less strictly like you mentioned so I think we're probably okay here.
```go
// TestDeletionStatusToProto verifies the mapping from internal DeletionStatus
// values to their proto enum counterparts.
func TestDeletionStatusToProto(t *testing.T) {
```
TestDeletionStatusToProto does not enforce exhaustiveness: if a new DeletionStatus value is added to the iota without a matching proto entry and test case, the test still passes because the default branch silently swallows unknown values.
Consider adding a sentinel numDeletionStatuses at the end of the iota and asserting len(testCases)-1 == int(htlcswitch.numDeletionStatuses) at the top of the test (the -1 accounts for the existing unknown/default test case). That way any new internal status value forces a test update, and by extension a proto and mapping update too.
Is it okay if it doesn't enforce exhaustiveness? The test covers all 4 statuses plus the unknown-value default. That's an interesting idea re the sentinel, but do you think it's worth exporting a variable from htlcswitch package for this? It also may not as readily apply if the mapping from internal to external states isn't 1:1 like seen with convertPaymentStatus in lnrpc/routerrpc package.
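For reference, a minimal version of the sentinel idea discussed above (names hypothetical; whether exporting such a value from `htlcswitch` is worth the cost is exactly the open question):

```go
package main

import "fmt"

type DeletionStatus int

// Keeping a sentinel as the last value of the iota lets a test assert
// that every internal status has a corresponding proto mapping entry.
const (
	DeletionOK DeletionStatus = iota
	DeletionPending
	DeletionFailed
	DeletionNotFound
	numDeletionStatuses // sentinel: must stay last
)

// toProto models the internal-to-proto mapping under test. Adding a
// new status to the iota without extending this map breaks the
// exhaustiveness check below.
var toProto = map[DeletionStatus]string{
	DeletionOK:       "DELETION_OK",
	DeletionPending:  "DELETION_PENDING",
	DeletionFailed:   "DELETION_FAILED",
	DeletionNotFound: "DELETION_NOT_FOUND",
}

func main() {
	fmt.Println(len(toProto) == int(numDeletionStatuses)) // true
}
```

As noted in the reply, this only works cleanly when the internal-to-external mapping is 1:1; a many-to-one mapping would need a different check.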
b03978b to
716a3c2
Compare
needs a rebase
Add a "positive space" deletion primitive for the network result store. Unlike CleanStore (which deletes everything except a keep-set and is vulnerable to stale-snapshot races), DeleteAttempts accepts an explicit list of attempt IDs to remove. The implementation acquires per-ID mutexes, checks each record's state within a single kvdb transaction, and only deletes terminal (settled or failed) results. Pending attempts are protected by a server-side guard. Per-ID results are returned so callers get actionable feedback: OK, Pending, Failed, or NotFound.
Expose the store-layer DeleteAttempts as a gRPC endpoint on the Switch sub-server, allowing remote callers to garbage-collect terminal attempt records by explicitly naming the finished IDs. A unit test with a stateful mock demonstrates that hard delete removes the idempotency protection, allowing re-dispatch with the same ID.
Extend testSendOnionTwice to call DeleteAttempts after the payment settles, verifying that removing a terminal attempt returns DELETION_OK, and a subsequent delete of the same ID returns DELETION_NOT_FOUND.
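A simplified model of the state check the first commit message describes (a toy in-memory map; the real store runs this inside a single kvdb transaction after acquiring per-ID mutexes):

```go
package main

import "fmt"

type state int

const (
	statePending state = iota // attempt still in flight
	stateSettled
	stateFailed
)

type DeletionStatus int

const (
	DeletionOK DeletionStatus = iota
	DeletionPending
	DeletionNotFound
)

// deleteAttempts models the per-ID decision: only terminal (settled
// or failed) results are removed; pending attempts are left intact
// and reported as such, and unknown IDs report not-found.
func deleteAttempts(store map[uint64]state,
	ids []uint64) map[uint64]DeletionStatus {

	results := make(map[uint64]DeletionStatus, len(ids))
	for _, id := range ids {
		st, ok := store[id]
		switch {
		case !ok:
			results[id] = DeletionNotFound
		case st == statePending:
			results[id] = DeletionPending
		default:
			delete(store, id)
			results[id] = DeletionOK
		}
	}
	return results
}

func main() {
	store := map[uint64]state{1: stateSettled, 2: statePending}
	res := deleteAttempts(store, []uint64{1, 2, 3})
	fmt.Println(res[1] == DeletionOK, res[2] == DeletionPending,
		res[3] == DeletionNotFound) // true true true
}
```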
716a3c2 to
040471f
Compare
Merged 4ab70be into lightningnetwork:elle-base-branch-payment-service
Change Description
Add a `DeleteAttempts` RPC to the `switchrpc` sub-server, enabling remote clients to perform safe, online garbage collection of terminal HTLC attempt records from the `AttemptStore`. This uses a more explicit approach to deletion: the client names the specific finished attempts to delete.

Unlike the existing `CleanStore` (which deletes everything except a keep-set), `DeleteAttempts` is immune to stale-snapshot race conditions and does not require a disruptive quiescent state, making it suitable for use when a remote router manages payments.
If a particular request to delete a batch of attempts fails entirely or partially, the client can simply retry later. They do not need to worry about conducting the deletion in parallel with processing HTLC attempts for new payment requests.
A per-ID deletion status (`OK`, `PENDING`, `FAILED`, `NOT_FOUND`) is returned in the response. The `ChannelRouter`'s cleanup logic is not modified. This is a new primitive exposed only via `switchrpc`. An RPC client looking to use this endpoint will have to track the appropriate state necessary to determine which HTLC attempts have been completed (settled/failed) but which are yet to be deleted.

Addresses: #10519
Replaces (favored over): #10484
Future Addition