Skip to content

switchrpc: add DeleteAttempts RPC#10602

Merged
ziggie1984 merged 4 commits intolightningnetwork:elle-base-branch-payment-servicefrom
calvinrzachman:switchrpc-delete-attempts
Mar 19, 2026
Merged

switchrpc: add DeleteAttempts RPC#10602
ziggie1984 merged 4 commits intolightningnetwork:elle-base-branch-payment-servicefrom
calvinrzachman:switchrpc-delete-attempts

Conversation

@calvinrzachman
Copy link
Copy Markdown
Collaborator

@calvinrzachman calvinrzachman commented Feb 23, 2026

Change Description

Add a DeleteAttempts RPC to the switchrpc sub-server, enabling remote clients to perform safe, online garbage collection of terminal HTLC attempt records from the AttemptStore.

This uses a more explicit approach to deletion: the client names the specific finished attempts to delete. Unlike the existing CleanStore (which deletes everything except a keep-set), DeleteAttempts is immune to stale-snapshot race
conditions and does not require a disruptive quiescent state — making it suitable for use when a remote router manages payments.

If a particular request to delete a batch of attempts fails entirely or partially, the client can just try again later. They do not need to worry about conducting the deletion in parallel with processing htlc attempts for new payment requests.

  • Each attempt ID gets an independent status result (OK, PENDING, FAILED NOT_FOUND) in the response.
  • The server refuses to delete in-flight attempts, even if the client requests it.
  • The internal ChannelRouter's cleanup logic is not modified. This is a new primitive exposed only via switchrpc. An RPC client looking to use this endpoint will have to track the appropriate state necessary to determine which HTLC attempts have been completed (settled/failed) but which are yet to be deleted.

Addresses: #10519
Replaces (favored over): #10484

Future Addition

  • Tombstone-based deletion: Deleted records can be replaced with a lightweight sentinel record that preserves idempotency protection (preventing duplicate HTLC dispatch from stale requests). Tombstones can be swept automatically on server restart. This is an extra layer of protection against client mis-use.

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @calvinrzachman, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial new RPC, DeleteAttempts, to the switchrpc interface, empowering external routing clients to perform explicit and safe garbage collection of completed HTLC attempt records. This enhancement addresses the challenges of managing payment state in distributed systems by providing a more robust and race-condition-resistant method for cleaning up terminal payment attempts, contrasting with the less explicit CleanStore approach. It also builds upon recent changes to the SendOnion RPC, which now provides richer, structured error details, ensuring a more reliable and predictable interaction model for clients managing payment lifecycles.

Highlights

  • New RPC: DeleteAttempts: A new DeleteAttempts RPC has been added to the switchrpc sub-server. This RPC allows remote clients to explicitly delete terminal (settled or failed) HTLC attempt records from the AttemptStore.
  • Safe Garbage Collection: This new deletion mechanism provides a safer, 'positive space' alternative to the existing CleanStore method. It is immune to stale-snapshot race conditions and does not require a disruptive quiescent state, making it suitable for distributed systems.
  • Granular Deletion Status: The DeleteAttempts RPC returns per-ID results for each requested attempt, indicating its deletion status (e.g., OK, PENDING, NOT_FOUND, ALREADY_DELETED). In-flight attempts are explicitly refused deletion.
  • SendOnion RPC Overhaul (Context): The SendOnion RPC (overhauled in a prior PR) now uses gRPC status codes for outcomes. Failures include structured SendOnionFailureDetails to classify them as DefiniteFailure (safe to fail and retry with new route) or IndefiniteFailure (client MUST retry same request).

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • docs/release-notes/release-notes-0.21.0.md
    • Documented the overhaul of the switchrpc.SendOnion RPC, detailing its new client-friendly API and breaking changes.
    • Added an entry for the new switchrpc.DeleteAttempts RPC, explaining its purpose for cleaning up terminal attempt records and its per-ID result reporting.
  • htlcswitch/interfaces.go
    • Added the DeleteAttempts method to the AttemptStore interface, allowing for explicit removal of terminal attempt results.
  • htlcswitch/payment_result.go
    • Defined a new DeletionStatus enum to represent the outcome of an attempt deletion.
    • Implemented the DeleteAttempts method within networkResultStore, including logic for handling pending attempts, duplicate IDs, and database interactions.
  • htlcswitch/payment_result_test.go
    • Added a test case to verify subscriber notification after FailPendingAttempt.
    • Introduced comprehensive tests for the DeleteAttempts method, covering various scenarios like empty ID lists, successful deletions, refusal to delete pending attempts, duplicate ID rejection, unknown IDs, mixed batches, idempotent re-deletion, and re-initialization after deletion.
  • itest/lnd_sendonion_test.go
    • Removed unused context and rpc imports.
    • Updated alice.RPC.SendOnion calls to reflect the new API, which now returns an error directly instead of a SendOnionResponse struct.
    • Added a test case within testSendOnionTwice to demonstrate the usage and functionality of the new DeleteAttempts RPC after a settled HTLC.
  • lnrpc/switchrpc/mock.go
    • Added sync import.
    • Extended mockAttemptStore with deleteErr and deleteResp fields to mock DeleteAttempts behavior.
    • Introduced statefulAttemptStore to simulate idempotency and deletion behavior more realistically for testing purposes.
  • lnrpc/switchrpc/switch.pb.go
    • Replaced ErrorCode enum with IndefiniteFailureCode and AttemptDeletionStatus enums.
    • Modified SendOnionResponse to be an empty message, aligning with the new gRPC status-based error handling.
    • Added new message types: SendOnionFailureDetails, DefiniteFailure, IndefiniteFailure, DeleteAttemptsRequest, DeleteAttemptsResponse, and AttemptDeletionResult to support the new RPC and detailed error reporting.
    • Updated proto message and enum type registrations to reflect the new definitions.
  • lnrpc/switchrpc/switch.pb.gw.go
    • Added gRPC-Gateway request and local request functions for the new DeleteAttempts RPC.
    • Registered DeleteAttempts handler with the gRPC-Gateway server and client.
  • lnrpc/switchrpc/switch.proto
    • Updated the description for the SendOnion RPC to clarify the new gRPC status-based outcome handling and the inclusion of SendOnionFailureDetails.
    • Defined the new DeleteAttempts RPC, including its request and response messages (DeleteAttemptsRequest, DeleteAttemptsResponse).
    • Added new message types: SendOnionFailureDetails, DefiniteFailure, IndefiniteFailure, and their associated enums (IndefiniteFailureCode, AttemptDeletionStatus) for richer error reporting and deletion outcomes.
  • lnrpc/switchrpc/switch.swagger.json
    • Added Swagger definitions for the new /v2/switch/deleteattempts endpoint, including its summary, description, and request/response schemas.
    • Updated the Swagger description for /v2/switch/onion to reflect the new error handling and SendOnionFailureDetails.
    • Added new schema definitions for switchrpcAttemptDeletionResult, switchrpcAttemptDeletionStatus, switchrpcDeleteAttemptsRequest, switchrpcDeleteAttemptsResponse, and removed switchrpcErrorCode.
  • lnrpc/switchrpc/switch.yaml
    • Added a new HTTP rule for the DeleteAttempts RPC, mapping it to a POST request at /v2/switch/deleteattempts.
  • lnrpc/switchrpc/switch_grpc.pb.go
    • Updated the SendOnion RPC signature in the SwitchClient and SwitchServer interfaces to reflect the new empty SendOnionResponse and gRPC status-based error handling.
    • Added the DeleteAttempts RPC to the SwitchClient and SwitchServer interfaces.
    • Implemented the DeleteAttempts method in switchClient and UnimplementedSwitchServer.
    • Registered the DeleteAttempts handler in Switch_ServiceDesc.
  • lnrpc/switchrpc/switch_server.go
    • Removed unused encoding/hex import.
    • Added DeleteAttempts RPC to the macaroonPermissions map with offchain:write permissions.
    • Updated the SendOnion RPC implementation to use the new marshallDispatchFailure function for error handling, providing structured SendOnionFailureDetails.
    • Implemented the DeleteAttempts RPC, including input validation, calling the AttemptStore, and converting internal deletion statuses to proto responses.
    • Introduced deletionStatusToProto helper for converting internal deletion statuses.
    • Added marshallDispatchFailure function to translate internal HTLC switch errors into gRPC status errors with rich details.
    • Added GetSendOnionFailureDetails helper to extract structured failure details from gRPC errors.
  • lnrpc/switchrpc/switch_server_test.go
    • Removed unused encoding/hex import.
    • Updated TestSendOnion to reflect the new SendOnion RPC API, including checking for gRPC status codes and using checkFailureDetails for structured error validation.
    • Added requireSendOnionFailureDetails helper for asserting SendOnion failure details in tests.
    • Introduced TestDeleteAttempts to verify the DeleteAttempts RPC handler's behavior, including mixed results, empty requests, duplicate IDs, and store errors.
    • Added TestDeletionStatusToProto to test the conversion of internal deletion statuses to proto enums.
    • Added TestSendOnionAfterDelete to demonstrate that deleting a terminal attempt record removes idempotency protection.
    • Updated TestMarshallDispatchFailure to test the new error marshalling logic for SendOnion.
  • lntest/rpc/switch.go
    • Modified the SendOnion helper to return an error directly, aligning with the updated RPC signature.
    • Added a new DeleteAttempts helper function for making RPC calls to the DeleteAttempts endpoint in integration tests.
Activity
  • The pull request introduces a new RPC, DeleteAttempts, and refactors error handling for SendOnion. No specific human activity like comments or reviews are provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable DeleteAttempts RPC to switchrpc for garbage collecting terminal HTLC attempts, which is a great feature for external routers. The accompanying overhaul of the SendOnion RPC to provide more robust, structured error handling is also a significant improvement. The code is well-structured and includes thorough testing for the new functionality.

My main concern is an inconsistency in the DeleteAttempts API regarding the DELETION_ALREADY_DELETED status. The API documentation and enums define this status, but the underlying implementation currently returns DELETION_NOT_FOUND for already-deleted attempts. This could lead to confusion for clients. I've provided detailed comments with suggestions on how to address this.

Comment thread htlcswitch/payment_result.go Outdated
Comment thread lnrpc/switchrpc/switch.proto Outdated
Comment thread htlcswitch/payment_result_test.go Outdated
@saubyk saubyk added this to v0.21 Feb 23, 2026
@saubyk saubyk added this to the v0.21.0 milestone Feb 23, 2026
@saubyk saubyk moved this to In progress in v0.21 Feb 23, 2026
@calvinrzachman calvinrzachman force-pushed the switchrpc-delete-attempts branch from d460d84 to 27c2b33 Compare February 24, 2026 14:10
Copy link
Copy Markdown
Collaborator

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concept looks good to me 👍

Comment thread htlcswitch/payment_result.go
Comment thread htlcswitch/payment_result_test.go Outdated
Comment thread htlcswitch/payment_result.go Outdated
Comment thread htlcswitch/payment_result.go Outdated
@calvinrzachman calvinrzachman force-pushed the switchrpc-delete-attempts branch from 27c2b33 to 53fa0ca Compare February 26, 2026 18:07
@lightninglabs-deploy lightninglabs-deploy added the severity-critical Requires expert review - security/consensus critical label Feb 26, 2026
@calvinrzachman
Copy link
Copy Markdown
Collaborator Author

I'll fix the lint job complaining about unnecessary use of nolint and release notes on next push.

Copy link
Copy Markdown
Collaborator

@bitromortac bitromortac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! LGTM 🎉

// record was fully removed.
err = store.InitAttempt(id)
require.NoError(t, err)
})
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You could add a test case for an unreadable message

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test case which writes corrupt bytes directly to the store and verifies DeletionFailed is returned.

@lightninglabs-deploy
Copy link
Copy Markdown
Collaborator

@ziggie1984: review reminder
@calvinrzachman, remember to re-request review from reviewers when ready

@ziggie1984
Copy link
Copy Markdown
Collaborator

The decision to skip deletion of an undeserializable entry deserves a second look.

The deserialization here exists solely to check the pendingHtlcMsgType guard — i.e. to avoid deleting a live in-flight HTLC. But if the bytes are corrupt, that guard cannot be evaluated, and the entry is left permanently in the store with no recovery path via this API.

A corrupt entry is not a safe guard. When InitAttempt is called for the same attempt ID, fetchResult tries to deserialize the same corrupt bytes and returns a deserialization error — not ErrPaymentIDNotFound. That propagates as ErrAmbiguousAttemptInit, which maps to codes.Unavailable / HTLC_INIT_FAILED on the RPC layer. The documented advice for that error is "retry to resolve the ambiguity". But retrying will never resolve it: the corrupt bytes are still there on every retry, the same error is returned, and the client loops forever with no way to make progress.

By contrast, the intended ErrAmbiguousAttemptInit scenario (a transient DB write failure) does resolve on retry — either the entry is absent and the write succeeds, or it is present and ErrPaymentIDAlreadyExists is returned cleanly. A corrupt entry produces none of those outcomes.

Deleting the corrupt entry and logging a warning would restore the ability for the client to make progress. The entry cannot deliver a result to any subscriber (deserialization would fail there too), so keeping it only permanently poisons the attempt ID.

Copy link
Copy Markdown
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good Calvin 🙌

had some minor questions and comments.

"for attempt %d during delete: %v",
id, err)

results[id] = DeletionFailed
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see this #10602 (comment)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deserialization failure scenario you describe is real, but I don't think the "retry loop" applies under the expected API usage:

  1. SendOnion → get ACK (idempotency record created)
  2. TrackOnion → get SETTLE/FAIL (record final result in client DB)
  3. DeleteAttempts → clean up (record + duplicate prevention removed)

A well-behaved client must never call DeleteAttempts for an ID it hasn't fully finalized. Such a client shouldn't have need to call SendOnion again for that ID, so I don't think there'd be a perpetual loop with ErrAmbiguousAttemptInit.

CleanStore (which the ChannelRouter calls at startup) already deletes corrupt entries without checking their state since it operates purely on keys. So DeletionFailed is actually more cautious than normal lnd flow in this regard. I could go either way on this — happy to just delete and log a warning if you'd prefer, or keep DeletionFailed for now.

The concern about InitAttempt returning ErrAmbiguousAttemptInit indefinitely for a corrupt ID is real but is more of a SendOnion error classification problem (distinguishing transient DB failures from corruption) than a DeleteAttempts design question. Worth a follow-up that improves error classification across the store operations.

Comment thread htlcswitch/payment_result.go
Comment thread htlcswitch/payment_result_test.go Outdated
Comment thread lnrpc/switchrpc/mock.go Outdated
// simulate real idempotency behavior. Unlike the basic mockAttemptStore, this
// mock maintains state across calls, making it suitable for testing multi-step
// interactions between SendOnion and DeleteAttempts.
type statefulAttemptStore struct {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statefulAttemptStore mock diverges from real store behaviour in a meaningful way: it treats every initialized attempt as immediately deletable (DeletionOK), whereas the real networkResultStore returns DeletionPending for any attempt that is still in-flight (stored with pendingHtlcMsgType). This means TestSendOnionAfterDelete would pass even if the code incorrectly allowed deletion of live in-flight attempts.

Consider dropping statefulAttemptStore entirely and wiring the test against the real networkResultStore backed by an in-memory bbolt database — the same pattern already used in htlcswitch/payment_result_test.go (e.g. newTestNetworkResultStore). That would give the test genuine fidelity: a just-initialized (in-flight) attempt would correctly block deletion (DeletionPending), and only after a terminal result is stored would DeleteAttempts return DeletionOK — exactly the property this test intends to prove.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real networkResultStore (and networkResult) are unexported from htlcswitch so we can't wire the test against the real store from the switchrpc package, at least at the moment. I think you're probably right that we're testing the mock a bit too much here, so I removed statefulAttemptStore and TestSendOnionAfterDelete and replaced them with a property-based test (TestDeleteAttemptsProperties) in htlcswitch/payment_result_test.go that runs against the real bbolt-backed store. Got the idea here.

Comment thread lnrpc/switchrpc/switch_server.go Outdated
seen := make(map[uint64]struct{}, len(req.AttemptIds))
for _, id := range req.AttemptIds {
if _, ok := seen[id]; ok {
return nil, status.Errorf(codes.InvalidArgument,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want to be so strict tho, just ignore the redundant ID maybe ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we are depulicating acutally now at two different call sites ? seems redundant ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on both counts. The strictness can help alert a buggy caller which should not be including duplicates. I changed the RPC layer to silently deduplicate instead of rejecting — duplicates are collapsed before reaching the store. I think the store-level duplicate check should stay as a safety net since the per-ID mutex will deadlock if the same ID is locked twice on the same go-routine. RPC callers won't hit it, but if DeleteAttempts is ever called internally without the RPC layer, it'll be nice to have the check in there.

Comment thread lnrpc/switchrpc/switch_server.go Outdated
seen[id] = struct{}{}
}

storeResults, err := s.cfg.AttemptStore.DeleteAttempts(req.AttemptIds)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this function already takes a map ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the map could help with duplicates, but the benefit may be somewhat limited because the store needs ordered iteration for lock acquisition (slices.Sort before locking) so concurrent callers don't cause potential deadlock. Since maps have random iteration order, we'd have to extract keys into a slice and sort anyway. The RPC layer now handles dedup less strictly like you mentioned so I think we're probably okay here.


// TestDeletionStatusToProto verifies the mapping from internal DeletionStatus
// values to their proto enum counterparts.
func TestDeletionStatusToProto(t *testing.T) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestDeletionStatusToProto does not enforce exhaustiveness: if a new DeletionStatus value is added to the iota without a matching proto entry and test case, the test still passes because the default branch silently swallows unknown values.

Consider adding a sentinel numDeletionStatuses at the end of the iota and asserting len(testCases)-1 == int(htlcswitch.numDeletionStatuses) at the top of the test (the -1 accounts for the existing unknown/default test case). That way any new internal status value forces a test update, and by extension a proto and mapping update too.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it okay if it doesn't enforce exhaustiveness? The test covers all 4 statuses plus the unknown-value default. That's an interesting idea re the sentinel, but do you think it's worth exporting a variable from htlcswitch package for this? It also may not as readily apply if the mapping from internal to external states isn't 1:1 like seen with convertPaymentStatus in lnrpc/routerrpc package.

@ziggie1984
Copy link
Copy Markdown
Collaborator

needs a rebase

Add a "positive space" deletion primitive for the network
result store. Unlike CleanStore (which deletes everything
except a keep-set and is vulnerable to stale-snapshot races),
DeleteAttempts accepts an explicit list of attempt IDs to
remove.

The implementation acquires per-ID mutexes, checks each
record's state within a single kvdb transaction, and only
deletes terminal (settled or failed) results. Pending
attempts are protected by a server-side guard.

Per-ID results are returned so callers get actionable
feedback: OK, Pending, Failed, or NotFound.
Expose the store-layer DeleteAttempts as a gRPC endpoint
on the Switch sub-server, allowing remote callers to
garbage-collect terminal attempt records by explicitly
naming the finished IDs.

A unit test with a stateful mock demonstrates that hard
delete removes the idempotency protection, allowing
re-dispatch with the same ID.
Extend testSendOnionTwice to call DeleteAttempts after
the payment settles, verifying that removing a terminal
attempt returns DELETION_OK, and a subsequent delete of
the same ID returns DELETION_NOT_FOUND.
@calvinrzachman calvinrzachman force-pushed the switchrpc-delete-attempts branch from 716a3c2 to 040471f Compare March 18, 2026 18:00
Copy link
Copy Markdown
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ziggie1984 ziggie1984 merged commit 4ab70be into lightningnetwork:elle-base-branch-payment-service Mar 19, 2026
42 of 43 checks passed
@github-project-automation github-project-automation Bot moved this from In progress to Done in v0.21 Mar 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

payments-v2 severity-critical Requires expert review - security/consensus critical

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants