fix: upload reliability hardening -- retry, fault tolerance, spill cleanup #26
Merged
mickvandijke merged 12 commits into main on Apr 3, 2026
Conversation
…indexing

- Add expected_data_size parameter to collect_validated_candidates() to reject nodes that return a tampered data_size in quoting metrics
- Fix direct indexing (addresses[i]) -> safe zip iterator (pre-existing project rule violation)
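The indexing fix above can be sketched as follows. This is an illustrative stand-alone snippet, not the project's real API: `pair_quotes` and its argument types are hypothetical, but it shows the pattern of replacing `addresses[i]` with a zip iterator.

```rust
/// Pair each address with its quote. `zip` stops at the shorter collection,
/// so a length mismatch can never cause an out-of-bounds panic the way
/// direct indexing with `addresses[i]` can.
fn pair_quotes(addresses: &[u64], quotes: &[u32]) -> Vec<(u64, u32)> {
    addresses.iter().copied().zip(quotes.iter().copied()).collect()
}

fn main() {
    let paired = pair_quotes(&[0xaa, 0xbb, 0xcc], &[10, 20, 30]);
    assert_eq!(paired, vec![(0xaa, 10), (0xbb, 20), (0xcc, 30)]);
    // A mismatched length is truncated rather than panicking:
    assert_eq!(pair_quotes(&[0xaa, 0xbb], &[10]).len(), 1);
}
```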
Query CLOSE_GROUP_SIZE * 2 (10) peers from DHT, send quote requests to all concurrently, and keep the CLOSE_GROUP_SIZE (5) closest successful responders sorted by XOR distance. This tolerates up to 5 peer failures (timeout, bad quote, etc.) without aborting the entire quote collection.
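The selection step described above can be sketched as a pure function over the collected responses. Names and the shortened `u64` peer ids are assumptions for the sketch; the real client operates on full-width keys and async network responses.

```rust
const CLOSE_GROUP_SIZE: usize = 5;

/// XOR distance between a peer id and the target chunk address
/// (ids shortened to u64 for this sketch).
fn xor_distance(peer: u64, target: u64) -> u64 {
    peer ^ target
}

/// Keep the CLOSE_GROUP_SIZE closest successful responders.
/// `responses` pairs each queried peer with Some(quote) on success, or
/// None for a timeout / bad quote.
fn select_closest(responses: Vec<(u64, Option<&'static str>)>, target: u64) -> Vec<u64> {
    let mut ok: Vec<u64> = responses
        .into_iter()
        .filter_map(|(peer, quote)| quote.map(|_| peer))
        .collect();
    ok.sort_by_key(|&peer| xor_distance(peer, target));
    ok.truncate(CLOSE_GROUP_SIZE);
    ok
}

fn main() {
    // 10 queried peers, 4 failed (None): we still fill a close group of 5.
    let responses = vec![
        (0b0001, Some("q")), (0b0010, None),      (0b0011, Some("q")),
        (0b0100, Some("q")), (0b0101, None),      (0b0110, Some("q")),
        (0b0111, None),      (0b1000, Some("q")), (0b1001, Some("q")),
        (0b1010, None),
    ];
    let group = select_closest(responses, 0b0000);
    assert_eq!(group.len(), CLOSE_GROUP_SIZE);
    assert_eq!(group[0], 0b0001); // smallest XOR distance to the target
}
```

Because only successful responders are ranked, up to half the queried peers can fail before the close group cannot be filled.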
…racking
- store_paid_chunks() now retries failed chunks up to 3 times with
exponential backoff (500ms, 1s, 2s) before giving up
- Returns WaveResult { stored, failed } instead of discarding partial
successes on first error
- Add PartialUpload error variant that carries both stored addresses
and failed chunk details so callers can track progress
- PaidChunk is now Clone to support retry
- batch_upload_chunks() accumulates stored addresses across waves and
reports them even on partial failure
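The retry and partial-result shape described above can be sketched like this. Function and type names here are stand-ins (the real code returns `WaveResult { stored, failed }` and actually sleeps between attempts); the sketch shows the 500ms/1s/2s schedule and the accumulation of partial successes.

```rust
use std::time::Duration;

const MAX_RETRIES: u32 = 3;

/// Backoff before retry `attempt` (0-based): 500ms, 1s, 2s.
fn backoff(attempt: u32) -> Duration {
    Duration::from_millis(500 * 2u64.pow(attempt))
}

/// Try storing each chunk, retrying failures up to MAX_RETRIES times.
/// Returns (stored, failed) instead of aborting on the first error.
fn store_with_retry<F>(chunks: Vec<u32>, mut store: F) -> (Vec<u32>, Vec<u32>)
where
    F: FnMut(u32) -> bool,
{
    let mut stored = Vec::new();
    let mut failed = Vec::new();
    for chunk in chunks {
        let mut ok = false;
        for attempt in 0..=MAX_RETRIES {
            if store(chunk) {
                ok = true;
                break;
            }
            if attempt < MAX_RETRIES {
                // The real client sleeps here; elided in the sketch.
                let _delay = backoff(attempt);
            }
        }
        if ok { stored.push(chunk) } else { failed.push(chunk) }
    }
    (stored, failed)
}

fn main() {
    assert_eq!(backoff(0), Duration::from_millis(500));
    assert_eq!(backoff(2), Duration::from_millis(2000));
    // Chunk 7 always fails; partial success is reported, not discarded.
    let (stored, failed) = store_with_retry(vec![1, 7, 9], |c| c != 7);
    assert_eq!(stored, vec![1, 9]);
    assert_eq!(failed, vec![7]);
}
```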
When pay_for_merkle_multi_batch fails on sub-batch N, return proofs from sub-batches 1..N-1 instead of discarding them. This prevents losing already-paid tokens when a later sub-batch fails -- callers can still store the chunks that were paid for.
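A minimal sketch of that error shape, with hypothetical names (the real `pay_for_merkle_multi_batch` deals in payment proofs and EVM calls): on failure of sub-batch N, the proofs from sub-batches 1..N-1 travel with the error instead of being dropped.

```rust
/// Pay sub-batches in order. On failure, return the proofs already obtained
/// alongside the error so callers can still store the chunks that were paid for.
fn pay_sub_batches(
    batches: &[u32],
    pay: impl Fn(u32) -> Result<String, String>,
) -> Result<Vec<String>, (Vec<String>, String)> {
    let mut proofs = Vec::new();
    for &batch in batches {
        match pay(batch) {
            Ok(proof) => proofs.push(proof),
            // Preserve earlier proofs rather than discarding paid tokens.
            Err(err) => return Err((proofs, err)),
        }
    }
    Ok(proofs)
}

fn main() {
    // Sub-batch 3 fails; proofs for 1 and 2 survive.
    let result = pay_sub_batches(&[1, 2, 3], |b| {
        if b == 3 { Err("gas spike".into()) } else { Ok(format!("proof-{b}")) }
    });
    let (kept, err) = result.unwrap_err();
    assert_eq!(kept, vec!["proof-1".to_string(), "proof-2".to_string()]);
    assert_eq!(err, "gas spike");
}
```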
…mode

- file_prepare_upload: replace sequential for-loop with concurrent buffer_unordered pattern (same as prepare_wave) for quote collection
- data_prepare_upload: same concurrent fix
- Add payment_mode field to PreparedUpload so finalize_upload reports the actual mode used instead of hardcoding PaymentMode::Single
- New e2e_upload_costs test: uploads files at 200MB, 1GB, 4GB, 8GB in both Single and Merkle modes; reports ANT cost, gas cost, chunk count, and EVM transaction count in a formatted table
- New 8GB test in e2e_huge_file for memory bounding verification
- Increase testnet to 20 nodes (merkle needs CANDIDATES_PER_POOL=16)
- Create separate files with different seeds for Single vs Merkle to prevent AlreadyStored when uploading the same content twice
Measured results on a local Anvil testnet (20 nodes):

| File size | Chunks | Single | Merkle | Savings |
| --- | --- | --- | --- | --- |
| 10MB | 3 | 108K gwei | 172K gwei | overhead |
| 50MB | 16 | 278K gwei | 177K gwei | 36% |
| 200MB | 54 | 596K gwei | 160K gwei | 73% |
| 500MB | 129 | 1041K gwei | disk full on testnet | n/a |

Gas savings increase with file size. Merkle breaks even around ~10 chunks and delivers a 3-4x gas reduction at 50+ chunks.
- Spill dirs now live under <data_dir>/spill/ instead of system temp (e.g. ~/Library/Application Support/ant/spill/ on macOS)
- Dir names include a Unix timestamp: <timestamp>_<random>
- On each new upload, stale spill dirs older than 24h are cleaned up (catches orphans from crashed/killed processes)
- Disk space check now queries the spill root instead of the temp dir
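The naming and staleness rules above can be sketched as pure functions. The exact dir-name format (`spill_<timestamp>_<random>`, combining this commit's scheme with the `spill_` prefix from the follow-up) and the function names are assumptions for the sketch.

```rust
const STALE_AFTER_SECS: u64 = 24 * 60 * 60; // 24h staleness threshold

/// Hypothetical dir-name scheme: spill_<unix_timestamp>_<random>.
fn spill_dir_name(timestamp: u64, random: u32) -> String {
    format!("spill_{timestamp}_{random:08x}")
}

/// Recover the creation timestamp; None for entries we did not create.
fn parse_timestamp(name: &str) -> Option<u64> {
    name.strip_prefix("spill_")?.split('_').next()?.parse().ok()
}

/// A dir is stale if it is ours and older than 24h. A broken system clock
/// (now == 0) skips cleanup entirely rather than deleting everything.
fn is_stale(name: &str, now: u64) -> bool {
    if now == 0 {
        return false;
    }
    match parse_timestamp(name) {
        Some(ts) => now.saturating_sub(ts) > STALE_AFTER_SECS,
        None => false,
    }
}

fn main() {
    let name = spill_dir_name(1_700_000_000, 0xdead_beef);
    assert_eq!(name, "spill_1700000000_deadbeef");
    assert!(is_stale(&name, 1_700_000_000 + STALE_AFTER_SECS + 1));
    assert!(!is_stale(&name, 1_700_000_000 + 60)); // fresh upload kept
    assert!(!is_stale("user_data", u64::MAX));     // foreign entry kept
    assert!(!is_stale(&name, 0));                  // broken clock: skip
}
```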
…prefix

Review-driven fixes for ChunkSpill:

- Add a lockfile (.lock) inside each spill dir, held for the upload's lifetime via an fs2 exclusive lock. cleanup_stale skips dirs with active locks, preventing deletion of in-progress uploads.
- Only delete entries starting with the 'spill_' prefix, preventing accidental deletion of unrelated files in the spill root.
- Check entry.file_type().is_dir() before remove_dir_all to prevent symlink attacks (following symlinks to delete arbitrary dirs).
- Skip cleanup entirely when the system clock is broken (timestamp 0).
- Add a public run_cleanup() method for client startup use.
- Import crate::config at function scope instead of via an inline path.
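The filtering rules above can be condensed into one predicate. This is a hedged sketch: `deletable` and its inputs are hypothetical, the `try_lock` closure stands in for acquiring an fs2 exclusive lock on `<dir>/.lock`, and the `is_dir` flag stands in for the `entry.file_type().is_dir()` check.

```rust
/// Decide which spill-root entries are safe to delete: real directories only,
/// `spill_`-prefixed only, and only when no live upload holds the lockfile.
fn deletable<'a>(
    entries: &[(&'a str, bool)],      // (name, is_dir)
    try_lock: impl Fn(&str) -> bool,  // stand-in for fs2 try_lock_exclusive
) -> Vec<&'a str> {
    entries
        .iter()
        .filter(|&&(name, is_dir)| {
            is_dir                          // never follow symlinks or files
                && name.starts_with("spill_") // never touch foreign entries
                && try_lock(name)           // skip in-progress uploads
        })
        .map(|&(name, _)| name)
        .collect()
}

fn main() {
    let entries = [
        ("spill_100_aa", true),  // stale, unlocked -> deletable
        ("spill_200_bb", true),  // locked by a live upload -> kept
        ("spill_300_cc", false), // symlink or file, not a dir -> kept
        ("user_data", true),     // no spill_ prefix -> kept
    ];
    let out = deletable(&entries, |name| name != "spill_200_bb");
    assert_eq!(out, vec!["spill_100_aa"]);
}
```

Pushing all three guards into one place makes it harder for a future refactor to reintroduce the symlink or foreign-file deletion paths.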
mickvandijke approved these changes on Apr 3, 2026
Summary
Production hardening of the upload and payment pipeline:
Upload reliability:

- `store_paid_chunks()` retries failed chunks with exponential backoff
- `PartialUpload` error variant instead of silently losing stored chunks

Quote collection fault tolerance:

- Query 2x the close group from the DHT and keep the closest successful responders, tolerating individual peer failures

Merkle candidate validation:

- Validate `data_size` on merkle candidate responses (prevents pricing manipulation)
- Fix direct indexing (`addresses[i]`) to safe zip iterator

External signer improvements:

- Concurrent quote collection (`buffer_unordered`)
- `payment_mode` tracking in `PreparedUpload`

Chunk spill hardening:

- Spill dirs under `data_dir/spill/` (predictable, persistent location)
- Cleanup only touches `spill_`-prefixed dirs

Independent
This PR is self-contained. Uses only published evmlib 0.5.0 and ant-node 0.9.0 from crates.io. No dependencies on other PRs.
Test plan