
feat: restore sequencer #3061

Open
alpe wants to merge 4 commits into main from alex/2521_sync

Conversation

@alpe
Contributor

@alpe alpe commented Feb 10, 2026

Resolves #2521

Overview

This PR implements the Sequencer Recovery mode, enabling a node to restore its state from the Data Availability (DA) layer and P2P network before assuming the role of a sequencer.

Key Changes:

  • Configuration: Added the node.sequencer_recovery start parameter, allowing a configurable timeout for the catch-up phase.
  • Recovery Logic: Implemented sequencerRecoveryElector to manage the lifecycle:
    • Starts the node in Sync/Follower mode.
    • Monitors DA and P2P sync progress using monitorCatchup.
    • Ensures the local store is populated (storeHeight > 0) before transition.
    • Promotes the node to Aggregator mode once fully caught up.
  • Integration Test: Added TestSequencerRecoveryFromDA to verify the end-to-end recovery flow, ensuring the node successfully syncs historical blocks and resumes block production.

@github-actions
Contributor

github-actions bot commented Feb 10, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Feb 17, 2026, 3:51 PM

@claude
Contributor

claude bot commented Feb 10, 2026

Claude finished @alpe's task in 2m 24s


Code Review

  • Gather context (read changed files)
  • Analyze architecture and code quality
  • Review error handling and resilience
  • Review testing coverage
  • Post comprehensive review

Summary

This PR implements an aggregator catchup phase that allows a sequencer to sync from DA and P2P before starting block production. The implementation is well-structured and addresses a real operational need. The PR has already undergone significant refactoring from the original description (renaming SequencerRecovery to CatchupTimeout, consolidating the sequencerRecoveryElector into failoverState directly). Below is a detailed review.


Architecture / Design

✅ Good: Consolidation into failoverState
The catchup phase is now built directly into failoverState.Run() (node/failover.go:228-232) rather than as a separate sequencerRecoveryElector struct. This is cleaner and avoids an extra abstraction layer.

✅ Good: Mutual exclusion enforcement
The MarkFlagsMutuallyExclusive(FlagCatchupTimeout, FlagRaftEnable) call at pkg/config/config.go:567 and the validation at lines 413-415 both enforce that catchup and Raft can't be combined, which is correct.

⚠️ Issue: daHeadReached never resets — semantics may surprise
In block/internal/syncing/syncer.go:411, daHeadReached is set to true and never reset, even when a new error occurs. The comment says "stays true once DA head is seen" — this is intentional for the catchup use case, but if the same Syncer instance is ever reused across multiple catchup phases or in long-running scenarios, the flag becomes stale. Since the syncer is nilled out after the catchup phase (failover.go:267), this is safe for the current use case, but the potential footgun is worth a brief doc comment at the field declaration site in addition to the method.
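
The latch semantics described here (set once, intentionally never reset) can be sketched with a sync/atomic.Bool. The type and helper names below are stand-ins for the Syncer's actual fields:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// daTracker sketches the one-way latch behavior (illustrative type,
// not the PR's Syncer).
type daTracker struct {
	// daHeadReached latches to true the first time the DA head is
	// observed and is never reset; an instance must therefore not be
	// reused across multiple catch-up phases.
	daHeadReached atomic.Bool
}

func (t *daTracker) markDAHeadSeen() { t.daHeadReached.Store(true) }

func (t *daTracker) HasReachedDAHead() bool { return t.daHeadReached.Load() }

func main() {
	var t daTracker
	fmt.Println(t.HasReachedDAHead()) // false before the DA head is seen
	t.markDAHeadSeen()
	fmt.Println(t.HasReachedDAHead()) // true, and stays true from now on
}
```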

⚠️ Issue: waitForCatchup context cancellation returns (false, nil) instead of the context error
node/failover.go:293:

case <-ctx.Done():
    return false, nil

When the context is cancelled, runCatchupPhase calls ctx.Err() at line 264 to propagate the error. This is a two-step dance that works but is subtle. A cleaner pattern would be return false, ctx.Err() and let the caller check for context.Canceled — this makes the error path more explicit and consistent with how the rest of the codebase handles cancellation. As written, returning (false, nil) with a nil error during shutdown could look like a successful-but-inconclusive state rather than a graceful shutdown.

⚠️ Issue: P2P "caught up" when timeout == 0 and no peers
node/failover.go:313-315:

if !ignoreP2P && f.catchupTimeout == 0 && maxP2PHeight == 0 {
    p2pCaughtUp = true
}

This special case (set P2P as caught up if timeout is zero and no P2P height yet) exists to handle the "DA-only" mode. However, catchupTimeout == 0 means the catchup feature is disabled entirely (see setupFailoverState:162), so runCatchupPhase would never be called with catchupTimeout == 0. This dead code branch is confusing and should be removed or have a comment explaining the invariant.

⚠️ Issue: Shared cache between catchup syncer and future executor
In block/components.go:362-363, the catchup syncer shares the same bc.Cache as the executor components. This is fine for correctness (the cache is populated during catchup, and the executor benefits from it), but there's a subtle risk: if the catchup syncer and executor both write to the cache concurrently during the transition window, race conditions could occur. The current code stops the syncer before starting the executor (failover.go:257-258 stops syncer, then 235-244 starts block production), so this is safe, but the documentation at block/components.go:327-333 should make this ordering constraint explicit.


Error Handling & Resilience

✅ Good: Deferred syncer stop
node/failover.go:257: defer f.bc.Syncer.Stop() correctly ensures the syncer is stopped even if waitForCatchup returns an error.

⚠️ Issue: catchupErrCh in NewAggregatorWithCatchupComponents is unread
block/components.go:358-371:

catchupErrCh := make(chan error, 1)
catchupSyncer := syncing.NewSyncer(..., catchupErrCh, ...)

The catchupErrCh channel is created and passed to the syncer, but no goroutine reads from it. If the catchup syncer hits a critical error, it will try to send on this channel and block (or drop, since it's buffered with size 1). After the catchup phase completes, f.bc.Syncer is set to nil, so the error is silently discarded. This means critical errors during the catchup phase (e.g., executor failures) will be swallowed. Consider draining and logging this channel, or checking it during waitForCatchup.

✅ Good: Context handling in runCatchupPhase
The syncer is started and stopped properly, and the child context from errgroup is correctly threaded through.


Performance & Resource Efficiency

⚠️ Issue: time.After timer is not cleaned up once timeoutCh is nilled
node/failover.go:278-279:

ticker := time.NewTicker(pollInterval)

The polling ticker is correctly created once. However, time.After(f.catchupTimeout) at line 284 creates a timer that is referenced by timeoutCh but once timeoutCh is set to nil (line 297), the underlying timer goroutine from time.After continues to run until it fires. For short timeout values this is negligible, but use time.NewTimer for proper cleanup.


Testing

✅ Good: Integration tests cover both DA-only and P2P recovery paths

⚠️ Issue: Stale comment in test file
node/sequencer_recovery_integration_test.go:305:

// stopNodeAndCleanup stops a running node and calls its cleanup function.

This comment references a function (stopNodeAndCleanup) that doesn't exist in the file (noted by Gemini too). It's a leftover that should be removed.

⚠️ Issue: Magic number 1 for StatusSuccess
node/sequencer_recovery_integration_test.go:329:

if res.Code == 1 { // StatusSuccess

Use datypes.StatusSuccess constant for readability and robustness.

⚠️ Issue: peerAddress uses peerID.Loggable()
node/sequencer_recovery_integration_test.go:253:

return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))

This relies on the internal map structure of .Loggable() and the peerID key name. Use peerID.String() instead for a stable, public API.

⚠️ Issue: TestSequencerRecoveryFromP2P hash assertion is non-deterministic
node/sequencer_recovery_integration_test.go:205-218: The test currently checks if hashes match but logs instead of asserting when they don't, effectively making the hash verification optional. This means the test can pass even if the recovery node produced a fork from the original chain. Consider either making this a hard assertion or documenting why a fork is acceptable.

⚠️ Issue: requireNodeStartedSuccessfully uses time.Sleep
node/sequencer_recovery_integration_test.go:311:

time.Sleep(waitTime)

This creates a test timing dependency. It's better to wait for a specific observable condition (e.g., the node is serving RPC, or has produced at least one block) rather than sleeping.


Code Quality

✅ Good: Flag naming consistency
The flag is named FlagCatchupTimeout (evnode.node.catchup_timeout) which is clear and consistent with existing patterns.

✅ Good: Responding to reviewer feedback
The MarkFlagsMutuallyExclusive(FlagCatchupTimeout, FlagRaftEnable) at line 567 directly addresses julienrbrt's comment about making them mutually exclusive via Cobra. The Validate() function in config.go also provides a defense-in-depth check.

⚠️ Note on tac0turtle's architectural concern
The reviewer suggested this should be "automatic" — worth discussing whether CatchupTimeout should be implicitly enabled for all aggregators without Raft (e.g., with a very high default timeout) rather than requiring explicit opt-in. The current opt-in design is safer but may lead to operators missing it.

⚠️ Issue: da_visualization.go additions seem unrelated
The DA visualization file (pkg/rpc/server/da_visualization.go) has 262 additions. These appear to be a large self-contained HTML dashboard/API addition. It's unclear how this relates to sequencer recovery. Consider splitting into a separate PR for easier review.


Minor Nits

  • block/components.go:334: The doc comment for NewAggregatorWithCatchupComponents is helpful, but the 3-step caller contract could be enforced by a higher-level wrapper rather than documented in the callee.
  • node/failover.go:163-165: The bc.Syncer = nil line when !catchupEnabled is somewhat surprising — adding a comment explaining why nil prevents syncer from running in production aggregator mode would help.

Priority Summary

Severity | Issue
🔴 High | catchupErrCh is never read — critical errors during catchup are silently discarded
🟡 Medium | waitForCatchup returns (false, nil) on ctx cancellation — unclear semantics
🟡 Medium | TestSequencerRecoveryFromP2P hash assertion is optional — reduces test value
🟡 Medium | Dead code branch catchupTimeout == 0 in waitForCatchup
🟢 Low | time.After timer leak when P2P timeout is configured
🟢 Low | Stale comment referencing stopNodeAndCleanup
🟢 Low | Magic number 1 instead of datypes.StatusSuccess
🟢 Low | peerID.Loggable() vs peerID.String()

@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience of sequencer nodes by implementing a dedicated recovery mechanism. It enables a sequencer to gracefully recover from data loss by first synchronizing its state from the Data Availability layer and the P2P network, and then seamlessly resuming its role as an aggregator. This ensures continuous operation and data integrity, particularly in environments where traditional consensus mechanisms like Raft are not employed.

Highlights

  • Sequencer Recovery Mechanism: Introduced a new sequencerRecoveryElector that allows a sequencer node to start in a synchronization mode, catching up from both the Data Availability (DA) layer and the peer-to-peer (P2P) network before transitioning to its aggregator role. This is designed for single-sequencer setups without Raft.
  • DA Head Tracking: The Syncer now includes a daHeadReached atomic boolean and a HasReachedDAHead method to signal when the DA worker has successfully caught up to the DA head, which is crucial for the recovery process.
  • Configuration and Validation: A new configuration option, Node.SequencerRecovery, has been added to enable this recovery mode. Validation ensures that this mode can only be activated when the node is configured as an aggregator.
  • Integration Tests: Comprehensive integration tests have been added to verify the sequencer recovery functionality from both DA and P2P sources, ensuring robustness in disaster recovery scenarios.


Changelog
  • block/internal/syncing/syncer.go
    • Added daHeadReached atomic boolean to track if the DA worker has reached the DA head.
    • Updated daWorkerLoop to set daHeadReached to true when caught up.
    • Exposed HasReachedDAHead method for external checking.
  • node/failover.go
    • Introduced sequencerRecoveryElector to manage the recovery process.
    • Implemented Run, startSyncPhase, startAggregatorPhase, and waitForCatchup methods within sequencerRecoveryElector to orchestrate the sync-then-aggregate flow.
  • node/full.go
    • Modified newFullNode to conditionally use newSequencerRecoveryElector when Node.Aggregator is true, Node.SequencerRecovery.Duration is greater than zero, and Raft is not enabled.
  • node/sequencer_recovery_integration_test.go
    • Added new integration tests to verify sequencer recovery from DA and P2P.
  • pkg/config/config.go
    • Defined FlagSequencerRecovery for the new configuration option.
    • Added SequencerRecovery field to NodeConfig struct.
    • Added validation to ensure SequencerRecovery mode requires Aggregator mode.
    • Added FlagSequencerRecovery to Cobra flags.
  • pkg/config/config_test.go
    • Updated expected flag count.
  • pkg/config/defaults.go
    • Set default value for SequencerRecovery to 0.
  • test/testda/dummy.go
    • Added BlobCount method to DummyDA for testing purposes.


@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 49.51456% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.00%. Comparing base (1da7634) to head (9bbeeca).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
block/components.go | 0.00% | 27 Missing ⚠️
node/failover.go | 71.21% | 11 Missing and 8 partials ⚠️
pkg/config/config.go | 33.33% | 2 Missing and 2 partials ⚠️
block/internal/syncing/syncer.go | 33.33% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3061      +/-   ##
==========================================
- Coverage   61.04%   61.00%   -0.05%     
==========================================
  Files         113      113              
  Lines       11440    11544     +104     
==========================================
+ Hits         6984     7042      +58     
- Misses       3674     3703      +29     
- Partials      782      799      +17     
Flag | Coverage | Δ
combined | 61.00% <49.51%> | -0.05% ⬇️




Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a sequencer recovery mechanism, a valuable feature for disaster recovery. The implementation allows a sequencer to start in sync mode, catch up from DA and P2P, and then transition to an aggregator. The logic is well-structured within the new sequencerRecoveryElector, and the catch-up conditions are robust, including a P2P timeout. The addition of comprehensive integration tests for both DA-only and P2P-assisted recovery scenarios is commendable. I've identified a few minor areas for improvement in the test code to enhance maintainability and robustness.

t.Helper()
peerID, err := peer.IDFromPrivateKey(nodeKey.PrivKey)
require.NoError(t, err)
return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))


Severity: medium

The peer.ID.Loggable() method is primarily for logging purposes, and its output format may not be stable across library versions. It's more robust to use peer.ID.String(), which is the public API for getting the canonical string representation of a peer ID.

Suggested change:

- return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.Loggable()["peerID"].(string))
+ return fmt.Sprintf("%s/p2p/%s", listenAddr, peerID.String())

return recNode.(*FullNode), recStopTicker
}

// stopNodeAndCleanup stops a running node and calls its cleanup function.


Severity: medium

This comment refers to a function stopNodeAndCleanup that doesn't exist in the file. It appears to be a leftover and should be removed to avoid confusion.

Comment on lines +305 to +313
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
}
res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
if res.Code == 1 { // StatusSuccess
blobsFound += len(res.Data)
t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
}


Severity: medium

The magic number 1 is used to check for success status. For better readability and maintainability, please use the datypes.StatusSuccess constant. You'll need to add the following import:

import (
	// ... other imports
	datypes "github.com/evstack/ev-node/pkg/da/types"
)
Suggested change:

- if res.Code == 1 { // StatusSuccess
+ if res.Code == datypes.StatusSuccess {
  	blobsFound += len(res.Data)
  	t.Logf("DIAG: found %d header blob(s) at DA height %d (Success)", len(res.Data), h)
  }
  res = sharedDA.Retrieve(context.Background(), h, sharedDA.GetDataNamespace())
- if res.Code == 1 { // StatusSuccess
+ if res.Code == datypes.StatusSuccess {
  	blobsFound += len(res.Data)
  	t.Logf("DIAG: found %d data blob(s) at DA height %d (Success)", len(res.Data), h)
  }

cmd.Flags().Uint64(FlagReadinessWindowSeconds, def.Node.ReadinessWindowSeconds, "time window in seconds for calculating readiness threshold based on block time (default: 15s)")
cmd.Flags().Uint64(FlagReadinessMaxBlocksBehind, def.Node.ReadinessMaxBlocksBehind, "how many blocks behind best-known head the node can be and still be considered ready (0 = must be at head)")
cmd.Flags().Duration(FlagScrapeInterval, def.Node.ScrapeInterval.Duration, "interval at which the reaper polls the execution layer for new transactions")
cmd.Flags().Duration(FlagSequencerRecovery, def.Node.SequencerRecovery.Duration, "start in sync mode first, catch up from DA/P2P, then switch to aggregator (disaster recovery). Value specifies time to wait for in-flight P2P blocks.")

this should not be a flag, it should be automatic

cmd.Flags().Duration(FlagSequencerRecovery, def.Node.SequencerRecovery.Duration, "start in sync mode first, catch up from DA/P2P, then switch to aggregator (disaster recovery). Value specifies time to wait for in-flight P2P blocks.")
Member


Nit, can we make nodeConfig.Node.SequencerRecovery.Duration and Raft.Enable mutually exclusive? This is a cobra option

node/failover.go Outdated

// sequencerRecoveryElector implements leaderElection for disaster recovery.
// It starts in sync mode (follower), catches up from DA and P2P, then switches to aggregator (leader) mode.
// This is for single-sequencer setups that don't use raft.
Contributor


This feels like a code smell. I don't follow why this needs to be its own flow. It should merely be: do I have a signing key, is the aggregator flag on, and what is the head block (which we can grab from P2P, or by checking where DA sync is relative to the head of the DA chain)?

This feature should be combined with sync vs. follow. They tie together but aren't blocked on each other.

alpe added 2 commits February 17, 2026 10:06
* main:
  refactor: reuse P2P client across node modes (#3065)
  refactor: store pending block separately (#3073)
  chore: Fix mismatched comment in TestCache_WithNilStore function (#3074)
  build(deps): Bump github.com/pion/dtls/v3 from 3.0.6 to 3.0.11 (#3068)
  feat: block Pruning (#2984)
  refactor: omit unnecessary reassignment (#3067)
  refactor: use WaitGroup.Go to simplify code (#3059)
  chore: prep rc.4 (#3062)
  fix(syncing): skip draining when exec client unavailable (#3060)
  build(deps): Bump the all-go group across 4 directories with 5 updates (#3058)


Development

Successfully merging this pull request may close these issues.

Allow running a sequencer node that syncs from DA first in full node mode

3 participants