fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration by basvandijk · Pull Request #9242 · dfinity/ic

basvandijk · 2026-03-07T20:25:48Z

Root Cause

The integration tests in registry_replicator_integration used sleep(poll_delay + 200ms) and assumed the background poller had completed within that window. Under CI load, the 200ms leeway was insufficient, causing multiple failure modes:

assert_eq left=3 right=4 — The background poll hadn't updated the registry client to the expected version within the sleep window.
get_latest_certified_time > time_after_init — The poll hadn't completed to update the certified time within the sleep window.
HTTP timeout on get_certified_changes_since — PocketIC HTTP requests timing out under parallel test load.

Fix

Replace timing-dependent assertions with condition-based waits:

get_all_certified_records: Add retry logic (5 attempts with 500ms backoff) to handle transient HTTP timeouts.
wait_for_replicator_version: Polls until the registry client reaches the expected version (30s timeout).
wait_for_not_polling: Polls until is_polling() returns false (30s timeout).
wait_for_certified_time_gt: Polls until the certified time exceeds the given threshold (30s timeout).

Sleep-based assertions are kept only for negative checks (asserting something has NOT happened), where they are safe because the poll interval is 1 second and the operations between mutation and assertion are fast.

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…r_integration Replace timing-dependent assertions with condition-based waits and add retry logic for HTTP calls in the registry replicator integration tests. Root cause: The tests used sleep(poll_delay + 200ms) and assumed the background poller had completed, but under CI load the 200ms leeway was insufficient. Additionally, HTTP requests to PocketIC could time out under parallel test load. Specific failures: polling loop hadn't exited yet within the sleep window - assert_eq left=3 right=4 - the background poll hadn't updated the registry client to the expected version within the sleep window - get_latest_certified_time > time_after_init - the poll hadn't completed to update the certified time within the sleep window - get_certified_changes_since HTTP timeout - PocketIC HTTP requests timing out under parallel test load Changes: - Add retry logic (5 attempts) to get_all_certified_records to handle transient HTTP timeouts - Add wait_for_replicator_version helper that polls until the registry client reaches the expected version - Add wait_for_not_polling helper that polls until is_polling() returns false - Add wait_for_certified_time_gt helper that polls until certified time exceeds the given threshold - Replace sleep+assert patterns with condition-based waits where the test expects a positive outcome (e.g. replicator is up to date) - Keep sleep-based assertions only for negative checks (asserting something has NOT happened), which are safe because the poll interval is 1 second This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

Copilot

Pull request overview

Deflakes the registry_replicator_integration tests by replacing timing-dependent sleep(...) assertions with condition-based waiting, and by adding small retry logic around PocketIC registry queries to better tolerate transient CI/network load.

Changes:

Add retry-with-backoff to PocketIcHelper::get_all_certified_records to handle transient get_certified_changes_since failures/timeouts.
Introduce condition-wait helpers (wait_for_replicator_version, wait_for_not_polling, wait_for_certified_time_gt) with a 30s timeout.
Update integration tests to use the new condition-waits instead of fixed sleeps for positive assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rs/orchestrator/registry_replicator/tests/test.rs

- Only retry on transient RegistryUnreachable errors in get_all_certified_records, fail fast on deterministic errors. Skip backoff sleep on final attempt. - Include current version in wait_for_replicator_version timeout message. - Replace ad-hoc version wait loops with wait_for_replicator_version helper.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-07T20:47:17Z

rs/orchestrator/registry_replicator/tests/test.rs

    let new_record = random_mutate(&pocket_ic, &mut rng).await;

    // Even though, we started polling, we haven't waited for the poll delay yet, so local store
    // and client should still contain the previous state
    assert_replicator_not_up_to_date_yet(&replicator, latest_version, &records, &new_record);

-    tokio::time::sleep(replicator.get_poll_delay() + DELAY_LEEWAY).await;
+    // Wait until background polling picks up the new version
+    wait_for_replicator_version(&replicator, new_record.version).await;


In test_poll_and_start_polling_and_stop_polling..., the negative assertion right after let new_record = random_mutate(...) assumes the background poller cannot complete another poll before the assertion runs. With the new retry/backoff inside get_all_certified_records() (which random_mutate() calls), random_mutate() can now take longer than the 1s TEST_POLL_DELAY, so the background loop in start_polling() may run another poll and update to new_record.version before assert_replicator_not_up_to_date_yet(...) executes. To keep this test deterministic under load, consider removing this negative check, increasing the poll delay for this test, or restructuring the mutation so it doesn’t include potentially-slow retries between the mutation and the assertion.

The two assert_replicator_not_up_to_date_yet calls that run while background polling is active are inherently non-deterministic: with retry logic in get_all_certified_records(), random_mutate() can take longer than the 1s TEST_POLL_DELAY, allowing the background poller to update before the assertion runs. Remove these two assertions. The test still has 4 deterministic negative assertions (before start_polling and after stop_polling) plus positive assertions that background polling eventually picks up changes.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rs/orchestrator/registry_replicator/tests/test.rs

RegistryCanister::get_certified_changes_since maps query failures/timeouts (and missing responses) to Error::UnknownError, so only retrying on RegistryUnreachable would still fail immediately on the PocketIC HTTP timeouts this PR targets.

github-actions bot added the fix label Mar 7, 2026

basvandijk requested a review from Copilot March 7, 2026 20:26

Copilot started reviewing on behalf of basvandijk March 7, 2026 20:26 View session

Copilot AI reviewed Mar 7, 2026

View reviewed changes

rs/orchestrator/registry_replicator/tests/test.rs Show resolved Hide resolved

rs/orchestrator/registry_replicator/tests/test.rs Show resolved Hide resolved

rs/orchestrator/registry_replicator/tests/test.rs Outdated Show resolved Hide resolved

basvandijk requested a review from Copilot March 7, 2026 20:42

Copilot started reviewing on behalf of basvandijk March 7, 2026 20:43 View session

Copilot AI reviewed Mar 7, 2026

View reviewed changes

basvandijk requested a review from Copilot March 7, 2026 21:12

Copilot started reviewing on behalf of basvandijk March 7, 2026 21:13 View session

Copilot AI reviewed Mar 7, 2026

View reviewed changes

rs/orchestrator/registry_replicator/tests/test.rs Outdated Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration#9242

fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration#9242
basvandijk wants to merge 4 commits intomasterfrom
ai/deflake-registry_replicator_integration-2026-03-07

basvandijk commented Mar 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 7, 2026

Uh oh!

basvandijk Mar 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

basvandijk commented Mar 7, 2026

Root Cause

Fix

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

basvandijk Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants