Skip to content

fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration#9242

Draft
basvandijk wants to merge 4 commits intomasterfrom
ai/deflake-registry_replicator_integration-2026-03-07
Draft

fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration#9242
basvandijk wants to merge 4 commits intomasterfrom
ai/deflake-registry_replicator_integration-2026-03-07

Conversation

@basvandijk
Copy link
Collaborator

Root Cause

The integration tests in registry_replicator_integration used sleep(poll_delay + 200ms) and assumed the background poller had completed within that window. Under CI load, the 200ms leeway was insufficient, causing multiple failure modes:

  • assert_eq left=3 right=4 — The background poll hadn't updated the registry client to the expected version within the sleep window.
  • get_latest_certified_time > time_after_init — The poll hadn't completed to update the certified time within the sleep window.
  • HTTP timeout on get_certified_changes_since — PocketIC HTTP requests timing out under parallel test load.

Fix

Replace timing-dependent assertions with condition-based waits:

  • get_all_certified_records: Add retry logic (5 attempts with 500ms backoff) to handle transient HTTP timeouts.
  • wait_for_replicator_version: Polls until the registry client reaches the expected version (30s timeout).
  • wait_for_not_polling: Polls until is_polling() returns false (30s timeout).
  • wait_for_certified_time_gt: Polls until the certified time exceeds the given threshold (30s timeout).

Sleep-based assertions are kept only for negative checks (asserting something has NOT happened), where they are safe because the poll interval is 1 second and the operations between mutation and assertion are fast.


This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…r_integration

Replace timing-dependent assertions with condition-based waits and add
retry logic for HTTP calls in the registry replicator integration tests.

Root cause: The tests used sleep(poll_delay + 200ms) and assumed the
background poller had completed, but under CI load the 200ms leeway was
insufficient. Additionally, HTTP requests to PocketIC could time out under
parallel test load.

Specific failures:
  polling loop hadn't exited yet within the sleep window
- assert_eq left=3 right=4 - the background poll hadn't updated the
  registry client to the expected version within the sleep window
- get_latest_certified_time > time_after_init - the poll hadn't completed
  to update the certified time within the sleep window
- get_certified_changes_since HTTP timeout - PocketIC HTTP requests timing
  out under parallel test load

Changes:
- Add retry logic (5 attempts) to get_all_certified_records to handle
  transient HTTP timeouts
- Add wait_for_replicator_version helper that polls until the registry
  client reaches the expected version
- Add wait_for_not_polling helper that polls until is_polling() returns
  false
- Add wait_for_certified_time_gt helper that polls until certified time
  exceeds the given threshold
- Replace sleep+assert patterns with condition-based waits where the test
  expects a positive outcome (e.g. replicator is up to date)
- Keep sleep-based assertions only for negative checks (asserting something
  has NOT happened), which are safe because the poll interval is 1 second

This PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Deflakes the registry_replicator_integration tests by replacing timing-dependent sleep(...) assertions with condition-based waiting, and by adding small retry logic around PocketIC registry queries to better tolerate transient CI/network load.

Changes:

  • Add retry-with-backoff to PocketIcHelper::get_all_certified_records to handle transient get_certified_changes_since failures/timeouts.
  • Introduce condition-wait helpers (wait_for_replicator_version, wait_for_not_polling, wait_for_certified_time_gt) with a 30s timeout.
  • Update integration tests to use the new condition-waits instead of fixed sleeps for positive assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Only retry on transient RegistryUnreachable errors in get_all_certified_records,
  fail fast on deterministic errors. Skip backoff sleep on final attempt.
- Include current version in wait_for_replicator_version timeout message.
- Replace ad-hoc version wait loops with wait_for_replicator_version helper.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 429 to +436
let new_record = random_mutate(&pocket_ic, &mut rng).await;

// Even though, we started polling, we haven't waited for the poll delay yet, so local store
// and client should still contain the previous state
assert_replicator_not_up_to_date_yet(&replicator, latest_version, &records, &new_record);

tokio::time::sleep(replicator.get_poll_delay() + DELAY_LEEWAY).await;
// Wait until background polling picks up the new version
wait_for_replicator_version(&replicator, new_record.version).await;
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In test_poll_and_start_polling_and_stop_polling..., the negative assertion right after let new_record = random_mutate(...) assumes the background poller cannot complete another poll before the assertion runs. With the new retry/backoff inside get_all_certified_records() (which random_mutate() calls), random_mutate() can now take longer than the 1s TEST_POLL_DELAY, so the background loop in start_polling() may run another poll and update to new_record.version before assert_replicator_not_up_to_date_yet(...) executes. To keep this test deterministic under load, consider removing this negative check, increasing the poll delay for this test, or restructuring the mutation so it doesn’t include potentially-slow retries between the mutation and the assertion.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

The two assert_replicator_not_up_to_date_yet calls that run while
background polling is active are inherently non-deterministic: with
retry logic in get_all_certified_records(), random_mutate() can take
longer than the 1s TEST_POLL_DELAY, allowing the background poller to
update before the assertion runs.

Remove these two assertions. The test still has 4 deterministic negative
assertions (before start_polling and after stop_polling) plus positive
assertions that background polling eventually picks up changes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

RegistryCanister::get_certified_changes_since maps query failures/timeouts
(and missing responses) to Error::UnknownError, so only retrying on
RegistryUnreachable would still fail immediately on the PocketIC HTTP
timeouts this PR targets.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants