fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration#9242
fix: deflake //rs/orchestrator/registry_replicator:registry_replicator_integration#9242basvandijk wants to merge 4 commits intomasterfrom
Conversation
…r_integration Replace timing-dependent assertions with condition-based waits and add retry logic for HTTP calls in the registry replicator integration tests. Root cause: The tests used sleep(poll_delay + 200ms) and assumed the background poller had completed, but under CI load the 200ms leeway was insufficient. Additionally, HTTP requests to PocketIC could time out under parallel test load. Specific failures: polling loop hadn't exited yet within the sleep window - assert_eq left=3 right=4 - the background poll hadn't updated the registry client to the expected version within the sleep window - get_latest_certified_time > time_after_init - the poll hadn't completed to update the certified time within the sleep window - get_certified_changes_since HTTP timeout - PocketIC HTTP requests timing out under parallel test load Changes: - Add retry logic (5 attempts) to get_all_certified_records to handle transient HTTP timeouts - Add wait_for_replicator_version helper that polls until the registry client reaches the expected version - Add wait_for_not_polling helper that polls until is_polling() returns false - Add wait_for_certified_time_gt helper that polls until certified time exceeds the given threshold - Replace sleep+assert patterns with condition-based waits where the test expects a positive outcome (e.g. replicator is up to date) - Keep sleep-based assertions only for negative checks (asserting something has NOT happened), which are safe because the poll interval is 1 second This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.
There was a problem hiding this comment.
Pull request overview
Deflakes the registry_replicator_integration tests by replacing timing-dependent sleep(...) assertions with condition-based waiting, and by adding small retry logic around PocketIC registry queries to better tolerate transient CI/network load.
Changes:
- Add retry-with-backoff to
PocketIcHelper::get_all_certified_recordsto handle transientget_certified_changes_sincefailures/timeouts. - Introduce condition-wait helpers (
wait_for_replicator_version,wait_for_not_polling,wait_for_certified_time_gt) with a 30s timeout. - Update integration tests to use the new condition-waits instead of fixed sleeps for positive assertions.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Only retry on transient RegistryUnreachable errors in get_all_certified_records, fail fast on deterministic errors. Skip backoff sleep on final attempt. - Include current version in wait_for_replicator_version timeout message. - Replace ad-hoc version wait loops with wait_for_replicator_version helper.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let new_record = random_mutate(&pocket_ic, &mut rng).await; | ||
|
|
||
| // Even though, we started polling, we haven't waited for the poll delay yet, so local store | ||
| // and client should still contain the previous state | ||
| assert_replicator_not_up_to_date_yet(&replicator, latest_version, &records, &new_record); | ||
|
|
||
| tokio::time::sleep(replicator.get_poll_delay() + DELAY_LEEWAY).await; | ||
| // Wait until background polling picks up the new version | ||
| wait_for_replicator_version(&replicator, new_record.version).await; |
There was a problem hiding this comment.
In test_poll_and_start_polling_and_stop_polling..., the negative assertion right after let new_record = random_mutate(...) assumes the background poller cannot complete another poll before the assertion runs. With the new retry/backoff inside get_all_certified_records() (which random_mutate() calls), random_mutate() can now take longer than the 1s TEST_POLL_DELAY, so the background loop in start_polling() may run another poll and update to new_record.version before assert_replicator_not_up_to_date_yet(...) executes. To keep this test deterministic under load, consider removing this negative check, increasing the poll delay for this test, or restructuring the mutation so it doesn’t include potentially-slow retries between the mutation and the assertion.
The two assert_replicator_not_up_to_date_yet calls that run while background polling is active are inherently non-deterministic: with retry logic in get_all_certified_records(), random_mutate() can take longer than the 1s TEST_POLL_DELAY, allowing the background poller to update before the assertion runs. Remove these two assertions. The test still has 4 deterministic negative assertions (before start_polling and after stop_polling) plus positive assertions that background polling eventually picks up changes.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
RegistryCanister::get_certified_changes_since maps query failures/timeouts (and missing responses) to Error::UnknownError, so only retrying on RegistryUnreachable would still fail immediately on the PocketIC HTTP timeouts this PR targets.
Root Cause
The integration tests in
registry_replicator_integrationusedsleep(poll_delay + 200ms)and assumed the background poller had completed within that window. Under CI load, the 200ms leeway was insufficient, causing multiple failure modes:assert_eq left=3 right=4— The background poll hadn't updated the registry client to the expected version within the sleep window.get_latest_certified_time > time_after_init— The poll hadn't completed to update the certified time within the sleep window.get_certified_changes_since— PocketIC HTTP requests timing out under parallel test load.Fix
Replace timing-dependent assertions with condition-based waits:
get_all_certified_records: Add retry logic (5 attempts with 500ms backoff) to handle transient HTTP timeouts.wait_for_replicator_version: Polls until the registry client reaches the expected version (30s timeout).wait_for_not_polling: Polls untilis_polling()returns false (30s timeout).wait_for_certified_time_gt: Polls until the certified time exceeds the given threshold (30s timeout).Sleep-based assertions are kept only for negative checks (asserting something has NOT happened), where they are safe because the poll interval is 1 second and the operations between mutation and assertion are fast.
This PR was created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.