Ldz/fix resolve lock0405 by lidezhu · Pull Request #4728 · pingcap/ticdc

lidezhu · 2026-04-05T13:33:16Z

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

Tests
- Added unit tests for transaction lock resolution retry logic, including scenarios for handling full pages with live locks and resolved tail locks.

ti-chi-bot · 2026-04-05T13:33:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kennytm for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-05T13:33:31Z

📝 Walkthrough

Walkthrough

The lock resolution mechanism is enhanced with a nextScanLockKey helper function that determines the next scan position based on lock count and timing information (msBeforeExpired). The control flow now conditionally retries with backoff when msBeforeExpired > 0, or computes the next key for continued scanning. Unit tests validate the new behavior.

Changes

Cohort / File(s)	Summary
Lock resolver implementation `logservice/txnutil/lock_resolver.go`	Added `nextScanLockKey` helper function to determine next scan key and retry logic based on lock count and `msBeforeExpired` timing. Refactored control flow to use new backoff path (`BackoffWithMaxSleepTxnLockFast`) when `msBeforeExpired > 0`, or advance to next key when limit reached.
Lock resolver tests `logservice/txnutil/lock_resolver_test.go`	Added two unit tests for `nextScanLockKey`: one validating retry behavior with live locks at boundary, another testing key advancement past resolved tail lock.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

hongyunyan
wlwilliamx

Poem

🐰 A rabbit hops through locks so tight,
With timing checks and retry flight,
nextScanLockKey finds the way,
Past frozen keys, no more delay,
The resolver hops from test to test! ✨

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description uses the template but is largely empty; critical sections like 'What is changed and how it works?' contain no content, and the Issue Number field has an unresolved placeholder.	Fill in the PR description with actual problem details, implementation explanation, and link to a real issue number instead of the 'close `#xxx`' placeholder.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Ldz/fix resolve lock0405' is vague and uses a branch name format rather than a descriptive summary; it doesn't clearly convey what the fix addresses.	Use a descriptive title that clearly explains the fix, e.g., 'Improve lock resolution with backoff strategy and next key computation' or similar.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ldz/fix-resolve-lock0405

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

This reverts commit 9999bcb.

This reverts commit d408525.

gemini-code-assist

Code Review

This pull request removes several performance metrics from the logpuller package to simplify the code and reduce overhead. It also refactors the lock resolution logic in txnutil/lock_resolver.go by introducing a nextScanLockKey helper and updating the Resolve loop to handle retries when locks remain unexpired. Feedback focuses on the new lock resolution logic: specifically, ensuring that the retry condition for unexpired locks is prioritized over batch size checks, avoiding the reset of the backoffer state within the loop to maintain timeout limits, and preventing duplicate lock entries in the totalLocks slice during retries.

gemini-code-assist · 2026-04-05T13:35:29Z

logservice/txnutil/lock_resolver.go

+func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
+	if len(locks) < scanLockLimit {
+		return endKey, false
+	}
+	if msBeforeExpired > 0 {
+		return nil, true
+	}
+	return tikvkv.NextKey(locks[len(locks)-1].Key), false
+}


The current implementation of nextScanLockKey only retries if the batch is full (len(locks) == scanLockLimit). If a batch is partially full but contains unexpired locks (msBeforeExpired > 0), it returns shouldRetry = false, causing the resolution to finish prematurely while locks still exist in the region. This could prevent the resolved TS from advancing correctly. The check for msBeforeExpired > 0 should be prioritized to ensure all locks are resolved regardless of the batch size.

Suggested change

func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {

if len(locks) < scanLockLimit {

return endKey, false

}

if msBeforeExpired > 0 {

return nil, true

}

return tikvkv.NextKey(locks[len(locks)-1].Key), false

}

func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {

if msBeforeExpired > 0 {

return nil, true

}

if len(locks) < scanLockLimit {

return endKey, false

}

return tikvkv.NextKey(locks[len(locks)-1].Key), false

}

gemini-code-assist · 2026-04-05T13:35:29Z

logservice/txnutil/lock_resolver.go

+			if err != nil {
+				return errors.Trace(err)
+			}
+			bo = tikv.NewGcResolveLockMaxBackoffer(ctx)


Resetting the backoffer bo here (and at line 174) is problematic because it clears the cumulative backoff state (such as total sleep time and retry counts) for the region resolution operation. This can lead to excessive retries and bypass the intended timeout limits. The backoffer should be initialized once outside the loop and reused for all batches and retries within the same Resolve call.

Suggested change

bo = tikv.NewGcResolveLockMaxBackoffer(ctx)

continue

gemini-code-assist · 2026-04-05T13:35:29Z

logservice/txnutil/lock_resolver.go

+		nextKey, shouldRetry := nextScanLockKey(locks, loc.EndKey, msBeforeExpired)
+		if shouldRetry {


When shouldRetry is true, the loop continues without updating key, which is correct for retrying the same batch. However, because totalLocks = append(totalLocks, locks...) is called at line 150 (before the retry check), each retry will append the same (or similar) set of locks to totalLocks. This results in duplicate entries and an incorrect lockCount in the final log message. Consider moving the append logic to only occur when the scan is successfully moving forward.

ti-chi-bot · 2026-04-05T13:40:00Z

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

_{📖 For more info, you can check the "Contribute Code" section in the development guide.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@logservice/txnutil/lock_resolver.go`:
- Around line 48-55: The function nextScanLockKey currently returns endKey when
len(locks) < scanLockLimit before checking msBeforeExpired, causing premature
exit even if locks are still expired and a retry is required; fix by reordering
checks in nextScanLockKey: first if msBeforeExpired > 0 return (nil, true) to
signal caller to retry/sleep, then if len(locks) < scanLockLimit return (endKey,
false), else return (tikvkv.NextKey(locks[len(locks)-1].Key), false); also add a
regression test covering len(locks) < scanLockLimit && msBeforeExpired > 0 to
ensure Resolve retry behavior remains correct.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5da9ae1-b0d0-41cb-8933-1e078c71a6ca

📥 Commits

Reviewing files that changed from the base of the PR and between 165df37 and 11b804c.

📒 Files selected for processing (2)

logservice/txnutil/lock_resolver.go
logservice/txnutil/lock_resolver_test.go

coderabbitai · 2026-04-05T13:46:50Z

logservice/txnutil/lock_resolver.go

+func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
+	if len(locks) < scanLockLimit {
+		return endKey, false
+	}
+	if msBeforeExpired > 0 {
+		return nil, true
+	}
+	return tikvkv.NextKey(locks[len(locks)-1].Key), false


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What do the docs for github.com/tikv/client-go/v2/txnkv/txnlock.(*LockResolver).ResolveLocks say callers should do when not all scanned locks are expired/resolved, and does that retry requirement depend on the current page size?

💡 Result:

The documentation for github.com/tikv/client-go/v2/txnkv/txnlock.(*LockResolver).ResolveLocks indicates that when not all scanned locks are expired or resolved, the caller should retry the operation. This is based on the method's signature ResolveLocks(bo *Backoffer, locks []*Lock) (ok bool, err error), where the bool return value 'ok' is true only if all locks are successfully resolved (expired locks cleaned up and non-expired ones confirmed via txn status). If ok is false, it means some locks remain unresolvable (e.g., not expired), and the Backoffer handles retries via backoff. This retry requirement does not depend on the current page size; page size affects batching/scanning but the resolution logic checks all provided locks independently of batch size. The TiKV resolve lock process scans in batches (e.g., RESOLVE_LOCK_BATCH_SIZE) to avoid huge writes, but client-go's ResolveLocks handles a list of locks and returns partial success via the bool.

Citations:

1: https://pkg.go.dev/github.com/tikv/client-go/v2/txnkv

2: https://pkg.go.dev/github.com/pingcap/tidb/store/tikv

3: https://pingcap.github.io/tidb-dev-guide/understand-tidb/lock-resolver.html

Reorder checks to ensure retry when locks remain unresolved, regardless of page size.

The current code returns endKey, false when len(locks) < scanLockLimit, which exits the scan loop prematurely. However, if msBeforeExpired > 0, locks are still unresolved and the caller must retry—this requirement is independent of page size. Returning endKey when unresolved locks remain can cause Resolve to exit the loop early without the required retry/sleep, violating the ResolveLocks contract. Check msBeforeExpired > 0 first (return nil, true to signal retry), then check the page length. Add a regression test for the case where len(locks) < scanLockLimit && msBeforeExpired > 0 to ensure the loop persists correctly.

Minimal fix

func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) { - if len(locks) < scanLockLimit { - return endKey, false - } if msBeforeExpired > 0 { return nil, true } + if len(locks) < scanLockLimit { + return endKey, false + } return tikvkv.NextKey(locks[len(locks)-1].Key), false }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {

if len(locks) < scanLockLimit {

return endKey, false

}

if msBeforeExpired > 0 {

return nil, true

}

return tikvkv.NextKey(locks[len(locks)-1].Key), false

func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {

if msBeforeExpired > 0 {

return nil, true

}

if len(locks) < scanLockLimit {

return endKey, false

}

return tikvkv.NextKey(locks[len(locks)-1].Key), false

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@logservice/txnutil/lock_resolver.go` around lines 48 - 55, The function nextScanLockKey currently returns endKey when len(locks) < scanLockLimit before checking msBeforeExpired, causing premature exit even if locks are still expired and a retry is required; fix by reordering checks in nextScanLockKey: first if msBeforeExpired > 0 return (nil, true) to signal caller to retry/sleep, then if len(locks) < scanLockLimit return (endKey, false), else return (tikvkv.NextKey(locks[len(locks)-1].Key), false); also add a regression test covering len(locks) < scanLockLimit && msBeforeExpired > 0 to ensure Resolve retry behavior remains correct.

lidezhu added 3 commits April 5, 2026 16:06

remove unnecessary metrics

d408525

f

9999bcb

f

33babc8

ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 5, 2026

ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 5, 2026

lidezhu added 2 commits April 5, 2026 21:33

Revert "f"

5d8e75d

This reverts commit 9999bcb.

Revert "remove unnecessary metrics"

11b804c

This reverts commit d408525.

ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2026

gemini-code-assist bot reviewed Apr 5, 2026

View reviewed changes

coderabbitai bot reviewed Apr 5, 2026

View reviewed changes

lidezhu closed this Apr 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ldz/fix resolve lock0405#4728

Ldz/fix resolve lock0405#4728
lidezhu wants to merge 5 commits intomasterfrom
ldz/fix-resolve-lock0405

lidezhu commented Apr 5, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

ti-chi-bot bot commented Apr 5, 2026

Uh oh!

coderabbitai bot commented Apr 5, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Apr 5, 2026

Uh oh!

gemini-code-assist bot Apr 5, 2026

Uh oh!

gemini-code-assist bot Apr 5, 2026

Uh oh!

ti-chi-bot bot commented Apr 5, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Apr 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		nextKey, shouldRetry := nextScanLockKey(locks, loc.EndKey, msBeforeExpired)
		if shouldRetry {

Conversation

lidezhu commented Apr 5, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot bot commented Apr 5, 2026

Uh oh!

coderabbitai bot commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot bot commented Apr 5, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lidezhu commented Apr 5, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 5, 2026 •

edited

Loading