Skip to content

Ldz/fix resolve lock0405#4728

Closed
lidezhu wants to merge 5 commits intomasterfrom
ldz/fix-resolve-lock0405
Closed

Ldz/fix resolve lock0405#4728
lidezhu wants to merge 5 commits intomasterfrom
ldz/fix-resolve-lock0405

Conversation

@lidezhu
Copy link
Copy Markdown
Collaborator

@lidezhu lidezhu commented Apr 5, 2026

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Tests
    • Added unit tests for transaction lock resolution retry logic, including scenarios for handling full pages with live locks and resolved tail locks.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 5, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Apr 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kennytm for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 5, 2026

📝 Walkthrough

Walkthrough

The lock resolution mechanism is enhanced with a nextScanLockKey helper function that determines the next scan position based on lock count and timing information (msBeforeExpired). The control flow now conditionally retries with backoff when msBeforeExpired > 0, or computes the next key for continued scanning. Unit tests validate the new behavior.

Changes

Cohort / File(s) Summary
Lock resolver implementation
logservice/txnutil/lock_resolver.go
Added nextScanLockKey helper function to determine next scan key and retry logic based on lock count and msBeforeExpired timing. Refactored control flow to use new backoff path (BackoffWithMaxSleepTxnLockFast) when msBeforeExpired > 0, or advance to next key when limit reached.
Lock resolver tests
logservice/txnutil/lock_resolver_test.go
Added two unit tests for nextScanLockKey: one validating retry behavior with live locks at boundary, another testing key advancement past resolved tail lock.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • hongyunyan
  • wlwilliamx

Poem

🐰 A rabbit hops through locks so tight,
With timing checks and retry flight,
nextScanLockKey finds the way,
Past frozen keys, no more delay,
The resolver hops from test to test! ✨

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description uses the template but is largely empty; critical sections like 'What is changed and how it works?' contain no content, and the Issue Number field has an unresolved placeholder. Fill in the PR description with actual problem details, implementation explanation, and link to a real issue number instead of the 'close #xxx' placeholder.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Ldz/fix resolve lock0405' is vague and uses a branch name format rather than a descriptive summary; it doesn't clearly convey what the fix addresses. Use a descriptive title that clearly explains the fix, e.g., 'Improve lock resolution with backoff strategy and next key computation' or similar.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ldz/fix-resolve-lock0405

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 5, 2026
lidezhu added 2 commits April 5, 2026 21:33
This reverts commit 9999bcb.
@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes several performance metrics from the logpuller package to simplify the code and reduce overhead. It also refactors the lock resolution logic in txnutil/lock_resolver.go by introducing a nextScanLockKey helper and updating the Resolve loop to handle retries when locks remain unexpired. Feedback focuses on the new lock resolution logic: specifically, ensuring that the retry condition for unexpired locks is prioritized over batch size checks, avoiding the reset of the backoffer state within the loop to maintain timeout limits, and preventing duplicate lock entries in the totalLocks slice during retries.

Comment on lines +48 to +56
func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
if len(locks) < scanLockLimit {
return endKey, false
}
if msBeforeExpired > 0 {
return nil, true
}
return tikvkv.NextKey(locks[len(locks)-1].Key), false
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of nextScanLockKey only retries if the batch is full (len(locks) == scanLockLimit). If a batch is partially full but contains unexpired locks (msBeforeExpired > 0), it returns shouldRetry = false, causing the resolution to finish prematurely while locks still exist in the region. This could prevent the resolved TS from advancing correctly. The check for msBeforeExpired > 0 should be prioritized to ensure all locks are resolved regardless of the batch size.

Suggested change
func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
if len(locks) < scanLockLimit {
return endKey, false
}
if msBeforeExpired > 0 {
return nil, true
}
return tikvkv.NextKey(locks[len(locks)-1].Key), false
}
func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
if msBeforeExpired > 0 {
return nil, true
}
if len(locks) < scanLockLimit {
return endKey, false
}
return tikvkv.NextKey(locks[len(locks)-1].Key), false
}

if err != nil {
return errors.Trace(err)
}
bo = tikv.NewGcResolveLockMaxBackoffer(ctx)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Resetting the backoffer bo here (and at line 174) is problematic because it clears the cumulative backoff state (such as total sleep time and retry counts) for the region resolution operation. This can lead to excessive retries and bypass the intended timeout limits. The backoffer should be initialized once outside the loop and reused for all batches and retries within the same Resolve call.

Suggested change
bo = tikv.NewGcResolveLockMaxBackoffer(ctx)
continue

Comment on lines +157 to +158
nextKey, shouldRetry := nextScanLockKey(locks, loc.EndKey, msBeforeExpired)
if shouldRetry {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When shouldRetry is true, the loop continues without updating key, which is correct for retrying the same batch. However, because totalLocks = append(totalLocks, locks...) is called at line 150 (before the retry check), each retry will append the same (or similar) set of locks to totalLocks. This results in duplicate entries and an incorrect lockCount in the final log message. Consider moving the append logic to only occur when the scan is successfully moving forward.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Apr 5, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@logservice/txnutil/lock_resolver.go`:
- Around line 48-55: The function nextScanLockKey currently returns endKey when
len(locks) < scanLockLimit before checking msBeforeExpired, causing premature
exit even if locks are still expired and a retry is required; fix by reordering
checks in nextScanLockKey: first if msBeforeExpired > 0 return (nil, true) to
signal caller to retry/sleep, then if len(locks) < scanLockLimit return (endKey,
false), else return (tikvkv.NextKey(locks[len(locks)-1].Key), false); also add a
regression test covering len(locks) < scanLockLimit && msBeforeExpired > 0 to
ensure Resolve retry behavior remains correct.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5da9ae1-b0d0-41cb-8933-1e078c71a6ca

📥 Commits

Reviewing files that changed from the base of the PR and between 165df37 and 11b804c.

📒 Files selected for processing (2)
  • logservice/txnutil/lock_resolver.go
  • logservice/txnutil/lock_resolver_test.go

Comment on lines +48 to +55
func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
if len(locks) < scanLockLimit {
return endKey, false
}
if msBeforeExpired > 0 {
return nil, true
}
return tikvkv.NextKey(locks[len(locks)-1].Key), false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What do the docs for github.com/tikv/client-go/v2/txnkv/txnlock.(*LockResolver).ResolveLocks say callers should do when not all scanned locks are expired/resolved, and does that retry requirement depend on the current page size?

💡 Result:

The documentation for github.com/tikv/client-go/v2/txnkv/txnlock.(*LockResolver).ResolveLocks indicates that when not all scanned locks are expired or resolved, the caller should retry the operation. This is based on the method's signature ResolveLocks(bo *Backoffer, locks []*Lock) (ok bool, err error), where the bool return value 'ok' is true only if all locks are successfully resolved (expired locks cleaned up and non-expired ones confirmed via txn status). If ok is false, it means some locks remain unresolvable (e.g., not expired), and the Backoffer handles retries via backoff. This retry requirement does not depend on the current page size; page size affects batching/scanning but the resolution logic checks all provided locks independently of batch size. The TiKV resolve lock process scans in batches (e.g., RESOLVE_LOCK_BATCH_SIZE) to avoid huge writes, but client-go's ResolveLocks handles a list of locks and returns partial success via the bool.

Citations:


Reorder checks to ensure retry when locks remain unresolved, regardless of page size.

The current code returns endKey, false when len(locks) < scanLockLimit, which exits the scan loop prematurely. However, if msBeforeExpired > 0, locks are still unresolved and the caller must retry—this requirement is independent of page size. Returning endKey when unresolved locks remain can cause Resolve to exit the loop early without the required retry/sleep, violating the ResolveLocks contract. Check msBeforeExpired > 0 first (return nil, true to signal retry), then check the page length. Add a regression test for the case where len(locks) < scanLockLimit && msBeforeExpired > 0 to ensure the loop persists correctly.

Minimal fix
 func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
-	if len(locks) < scanLockLimit {
-		return endKey, false
-	}
 	if msBeforeExpired > 0 {
 		return nil, true
 	}
+	if len(locks) < scanLockLimit {
+		return endKey, false
+	}
 	return tikvkv.NextKey(locks[len(locks)-1].Key), false
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
if len(locks) < scanLockLimit {
return endKey, false
}
if msBeforeExpired > 0 {
return nil, true
}
return tikvkv.NextKey(locks[len(locks)-1].Key), false
func nextScanLockKey(locks []*txnkv.Lock, endKey []byte, msBeforeExpired int64) ([]byte, bool) {
if msBeforeExpired > 0 {
return nil, true
}
if len(locks) < scanLockLimit {
return endKey, false
}
return tikvkv.NextKey(locks[len(locks)-1].Key), false
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/txnutil/lock_resolver.go` around lines 48 - 55, The function
nextScanLockKey currently returns endKey when len(locks) < scanLockLimit before
checking msBeforeExpired, causing premature exit even if locks are still expired
and a retry is required; fix by reordering checks in nextScanLockKey: first if
msBeforeExpired > 0 return (nil, true) to signal caller to retry/sleep, then if
len(locks) < scanLockLimit return (endKey, false), else return
(tikvkv.NextKey(locks[len(locks)-1].Key), false); also add a regression test
covering len(locks) < scanLockLimit && msBeforeExpired > 0 to ensure Resolve
retry behavior remains correct.

@lidezhu lidezhu closed this Apr 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant