HDDS-14768. Fix lock leak during snapshot cache cleanup and handle eviction race appropriately. by SaketaChalamchala · Pull Request #9869 · apache/ozone

SaketaChalamchala · 2026-03-05T21:23:11Z

What changes were proposed in this pull request?

Currently,

Eviction race: SnapshotCache cleanup throws an IllegalStateException when it finds stale entries in pendingEvictionQueue for snapshots that have already been removed from dbMap.
For example, SnapshotPurge invalidates the entry just before the last thread holding a reference to the snapshot closes it, which adds the snapshotID back to the eviction queue.
Inconsistent bookkeeping: invalidate removes the snapshot entry from dbMap but does not remove it from pendingEvictionQueue if it is present there.
Potential snapshot leak: A snapshot close failure during cleanup removes the snapshotID from the eviction queue and throws an exception. The snapshot then remains in the cache even if refCount = 0, and its entry stays in dbMap unless some other thread explicitly invalidates it or references it again. During this time SnapshotCache.lock() cannot hold the write lock, because lock() expects the cache to be drained.
Write lock leak: If the cache drain cleanup(true) throws an exception, the write lock is not released in SnapshotCache.lock().

Proposed solution:

Handle the eviction race appropriately: log the stale snapshot entry in the eviction queue and remove it from the queue.
Remove the snapshot entry from the eviction queue upon successful invalidation.
Log snapshot close failures during cleanup but do not remove the snapshot from the eviction queue, so that its cleanup can be retried later.
Catch any unchecked exception during cleanup and release the write lock in SnapshotCache.lock().
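
To illustrate the first three points, here is a minimal, single-threaded sketch of the intended bookkeeping. It is not the code in this PR: `EvictionSketch`, `Entry`, and the `failClose` hook are hypothetical names made up for the example, and only `dbMap` and `pendingEvictionQueue` mirror names used in the description above.

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified stand-in for SnapshotCache; not the Ozone implementation. */
class EvictionSketch {

  /** Hypothetical cache entry whose close() can be made to fail. */
  static final class Entry {
    boolean failClose; // hook to simulate a close failure
    boolean closed;

    void close() {
      if (failClose) {
        throw new IllegalStateException("simulated close failure");
      }
      closed = true;
    }
  }

  final Map<UUID, Entry> dbMap = new ConcurrentHashMap<>();
  final Set<UUID> pendingEvictionQueue = ConcurrentHashMap.newKeySet();

  /** On successful invalidation, drop the ID from both structures. */
  void invalidate(UUID id) {
    Entry entry = dbMap.get(id);
    if (entry == null) {
      return;
    }
    entry.close(); // may throw; the bookkeeping below runs only on success
    dbMap.remove(id);
    pendingEvictionQueue.remove(id);
  }

  /** Returns the number of entries whose close failed and that stay queued. */
  int cleanup() {
    int failures = 0;
    // Iteration over a ConcurrentHashMap key set tolerates removals.
    for (UUID id : pendingEvictionQueue) {
      Entry entry = dbMap.get(id);
      if (entry == null) {
        // Stale entry: log it and drop it instead of throwing.
        System.err.println("Dropping stale eviction entry " + id);
        pendingEvictionQueue.remove(id);
        continue;
      }
      try {
        entry.close();
        dbMap.remove(id);
        pendingEvictionQueue.remove(id);
      } catch (RuntimeException e) {
        // Keep the ID queued so a later cleanup can retry the close.
        System.err.println("Close failed for " + id + ", will retry: " + e);
        failures++;
      }
    }
    return failures;
  }
}
```

In this sketch a failed close leaves the ID in both `dbMap` and `pendingEvictionQueue`, so the next `cleanup()` call sees it again, while an ID with no `dbMap` entry is simply logged and dropped.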

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14768


How was this patch tested?

Unit tests.


Copilot

Pull request overview

Fixes SnapshotCache eviction/cleanup edge cases that could previously throw during cleanup, leak the snapshot DB write lock, or leave stale eviction entries behind—improving correctness and reliability of snapshot purge / checkpoint coordination in Ozone Manager.

Changes:

Remove snapshot IDs from the pending eviction queue on invalidate, and tolerate stale eviction entries during cleanup.
Ensure SnapshotCache write lock is released if cleanup throws (including unchecked exceptions).
Adjust cleanup behavior so snapshot close failures are logged and retried later, with added unit tests covering these races/failures.
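
The write-lock point can be sketched as follows. This is an illustration of the general pattern only, not the PR's code: `LockSketch` and its `drain` callback (standing in for `cleanup(true)`) are hypothetical, and returning `false` on failure is an assumption made for the example; the actual `SnapshotCache.lock()` may report the failure differently.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: release the write lock if draining the cache throws. */
class LockSketch {
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
  private final Runnable drain; // hypothetical stand-in for cleanup(true)

  LockSketch(Runnable drain) {
    this.drain = drain;
  }

  /**
   * Returns true with the write lock held if the drain succeeded.
   * Returns false with the lock released if the drain threw.
   */
  boolean lock() {
    rwLock.writeLock().lock();
    boolean drained = false;
    try {
      drain.run();
      drained = true;
    } catch (RuntimeException e) {
      // Unchecked exceptions from the drain must not leak the lock.
      System.err.println("Cache drain failed, releasing write lock: " + e);
    } finally {
      if (!drained) {
        rwLock.writeLock().unlock();
      }
    }
    return drained;
  }

  void unlock() {
    rwLock.writeLock().unlock();
  }

  boolean isWriteLocked() {
    return rwLock.isWriteLocked();
  }
}
```

The `finally` block releases the lock on every path where the drain did not complete, so a later caller is not blocked by a lock that no one will ever release.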

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java	Updates eviction bookkeeping, cleanup behavior on stale/missing entries and close failures, and hardens lock() to release locks on exceptions.
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java	Adds unit tests for stale-eviction cleanup, close-failure retry behavior, and write-lock release when cleanup throws.

...ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotCache.java

Copilot · 2026-03-09T03:05:22Z