blk-cgroup: fix races related to blkg_list iteration #615

blktests-ci[bot] wants to merge 7 commits into linus-master_base from
Conversation
… blkcg_mutex

blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex,
which can race with blkg_free_workfn() as it removes blkgs from the
list while holding blkcg_mutex. Add blkcg_mutex protection around the
q->blkg_list iteration to prevent potential list corruption or
use-after-free issues.

Signed-off-by: Yu Kuai <yukuai@fnnas.com>
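A minimal sketch of the shape of this fix, for illustration only: the
q->blkcg_mutex field and the q->blkg_list walk exist upstream, but the
loop body here is abbreviated and omits the batching/restart logic of
the real blkg_destroy_all().

static void blkg_destroy_all(struct gendisk *disk)
{
        struct request_queue *q = disk->queue;
        struct blkcg_gq *blkg;

        /* serialize against blkg_free_workfn() dropping list entries */
        mutex_lock(&q->blkcg_mutex);
        spin_lock_irq(&q->queue_lock);
        list_for_each_entry(blkg, &q->blkg_list, q_node) {
                struct blkcg *blkcg = blkg->blkcg;

                if (hlist_unhashed(&blkg->blkcg_node))
                        continue;

                spin_lock(&blkcg->lock);
                blkg_destroy(blkg);
                spin_unlock(&blkcg->lock);
        }
        q->root_blkg = NULL;
        spin_unlock_irq(&q->queue_lock);
        mutex_unlock(&q->blkcg_mutex);
}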
…mutex

bfq_end_wr_async() iterates q->blkg_list while holding only bfqd->lock,
not blkcg_mutex. This can race with blkg_free_workfn(), which removes
blkgs from the list while holding blkcg_mutex. Take blkcg_mutex in
bfq_end_wr() before taking bfqd->lock to ensure proper synchronization
when iterating q->blkg_list.

Signed-off-by: Yu Kuai <yukuai@fnnas.com>
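An illustrative sketch of the lock ordering only; the weight-raising
loop bodies are elided, and bfqd->queue refers to the owning
request_queue as in the upstream bfq code.

static void bfq_end_wr(struct bfq_data *bfqd)
{
        /* blkcg_mutex pins q->blkg_list for bfq_end_wr_async() below */
        mutex_lock(&bfqd->queue->blkcg_mutex);
        spin_lock_irq(&bfqd->lock);

        /* ... end weight-raising for active and idle bfq_queues ... */
        bfq_end_wr_async(bfqd);         /* iterates q->blkg_list */

        spin_unlock_irq(&bfqd->lock);
        mutex_unlock(&bfqd->queue->blkcg_mutex);
}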
When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg
deletion, leading to use-after-free and memory leak issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
- Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
- Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
- Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
- blkcgA is deleted concurrently
- blkg1 is freed via blkg_free_workfn()
- blkg1->pd is freed

T1 (continued):
- Rollback path accesses blkg1->pd->online after pd is freed
- Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring the pd
lifecycle is synchronized with blkg list visibility.

Link: https://lore.kernel.org/all/20260108014416.3656493-3-zhengqixing@huaweicloud.com/
Fixes: f1c006f ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
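A heavily abbreviated sketch of the serialization idea: the real
blkcg_activate_policy() also freezes the queue and runs
pd_init_fn()/pd_online_fn(); only the alloc and rollback paths relevant
to the race are shown here.

int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
{
        struct request_queue *q = disk->queue;
        struct blkcg_gq *blkg;

        mutex_lock(&q->blkcg_mutex);
        list_for_each_entry(blkg, &q->blkg_list, q_node) {
                struct blkg_policy_data *pd;

                pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_KERNEL);
                if (!pd)
                        goto enomem;
                pd->blkg = blkg;
                pd->plid = pol->plid;
                blkg->pd[pol->plid] = pd;
        }
        mutex_unlock(&q->blkcg_mutex);
        return 0;

enomem:
        /*
         * Rollback is now safe: blkg_free_workfn() also takes
         * q->blkcg_mutex, so neither blkg->pd[] nor q->blkg_list can
         * change underneath this walk.
         */
        list_for_each_entry(blkg, &q->blkg_list, q_node) {
                if (blkg->pd[pol->plid]) {
                        pol->pd_free_fn(blkg->pd[pol->plid]);
                        blkg->pd[pol->plid] = NULL;
                }
        }
        mutex_unlock(&q->blkcg_mutex);
        return -ENOMEM;
}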
When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs():
T1:                                     T2:
                                        blkg_destroy
                                         kill(&blkg->refcnt) // blkg->refcnt=1->0
                                         blkg_release // call_rcu(__blkg_release)
                                        ...
                                        blkg_free_workfn
                                         ->pd_free_fn(pd)
elv_iosched_store
 elevator_switch
  ...
  iterate blkg list
   blkg_get(blkg) // blkg->refcnt=0->1
                                        list_del_init(&blkg->q_node)
   blkg_put(pinned_blkg) // blkg->refcnt=1->0
    blkg_release // call_rcu again
     rcu_accelerate_cbs // uaf
Fix this by checking hlist_unhashed(&blkg->blkcg_node) before getting
a reference to the blkg. This is the same check used in blkg_destroy()
to detect if a blkg has already been destroyed. If the blkg is already
unhashed, skip processing it since it's being destroyed.
Link: https://lore.kernel.org/all/20260108014416.3656493-4-zhengqixing@huaweicloud.com/
Fixes: f1c006f ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
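The guard, sketched in the style of the pinning loop in
blkcg_activate_policy() (fragment only; the per-blkg pd allocation that
follows is elided):

        list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
                /*
                 * Skip blkgs that blkg_destroy() already unhashed:
                 * their refcount may have hit zero and blkg_release()
                 * may already be queued, so blkg_get() here would end
                 * in a second call_rcu() on the same rcu_head.
                 */
                if (hlist_unhashed(&blkg->blkcg_node))
                        continue;

                if (pinned_blkg != blkg) {
                        if (pinned_blkg)
                                blkg_put(pinned_blkg);
                        blkg_get(blkg);
                        pinned_blkg = blkg;
                }
                /* ... pd allocation for this blkg continues ... */
        }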
Move the teardown sequence which offlines and frees per-policy
blkg_policy_data (pd) into a helper for readability. No functional
change intended.

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
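For illustration, the factored-out teardown might look like the
following; the helper name blkg_pd_offline_and_free() is hypothetical,
but the loop mirrors the existing per-policy teardown in
blkg_free_workfn().

/* hypothetical helper name, for illustration */
static void blkg_pd_offline_and_free(struct blkcg_gq *blkg)
{
        int i;

        for (i = 0; i < BLKCG_MAX_POLS; i++) {
                struct blkcg_policy *pol = blkcg_policy[i];
                struct blkg_policy_data *pd = blkg->pd[i];

                if (!pd)
                        continue;
                if (pd->online && pol->pd_offline_fn)
                        pol->pd_offline_fn(pd);
                pd->online = false;
                pol->pd_free_fn(pd);
                blkg->pd[i] = NULL;
        }
}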
…cy()

Some policies, like iocost and iolatency, perform percpu allocation in
pd_alloc_fn(). Percpu allocation with the queue frozen can cause
deadlock, because percpu memory reclaim may issue IO. Now that
q->blkg_list is protected by blkcg_mutex, restructure
blkcg_activate_policy() to allocate all pds before freezing the queue:

1. Allocate all pds with GFP_KERNEL before freezing the queue
2. Freeze the queue
3. Initialize and online all pds

Note: future work is to remove all queue freezing before
blkcg_activate_policy() to fix the deadlocks thoroughly.

Signed-off-by: Yu Kuai <yukuai@fnnas.com>
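A sketch of the restructured flow (fragment; the enomem rollback and
error handling are elided, and the memflags-returning
blk_mq_freeze_queue() form matches recent kernels):

        mutex_lock(&q->blkcg_mutex);    /* pins q->blkg_list */

        /* 1) allocate all pds with GFP_KERNEL, queue not frozen */
        list_for_each_entry(blkg, &q->blkg_list, q_node) {
                blkg->pd[pol->plid] = pol->pd_alloc_fn(disk, blkg->blkcg,
                                                       GFP_KERNEL);
                if (!blkg->pd[pol->plid])
                        goto enomem;
        }

        /* 2) freeze only after every allocation has succeeded */
        memflags = blk_mq_freeze_queue(q);

        /* 3) initialize and online all pds under the frozen queue */
        list_for_each_entry(blkg, &q->blkg_list, q_node) {
                if (pol->pd_init_fn)
                        pol->pd_init_fn(blkg->pd[pol->plid]);
                if (pol->pd_online_fn)
                        pol->pd_online_fn(blkg->pd[pol->plid]);
                blkg->pd[pol->plid]->online = true;
        }

        blk_mq_unfreeze_queue(q, memflags);
        mutex_unlock(&q->blkcg_mutex);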
The current rq_qos_mutex handling has an awkward pattern: callers must
acquire the mutex before calling rq_qos_add()/rq_qos_del(), and
blkg_conf_open_bdev_frozen() had to release and re-acquire the mutex
around queue freezing to maintain the proper locking order (freeze the
queue before taking the mutex). On the other hand, with rq_qos_mutex
held after blkg_conf_prep(), there are many possible deadlocks:

- allocating memory with GFP_KERNEL, like blk_throtl_init();
- allocating percpu memory, like pd_alloc_fn() for iocost/iolatency.

This patch refactors the locking by:

1. Moving queue freeze and rq_qos_mutex acquisition inside
   rq_qos_add()/rq_qos_del(), with the correct order: freeze first,
   then acquire the mutex.
2. Removing external mutex handling from wbt_init(), since rq_qos_add()
   now handles it internally.
3. Removing rq_qos_mutex handling from blkg_conf_open_bdev() entirely,
   making it responsible only for parsing MAJ:MIN and opening the bdev.
4. Removing the blkg_conf_open_bdev_frozen() and blkg_conf_exit_frozen()
   functions, which are no longer needed.
5. Updating ioc_qos_write() to use the simpler blkg_conf_open_bdev()
   and blkg_conf_exit() functions.

This eliminates the release-and-reacquire pattern and makes
rq_qos_add()/rq_qos_del() self-contained, which is cleaner and reduces
complexity. Each function now properly manages its own locking with the
correct order: queue freeze → mutex acquire → modify → mutex release →
queue unfreeze.

Signed-off-by: Yu Kuai <yukuai@fnnas.com>
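A sketch of a self-contained rq_qos_add() after this change, following
the ordering the patch describes (debugfs registration and rq_qos_del()
are analogous and elided):

int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk,
               enum rq_qos_id id, const struct rq_qos_ops *ops)
{
        struct request_queue *q = disk->queue;
        unsigned int memflags;
        int ret = 0;

        memflags = blk_mq_freeze_queue(q);      /* 1) freeze queue */
        mutex_lock(&q->rq_qos_mutex);           /* 2) then take mutex */

        if (rq_qos_id(q, id)) {                 /* id already registered */
                ret = -EBUSY;
                goto out;
        }

        rqos->disk = disk;
        rqos->id = id;
        rqos->ops = ops;
        rqos->next = q->rq_qos;
        q->rq_qos = rqos;

out:
        mutex_unlock(&q->rq_qos_mutex);         /* 3) release mutex */
        blk_mq_unfreeze_queue(q, memflags);     /* 4) then unfreeze */
        return ret;
}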
Pull request for series with
subject: blk-cgroup: fix races related to blkg_list iteration
version: 3
url: https://patchwork.kernel.org/project/linux-block/list/?series=1061077