Improve zoned (SMR) HDD write throughput by blktests-ci[bot] · Pull Request #583 · linux-blktests/linux-block

Commit 7b29518 ("block: Do not remove zone write plugs still in use") modified disk_should_remove_zone_wplug() to check the reference count of a zone write plug, so that a plug is not removed from the disk hash table while it is still referenced by in-flight BIOs or requests. However, this check does not account for a BIO completing right after its submission by the zone write plug BIO work, before the BIO work releases its reference on the zone write plug. In this case, disk_should_remove_zone_wplug() returns false because the zone write plug reference count is at least 3. If the BIO that completes in this manner transitioned the zone to the FULL condition, the zone write plug for the FULL zone remains in the disk hash table.

Furthermore, relying on a particular value of the zone write plug reference count to set the BLK_ZONE_WPLUG_UNHASHED flag is fragile: reading the atomic reference count and then comparing it with some value is not atomic as a whole.

Address these issues by reworking the reference counting of zone write plugs so that removing plugs from a disk hash table can be done directly from disk_put_zone_wplug() when the last reference on a plug is dropped. To do so, replace the function disk_remove_zone_wplug() with disk_mark_zone_wplug_dead(). This new function sets the zone write plug flag BLK_ZONE_WPLUG_DEAD (which replaces BLK_ZONE_WPLUG_UNHASHED) and drops the initial reference on the zone write plug taken when the plug was added to the disk hash table. This function is called either for zones that are empty or full, or directly in the case of a forced plug removal (e.g. when the disk hash table is being destroyed on disk removal). With this change, disk_should_remove_zone_wplug() is also removed.

disk_put_zone_wplug() is modified to call the function disk_free_zone_wplug() to remove a zone write plug from a disk hash table and free the plug structure (with a call_rcu()) when the last reference on a zone write plug is dropped. disk_free_zone_wplug() always checks that the BLK_ZONE_WPLUG_DEAD flag is set.

In order to avoid having multiple zone write plugs for the same zone in the disk hash table, disk_get_and_lock_zone_wplug() checked for the BLK_ZONE_WPLUG_UNHASHED flag. This check is removed and a check for the new BLK_ZONE_WPLUG_DEAD flag is added to blk_zone_wplug_handle_write(). With this change, we continue to prevent adding multiple zone write plugs for the same zone and at the same time reinforce checks on the user behavior by failing new incoming write BIOs targeting a zone that is marked as dead. This case can happen only if the user erroneously issues write BIOs to zones that are full, or to zones that are currently being reset or finished.

Fixes: 7b29518 ("block: Do not remove zone write plugs still in use")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
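The lifetime rules described in this commit message can be illustrated with a small userspace model. This is only a sketch in Python, not the kernel code: the class and method names (ZoneWritePlug, Disk, get_or_alloc, put, mark_dead, handle_write) are hypothetical stand-ins, and a single lock replaces the atomic reference count and RCU used by the real implementation.

```python
import threading


class ZoneWritePlug:
    """Hypothetical userspace stand-in for a zone write plug."""

    def __init__(self, zone_no):
        self.zone_no = zone_no
        self.ref = 1        # initial reference, owned by the hash table
        self.dead = False   # models BLK_ZONE_WPLUG_DEAD


class Disk:
    def __init__(self):
        self.lock = threading.Lock()  # assumption: one lock replaces atomics/RCU
        self.hash = {}                # zone number -> plug
        self.freed = []               # zone numbers of freed plugs

    def get_or_alloc(self, zone_no):
        """Look up (or create) the plug of a zone and take a reference."""
        with self.lock:
            plug = self.hash.get(zone_no)
            if plug is None:
                plug = ZoneWritePlug(zone_no)
                self.hash[zone_no] = plug
            plug.ref += 1
            return plug

    def put(self, plug):
        """Drop a reference; the last put unhashes and frees the plug."""
        with self.lock:
            plug.ref -= 1
            if plug.ref > 0:
                return
            assert plug.dead, "last reference dropped on a live plug"
            del self.hash[plug.zone_no]
            self.freed.append(plug.zone_no)

    def mark_dead(self, plug):
        """Set the dead flag once and drop the hash table reference."""
        with self.lock:
            if plug.dead:
                return
            plug.dead = True
        self.put(plug)

    def handle_write(self, zone_no):
        """Return a referenced plug, or None if the write must be failed."""
        plug = self.get_or_alloc(zone_no)
        if plug.dead:
            self.put(plug)
            return None
        return plug
```

In this model, a plug marked dead while the BIO work still holds a reference stays hashed, rejects new writes, and is freed only when that last reference is dropped, without any comparison against a particular reference count value.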

…dule_bio_work()

The function disk_zone_wplug_schedule_bio_work() always takes a reference on the zone write plug of the BIO work being scheduled. This ensures that the zone write plug cannot be freed while the BIO work is being scheduled but has not run yet. However, this unconditional reference taking is fragile since the reference taken is released by the BIO work blk_zone_wplug_bio_work() function, which implies that there always must be a 1:1 relation between the work being scheduled and the work running.

Make sure to drop the reference taken when scheduling the BIO work if the work is already scheduled, that is, when queue_work() returns false.

Fixes: 9e78c38 ("block: Hold a reference on zone write plugs to schedule submission")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
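The reference imbalance fixed here can be shown with a minimal userspace model. The sketch below is Python, not kernel code: Plug, schedule_bio_work, run_bio_work and the drop_ref_if_pending switch are hypothetical, and the local queue_work() only mimics the kernel function's behavior of returning false when the work is already pending.

```python
class Plug:
    def __init__(self):
        self.ref = 1               # reference held by the hash table
        self.work_pending = False  # True while the BIO work is queued


def queue_work(plug):
    """Mimic queue_work(): return False if the work is already pending."""
    if plug.work_pending:
        return False
    plug.work_pending = True
    return True


def schedule_bio_work(plug, drop_ref_if_pending=True):
    """Take a reference for the work; undo it if nothing new was queued."""
    plug.ref += 1
    if not queue_work(plug) and drop_ref_if_pending:
        # The work was already queued and holds its own reference.
        plug.ref -= 1


def run_bio_work(plug):
    """Run the work once and release the reference taken when scheduling."""
    plug.work_pending = False
    # ... the next plugged BIO would be submitted here ...
    plug.ref -= 1
```

Scheduling the work twice before it runs leaves the count balanced after one run when the reference is dropped on a false return; with drop_ref_if_pending=False, the model leaks one reference per redundant scheduling.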

disk_get_and_lock_zone_wplug() always returns a zone write plug with the plug lock held. This is unnecessary since this function does not look at the fields of existing plugs, and new plugs need to be locked only after their insertion in the disk hash table, when they are being used. Remove the zone write plug locking from disk_get_and_lock_zone_wplug() and rename this function disk_get_or_alloc_zone_wplug(). blk_zone_wplug_handle_write() is modified to add locking of the zone write plug after calling disk_get_or_alloc_zone_wplug() and before starting to use the plug. This change also simplifies blk_revalidate_seq_zone() as unlocking the plug becomes unnecessary.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
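The split between lookup and locking can be sketched as follows. This is a hypothetical Python model, not the kernel implementation: Plug, Disk, get_or_alloc_plug, handle_write and the wp_offset field are illustrative names, and Python locks stand in for the kernel spinlocks.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Plug:
    zone_no: int
    wp_offset: int = 0  # hypothetical per-plug state protected by the plug lock
    lock: threading.Lock = field(default_factory=threading.Lock)


class Disk:
    def __init__(self):
        self.plugs = {}
        self.hash_lock = threading.Lock()

    def get_or_alloc_plug(self, zone_no):
        """Return the plug of a zone, unlocked; only the hash table is locked."""
        with self.hash_lock:
            plug = self.plugs.get(zone_no)
            if plug is None:
                plug = Plug(zone_no)
                self.plugs[zone_no] = plug
        return plug

    def handle_write(self, zone_no, nr_sectors):
        """Lock the plug only once its fields are actually used."""
        plug = self.get_or_alloc_plug(zone_no)
        with plug.lock:
            start = plug.wp_offset
            plug.wp_offset += nr_sectors
        return start
```

A caller that only needs the plug to exist, as the revalidation path does in the description above, can call get_or_alloc_plug() and has nothing to unlock afterwards.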

The helper function disk_zone_is_full() is only used in disk_zone_wplug_is_full(). So remove it and open code it directly in this single caller.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Rename the struct gendisk zone_wplugs_lock field to zone_wplugs_hash_lock to clearly indicate that this is the spinlock used for manipulating the hash table of zone write plugs.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

In order to maintain sequential write patterns per zone with zoned block devices, zone write plugging issues only a single write BIO per zone at any time. This works well but has the side effect that when the user issues large sequential write streams that cross zone boundaries, the device ends up receiving a discontiguous set of write commands for different zones. The same happens when a user writes to multiple zones simultaneously at high queue depth: the device does not see all sequential writes per zone and receives discontiguous writes to different zones. While this does not affect the performance of solid state zoned block devices, with an SMR HDD this change from sequential writes to discontiguous writes to different zones significantly increases head seeks, which degrades write throughput.

In order to reduce this seek overhead for rotational media devices, introduce a per disk zone write plugs kernel thread to issue all write BIOs to zones. This single zone write issuing context is enabled for any zoned block device that has a request queue flagged with the new QUEUE_ZONED_QD1_WRITES flag.

The flag QUEUE_ZONED_QD1_WRITES is visible as the sysfs queue attribute zoned_qd1_writes for zoned devices. For regular block devices, this attribute is not visible. For zoned block devices, a user can override the default value, either setting the attribute to force a global maximum write queue depth of 1 for the device, or clearing it to fall back to the default behavior of zone write plugging, which limits writes to QD=1 per sequential zone.

Writing to a zoned block device flagged with QUEUE_ZONED_QD1_WRITES is implemented using a list of zone write plugs that have a non-empty BIO list. Listed zone write plugs are processed by the disk zone write plugs worker kthread in FIFO order, and all BIOs of a zone write plug are processed before switching to the next listed zone write plug. A newly submitted BIO for a non-FULL zone write plug that is not yet listed causes the addition of the zone write plug at the end of the disk list of zone write plugs.

Since the write BIOs queued in a zone write plug BIO list are necessarily sequential, for rotational media, using the single zone write plugs kthread to issue all BIOs maintains a sequential write pattern and thus reduces seek overhead and improves write throughput. This processing essentially results in always writing to HDDs at QD=1, which is not an issue for HDDs operating with write caching enabled. Performance with write cache disabled is also not degraded thanks to the efficient write handling of modern SMR HDDs.

A disk list of zone write plugs is defined using the new struct gendisk field zone_wplugs_list, and accesses to this list are protected using the zone_wplugs_list_lock spinlock. The per disk kthread (zone_wplugs_worker) code is implemented by the function disk_zone_wplugs_worker(). A reference on listed zone write plugs is always held until all BIOs of the zone write plug are processed by the worker kthread. BIO issuing at QD=1 is driven using a completion structure (zone_wplugs_worker_bio_done) and calls to blk_io_wait().
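The FIFO processing of listed zone write plugs can be illustrated with a small single-threaded model. This is a Python sketch under simplifying assumptions, not the kernel code: Plug, Disk, submit_write and run_worker are hypothetical names, there is no locking or reference counting, and "issuing" a BIO just records it, which stands in for submitting it and waiting for its completion.

```python
from collections import deque


class Plug:
    def __init__(self, zone_no):
        self.zone_no = zone_no
        self.bios = deque()   # sequential write BIOs queued for this zone
        self.listed = False   # True while the plug is on the disk list


class Disk:
    def __init__(self):
        self.plugs = {}
        self.wplugs_list = deque()  # models the disk list of zone write plugs
        self.issued = []            # order in which the device sees the writes

    def submit_write(self, zone_no, bio):
        """Queue a BIO; list the plug at the tail if it is not listed yet."""
        plug = self.plugs.get(zone_no)
        if plug is None:
            plug = Plug(zone_no)
            self.plugs[zone_no] = plug
        plug.bios.append(bio)
        if not plug.listed:
            plug.listed = True
            self.wplugs_list.append(plug)

    def run_worker(self):
        """Process plugs in FIFO order, draining each plug before the next."""
        while self.wplugs_list:
            plug = self.wplugs_list.popleft()
            while plug.bios:
                # One BIO at a time models issuing writes at QD=1.
                self.issued.append((plug.zone_no, plug.bios.popleft()))
            plug.listed = False
```

With writes to two zones submitted in interleaved order, the worker in this model issues all queued writes of the first listed zone before any write of the second, which is the per-zone sequential pattern the description above aims for.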
With this change, performance when sequentially writing the zones of a 30 TB SMR SATA HDD connected to an AHCI adapter changes as follows (1 MiB direct I/Os, results in MB/s):

                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
| Sequential write | Baseline | Patched |
|   Queue Depth    | 6.19-rc8 |         |
+------------------+----------+---------+
|         1        |   244    |   245   |
|         2        |   244    |   245   |
|         4        |   245    |   245   |
|         8        |   242    |   245   |
|        16        |   222    |   246   |
|        32        |   211    |   245   |
|        64        |   193    |   244   |
|       128        |   112    |   246   |
+------------------+----------+---------+

With the current code (baseline), as the sequential write stream crosses a zone boundary, higher queue depth creates a gap between the last IO to the previous zone and the first IOs to the following zones, causing head seeks and degrading performance. Using the disk zone write plugs worker thread, this pattern disappears and the maximum throughput of the drive is maintained, leading to over 100% improvement in throughput for high queue depth writes.

Using 16 fio jobs all writing to randomly chosen zones at QD=32 with 1 MiB direct IOs, write throughput also increases significantly.

                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
|   Random write   | Baseline | Patched |
| Number of zones  | 6.19-rc7 |         |
+------------------+----------+---------+
|         1        |   191    |   192   |
|         2        |   101    |   128   |
|         4        |   115    |   123   |
|         8        |    90    |   120   |
|        16        |    64    |   115   |
|        32        |    58    |   105   |
|        64        |    56    |   101   |
|       128        |    55    |    99   |
+------------------+----------+---------+

Tests using XFS show that buffered write speed with 8 jobs writing files increases by 12% to 35% depending on the workload.

                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
| Workload         | Baseline | Patched |
|                  | 6.19-rc7 |         |
+------------------+----------+---------+
| 256MiB file size |   212    |   238   |
+------------------+----------+---------+
| 4MiB .. 128 MiB  |   213    |   243   |
| random file size |          |         |
+------------------+----------+---------+
| 2MiB .. 8 MiB    |   179    |   242   |
| random file size |          |         |
+------------------+----------+---------+

Performance gains are even more significant when using an HBA that limits the maximum size of commands to a small value, e.g. HBAs controlled with the mpi3mr driver limit commands to a maximum of 1 MiB. In such case, the write throughput gains are over 40%.

                   +--------------------+
                   |  Write BW (MB/s)   |
+------------------+----------+---------+
| Workload         | Baseline | Patched |
|                  | 6.19-rc7 |         |
+------------------+----------+---------+
| 256MiB file size |   175    |   245   |
+------------------+----------+---------+
| 4MiB .. 128 MiB  |   174    |   244   |
| random file size |          |         |
+------------------+----------+---------+
| 2MiB .. 8 MiB    |   171    |   243   |
| random file size |          |         |
+------------------+----------+---------+

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
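The percentage gains quoted in the commit message above can be rechecked from the table values. The short Python sketch below only redoes that arithmetic: the dictionary and function names are hypothetical, and the numbers are copied from the sequential write, XFS (AHCI) and XFS (mpi3mr) results of that commit message.

```python
def gain_percent(baseline, patched):
    """Relative throughput improvement of patched over baseline, in percent."""
    return (patched - baseline) / baseline * 100.0


# (baseline MB/s, patched MB/s), keyed by queue depth or workload.
SEQ_WRITE = {1: (244, 245), 2: (244, 245), 4: (245, 245), 8: (242, 245),
             16: (222, 246), 32: (211, 245), 64: (193, 244), 128: (112, 246)}
XFS_AHCI = {"256MiB": (212, 238), "4MiB..128MiB": (213, 243),
            "2MiB..8MiB": (179, 242)}
XFS_MPI3MR = {"256MiB": (175, 245), "4MiB..128MiB": (174, 244),
              "2MiB..8MiB": (171, 243)}


def gains(table):
    """Map each table row to its gain, rounded to one decimal place."""
    return {key: round(gain_percent(b, p), 1) for key, (b, p) in table.items()}


if __name__ == "__main__":
    for name, table in [("sequential", SEQ_WRITE), ("xfs/ahci", XFS_AHCI),
                        ("xfs/mpi3mr", XFS_MPI3MR)]:
        print(name, gains(table))
```

The sequential write gain at queue depth 128 works out to about 120%, the XFS gains on AHCI span roughly 12% to 35%, and the mpi3mr gains are all at or above 40%, which matches the figures stated in the text.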

For blk-mq rotational zoned block devices (e.g. SMR HDDs), default to having zone write plugging limit write operations to a maximum queue depth of 1 for all zones. This significantly reduces write seek overhead and improves SMR HDD write throughput.

For remotely connected disks with a very high network latency, this feature might not be useful. However, remotely connected zoned devices are rare at the moment, and we cannot know the round trip latency to pick a good default for network attached devices. System administrators can however disable this feature in that case.

For BIO based (non blk-mq) rotational zoned block devices, the device driver (e.g. a DM target driver) can directly set an appropriate default.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Update the documentation file Documentation/ABI/stable/sysfs-block to describe the zoned_qd1_writes sysfs queue attribute file.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
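For administrators who want to inspect or change the attribute from a script, a minimal Python sketch follows. It makes several assumptions that are not confirmed by this series: that the attribute lives under the usual /sys/block/<disk>/queue/ directory, and that it accepts the strings "1" and "0" to set and clear the flag. The function names and the sysfs_root parameter are hypothetical; the latter exists only so the helpers can be tried against a fake directory tree.

```python
from pathlib import Path


def qd1_attr_path(disk, sysfs_root="/sys/block"):
    """Path of the zoned_qd1_writes queue attribute of a disk (assumed layout)."""
    return Path(sysfs_root) / disk / "queue" / "zoned_qd1_writes"


def read_qd1_writes(disk, sysfs_root="/sys/block"):
    """Return the attribute value as a string, or None if it is not visible."""
    path = qd1_attr_path(disk, sysfs_root)
    if not path.exists():
        # Per the description, regular block devices do not expose the attribute.
        return None
    return path.read_text().strip()


def set_qd1_writes(disk, enable, sysfs_root="/sys/block"):
    """Write "1" or "0" to the attribute (assumed value format; needs root)."""
    qd1_attr_path(disk, sysfs_root).write_text("1" if enable else "0")
```

For example, read_qd1_writes("sda") would return None on a regular disk, and set_qd1_writes("sda", False) would be the scripted way to disable the feature on a zoned disk, subject to the assumptions above.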

blktests-ci · 2026-03-11T08:15:43Z

Upstream branch: None
series: https://patchwork.kernel.org/project/linux-block/list/?series=1058980
version: 4

blktests-ci bot added new V1 linus-master V1-ci-fail labels on Feb 21, 2026

blktests-ci bot force-pushed the linus-master_base branch from df85678 to 50e7070 on February 22, 2026 05:34

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 62da51c to 7fcb4db on February 22, 2026 05:36

blktests-ci bot added V1-ci-pass and removed V1-ci-fail labels on Feb 22, 2026

blktests-ci bot force-pushed the linus-master_base branch from 50e7070 to c90f83b on February 23, 2026 10:11

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 7fcb4db to 44b6aee on February 23, 2026 10:14

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 44b6aee to d3e17fc on February 23, 2026 12:00

blktests-ci bot force-pushed the series/1056048=>linus-master branch from d3e17fc to ffd513e on February 23, 2026 12:10

blktests-ci bot force-pushed the series/1056048=>linus-master branch from ffd513e to acb53d8 on February 23, 2026 12:19

blktests-ci bot added V1-ci-fail and removed V1-ci-pass labels on Feb 23, 2026

blktests-ci bot force-pushed the series/1056048=>linus-master branch from acb53d8 to 4429330 on February 24, 2026 13:20

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 4429330 to 92e7e75 on February 24, 2026 13:29

blktests-ci bot force-pushed the linus-master_base branch from c90f83b to c475e20 on February 25, 2026 11:14

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 92e7e75 to 5376e4e on February 25, 2026 11:17

blktests-ci bot removed the V1 label on Feb 26, 2026

blktests-ci bot force-pushed the linus-master_base branch 2 times, most recently from ecd10e2 to d0e1bed on March 4, 2026 07:45

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 303a4ac to cee5fc7 on March 4, 2026 07:54

blktests-ci bot force-pushed the linus-master_base branch from d0e1bed to 6b51c57 on March 4, 2026 09:34

blktests-ci bot force-pushed the series/1056048=>linus-master branch from cee5fc7 to be8d8b8 on March 4, 2026 09:38

blktests-ci bot force-pushed the linus-master_base branch from 6b51c57 to 78036b2 on March 4, 2026 19:57

blktests-ci bot force-pushed the series/1056048=>linus-master branch from be8d8b8 to bf9fd1e on March 4, 2026 20:00

blktests-ci bot force-pushed the linus-master_base branch from 78036b2 to bbb3394 on March 5, 2026 12:20

blktests-ci bot force-pushed the series/1056048=>linus-master branch from bf9fd1e to 45dacdc on March 5, 2026 12:23

blktests-ci bot force-pushed the linus-master_base branch from bbb3394 to 901a429 on March 5, 2026 21:37

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 45dacdc to 2d5bfe9 on March 5, 2026 21:40

blktests-ci bot force-pushed the linus-master_base branch from 901a429 to 1f19ba6 on March 10, 2026 06:29

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 2d5bfe9 to 6cecaac on March 10, 2026 06:41

blktests-ci bot force-pushed the linus-master_base branch from 1f19ba6 to e79276a on March 11, 2026 08:02

damien-lemoal added 8 commits on March 11, 2026 17:15

blktests-ci bot force-pushed the series/1056048=>linus-master branch from 6cecaac to 1cdf2f3 on March 11, 2026 08:15

