Sync the redfs-rhel9.6 branch to commits from redfs-ubuntu-noble-6.8.0-58.60 by bsbernd · Pull Request #96 · DDNStorage/linux

bsbernd · 2026-02-13T23:14:54Z

Based on

bschubert2@imesrv6 linux.git>compare-branches.py -i .github -i configs --b1 4667450 redfs-rhel9_6-570.12.1 redfs-ubuntu-noble-6.8.0-58.60
Missing from redfs-rhel9_6-570.12.1
32e0073 (Bernd Schubert) fuse: {io-uring} Prefer the current core over mapping
f6de786 (Jingbo Xu) fuse: invalidate the page cache after direct write

1607a03 (Horst Birthelmer) fuse: simplify compound commands
2e73b0b (Bernd Schubert) fuse: Add retry attempts for numa local queues for load distribution
c6399ea (Bernd Schubert) fuse: {io-uring} Queue background requests on a different core
f18c61e (Horst Birthelmer) fuse: avoid tmp copying of data for writeback pages
ade0d22 (Matthew Wilcox (Oracle)) fuse: Remove fuse_writepage
461c4ed (Horst Birthelmer) Revert "fuse: avoid tmp copying of data for writeback pages"
5e590a6 (Jingbo Xu) fuse: make foffset alignment opt-in for optimum backend performance

Missing from redfs-ubuntu-noble-6.8.0-58.60

8c84323 (Horst Birthelmer) fuse: simplify compound commands
f5fed0e (Feng Shuo) Fix the compiling error on aarch64
aea831d (Feng Shuo) Support for RHEL/Rocky 9.7

Sometimes the file offset alignment needs to be opt-in to achieve the optimum performance at the backend store. For example when ErasureCode [1] is used at the backend store, the optimum write performance is achieved when the WRITE request is aligned with the stripe size of ErasureCode. Otherwise a non-aligned WRITE request needs to be split at the stripe size boundary. It is quite costly to handle these split partial requests, as firstly the whole stripe to which the split partial request belongs needs to be read out, then overwrite the read stripe buffer with the request, and finally write the whole stripe back to the persistent storage. Thus the backend store can suffer severe performance degradation when WRITE requests can not fit into one stripe exactly. The write performance can be 10x slower when the request is 256KB in size given 4MB stripe size. Also there can be 50% performance degradation in theory if the request is not stripe boundary aligned. Besides, the conveyed test indicates that, the non-alignment issue becomes more severe when decreasing fuse's max_ratio, maybe partly because the background writeback now is more likely to run parallelly with the dirtier. fuse's max_ratio ratio of aligned WRITE requests ---------------- ------------------------------- 70 99.9% 40 74% 20 45% 10 20% With the patched version, which makes the alignment constraint opt-in when constructing WRITE requests, the ratio of aligned WRITE requests increases to 98% (previously 20%) when fuse's max_ratio is 10. fuse: fix alignment to work with redfs ubuntu - small fix to make the fuse alignment patch work with redfs ubuntu 6.8.x - add writeback_control to fuse_writepage_need_send() to make more accurate decisions about when to skip sending data - fix shift number for FUSE_ALIGN_PG_ORDER - remove test code [1] https://lore.kernel.org/linux-fsdevel/20240124070512.52207-1-jefflexu@linux.alibaba.com/T/#m9bce469998ea6e4f911555c6f7be1e077ce3d8b4 Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (cherry picked from commit 5e590a6)

This reverts commit 114c4df. (cherry picked from commit 461c4ed)

The writepage operation is deprecated as it leads to worse performance under high memory pressure due to folios being written out in LRU order rather than sequentially within a file. Use filemap_migrate_folio() to support dirty folio migration instead of writepage. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> (cherry picked from commit ade0d22)

When writing back pages while using writeback caching the code did a copy of data into temporary pages to avoid a deadlock in reclaiming of memory. This is an adaptation and backport of a patch by Joanne Koong joannelkoong@gmail.com. Since we use pinned memory with io_uring we don't need the temporary copies and we don't use the AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM flag in the pagemap. Link: https://www.spinics.net/lists/linux-mm/msg407405.html Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (cherry picked from commit f18c61e)

Running background IO on a different core makes quite a difference. fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \ --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\ --runtime=30s --group_reporting --ioengine=io_uring\ --direct=1 unpatched READ: bw=272MiB/s (285MB/s) ... patched READ: bw=650MiB/s (682MB/s) Reason is easily visible, the fio process is migrating between CPUs when requests are submitted on the queue for the same core. With --iodepth=8 unpatched READ: bw=466MiB/s (489MB/s) patched READ: bw=641MiB/s (672MB/s) Without io-uring (--iodepth=8) READ: bw=729MiB/s (764MB/s) Without fuse (--iodepth=8) READ: bw=2199MiB/s (2306MB/s) (Test were done with <libfuse>/example/passthrough_hp -o allow_other --nopassthrough \ [-o io_uring] /tmp/source /tmp/dest ) Additional notes: With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8) READ: bw=903MiB/s (946MB/s) With just a random qid (--iodepth=8) READ: bw=429MiB/s (450MB/s) With --iodepth=1 unpatched READ: bw=195MiB/s (204MB/s) patched READ: bw=232MiB/s (243MB/s) With --iodepth=1 --numjobs=2 unpatched READ: bw=366MiB/s (384MB/s) patched READ: bw=472MiB/s (495MB/s) With --iodepth=1 --numjobs=8 unpatched READ: bw=1437MiB/s (1507MB/s) patched READ: bw=1529MiB/s (1603MB/s) fuse without io-uring READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ... no-fuse READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ... In summary, for async requests the core doing application IO is busy sending requests and processing IOs should be done on a different core. Spreading the load on random cores is also not desirable, as the core might be frequency scaled down and/or in C1 sleep states. Not shown here, but differnces are much smaller when the system uses performance govenor instead of schedutil (ubuntu default). Obviously at the cost of higher system power consumption for performance govenor - not desirable either. Results without io-uring (which uses fixed libfuse threads per queue) heavily depend on the current number of active threads. Libfuse uses default of max 10 threads, but actual nr max threads is a parameter. Also, no-fuse-io-uring results heavily depend on, if there was already running another workload before, as libfuse starts these threads dynamically - i.e. the more threads are active, the worse the performance. Signed-off-by: Bernd Schubert <bschubert@ddn.com> (cherry picked from commit c6399ea)

This is to further improve performance. fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \ --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based\ --runtime=30s --group_reporting --ioengine=io_uring\ --direct=1 unpatched READ: bw=650MiB/s (682MB/s) patched: READ: bw=995MiB/s (1043MB/s) with --iodepth=8 unpatched READ: bw=641MiB/s (672MB/s) patched READ: bw=966MiB/s (1012MB/s) Reason is that with --iodepth=x (x > 1) fio submits multiple async requests and a single queue might become CPU limited. I.e. spreading the load helps. (cherry picked from commit 2e73b0b)

cding-ddn · 2026-02-14T08:41:35Z

Seems there is compile error that need to be fixed. Copied from the log of kernel-build-job

fs/fuse/file.c:66:6: error: redefinition of 'fuse_open_args_fill'
   66 | void fuse_open_args_fill(struct fuse_args *args, u64 nodeid, int opcode,
      |      ^~~~~~~~~~~~~~~~~~~
fs/fuse/file.c:33:6: note: previous definition of 'fuse_open_args_fill' with type 'void(struct fuse_args *, u64,  int,  struct fuse_open_in *, struct fuse_open_out *)' {aka 'void(struct fuse_args *, long long unsigned int,  int,  struct fuse_open_in *, struct fuse_open_out *)'}
   33 | void fuse_open_args_fill(struct fuse_args *args, u64 nodeid, int opcode,
      |      ^~~~~~~~~~~~~~~~~~~

Simplify fuse_compound_req to hold only the pointers to the added fuse args and the request housekeeping. Simplify open+getattr call by using helper functions to fill out the fuse request parameters Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> (cherry picked from commit 1607a03) Note: Empty, as there seem to be compound differences between branches - the el9.6 branch already has the changes from this patch. Keeping it will just simplify branch comparison with https://github.com/bsbernd/compare-git-branches

This fixes xfstests generic/451 (for both O_DIRECT and FOPEN_DIRECT_IO direct write). Commit b359af8 ("fuse: Invalidate the page cache after FOPEN_DIRECT_IO write") tries to fix the similar issue for FOPEN_DIRECT_IO write, which can be reproduced by xfstests generic/209. It only fixes the issue for synchronous direct write, while omitting the case for asynchronous direct write (exactly targeted by generic/451). While for O_DIRECT direct write, it's somewhat more complicated. For synchronous direct write, generic_file_direct_write() will invalidate the page cache after the write, and thus it can pass generic/209. While for asynchronous direct write, the invalidation in generic_file_direct_write() is bypassed since the invalidation shall be done when the asynchronous IO completes. This is omitted in FUSE and generic/451 fails whereby. Fix this by conveying the invalidation for both synchronous and asynchronous write. - with FOPEN_DIRECT_IO - sync write, invalidate in fuse_send_write() - async write, invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO, fuse_send_write() otherwise - without FOPEN_DIRECT_IO - sync write, invalidate in generic_file_direct_write() - async write, invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO, generic_file_direct_write() otherwise Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> (cherry picked from commit f6de786)

Mapping might point to a totally different core due to random assignment. For performance using the current core might be beneficial Example (with core binding) unpatched WRITE: bw=841MiB/s patched WRITE: bw=1363MiB/s With fio --name=test --ioengine=psync --direct=1 \ --rw=write --bs=1M --iodepth=1 --numjobs=1 \ --filename_format=/redfs/testfile.\$jobnum --size=100G \ --thread --create_on_open=1 --runtime=30s --cpus_allowed=1 In order to get the good number `--cpus_allowed=1` is needed. This could be improved by a future change that avoids cpu migration in fuse_request_end() on wake_up() call. (cherry picked from commit 32e0073)

bsbernd · 2026-02-14T14:01:30Z

@cding-ddn @hbirth the compilation issue was due to fuse: simplify compound commands. Actually the el9.6 branch already had these changes, just not as commit. I now kept commit, although empty, to make the branch comparison script happy.

lostjeffle and others added 6 commits February 14, 2026 00:11

Revert "fuse: avoid tmp copying of data for writeback pages"

26bf669

This reverts commit 114c4df. (cherry picked from commit 461c4ed)

bsbernd requested review from cding-ddn, hbirth, openunix and yongzech February 13, 2026 23:15

yongzech previously approved these changes Feb 14, 2026

View reviewed changes

bsbernd dismissed yongzech’s stale review via 73cb091 February 14, 2026 13:31

bsbernd force-pushed the sync-redfs-rhel9_6-570.12.1 branch from 4619351 to 73cb091 Compare February 14, 2026 13:31

hbirth and others added 3 commits February 14, 2026 14:31

bsbernd force-pushed the sync-redfs-rhel9_6-570.12.1 branch from 73cb091 to dc9fe28 Compare February 14, 2026 13:36

bsbernd merged commit 140d7e7 into redfs-rhel9_6-570.12.1 Feb 14, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync the redfs-rhel9.6 branch to commits from redfs-ubuntu-noble-6.8.0-58.60#96

Sync the redfs-rhel9.6 branch to commits from redfs-ubuntu-noble-6.8.0-58.60#96
bsbernd merged 9 commits intoredfs-rhel9_6-570.12.1from
sync-redfs-rhel9_6-570.12.1

bsbernd commented Feb 13, 2026

Uh oh!

cding-ddn commented Feb 14, 2026

Uh oh!

bsbernd commented Feb 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Comments

Conversation

bsbernd commented Feb 13, 2026

Uh oh!

cding-ddn commented Feb 14, 2026

Uh oh!

bsbernd commented Feb 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Comments