Sync the redfs-rhel9.6 branch to commits from redfs-ubuntu-noble-6.8.0-58.60 #96

Merged
bsbernd merged 9 commits into redfs-rhel9_6-570.12.1 from sync-redfs-rhel9_6-570.12.1 on Feb 14, 2026
Conversation

@bsbernd (Collaborator) commented Feb 13, 2026

Based on

bschubert2@imesrv6 linux.git>compare-branches.py -i .github -i configs --b1 4667450 redfs-rhel9_6-570.12.1 redfs-ubuntu-noble-6.8.0-58.60
Missing from redfs-rhel9_6-570.12.1
32e0073 (Bernd Schubert) fuse: {io-uring} Prefer the current core over mapping
f6de786 (Jingbo Xu) fuse: invalidate the page cache after direct write

  • 1607a03 (Horst Birthelmer) fuse: simplify compound commands
    2e73b0b (Bernd Schubert) fuse: Add retry attempts for numa local queues for load distribution
    c6399ea (Bernd Schubert) fuse: {io-uring} Queue background requests on a different core
  • f18c61e (Horst Birthelmer) fuse: avoid tmp copying of data for writeback pages
    ade0d22 (Matthew Wilcox (Oracle)) fuse: Remove fuse_writepage
    461c4ed (Horst Birthelmer) Revert "fuse: avoid tmp copying of data for writeback pages"
    5e590a6 (Jingbo Xu) fuse: make foffset alignment opt-in for optimum backend performance

Missing from redfs-ubuntu-noble-6.8.0-58.60

  • 8c84323 (Horst Birthelmer) fuse: simplify compound commands
    f5fed0e (Feng Shuo) Fix the compiling error on aarch64
    aea831d (Feng Shuo) Support for RHEL/Rocky 9.7

lostjeffle and others added 6 commits February 14, 2026 00:11
Sometimes the file offset alignment needs to be opt-in to achieve the
optimum performance at the backend store.

For example when ErasureCode [1] is used at the backend store, optimum
write performance is achieved when the WRITE request is aligned with
the stripe size of the ErasureCode.  Otherwise a non-aligned WRITE
request needs to be split at the stripe boundary.  Handling these split
partial requests is quite costly: first the whole stripe to which the
partial request belongs needs to be read out, then the read stripe
buffer is overwritten with the request data, and finally the whole
stripe is written back to persistent storage.
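To make the splitting concrete, a hypothetical helper that clamps WRITE
requests at stripe boundaries might look like this (clamp_to_stripe and
its arguments are illustrative only, not part of the patch):

    /* Hypothetical sketch: clamp a WRITE so it never crosses a stripe
     * boundary; stripe_size is assumed to be a power of two. */
    static size_t clamp_to_stripe(loff_t pos, size_t len, size_t stripe_size)
    {
            size_t room = stripe_size - (pos & (stripe_size - 1));

            /* Everything past the stripe end goes into the next request,
             * so the backend never needs the read-modify-write cycle
             * described above. */
            return min(len, room);
    }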

Thus the backend store can suffer severe performance degradation when
WRITE requests cannot fit into one stripe exactly.  The write performance
can be 10x slower when the request is 256KB in size given a 4MB stripe
size.  In theory there can also be a 50% performance degradation whenever
a request is not stripe-boundary aligned.

Besides, our tests indicate that the non-alignment issue becomes more
severe when decreasing fuse's max_ratio, perhaps partly because
background writeback is then more likely to run in parallel with the
dirtier.

fuse's max_ratio	ratio of aligned WRITE requests
----------------	-------------------------------
70			99.9%
40			74%
20			45%
10			20%

With the patched version, which makes the alignment constraint opt-in
when constructing WRITE requests, the ratio of aligned WRITE requests
increases to 98% (previously 20%) when fuse's max_ratio is 10.

fuse: fix alignment to work with redfs ubuntu

- small fix to make the fuse alignment patch work with redfs ubuntu 6.8.x
- add writeback_control to fuse_writepage_need_send() to make
more accurate decisions about when to skip sending data
- fix shift number for FUSE_ALIGN_PG_ORDER
- remove test code

[1] https://lore.kernel.org/linux-fsdevel/20240124070512.52207-1-jefflexu@linux.alibaba.com/T/#m9bce469998ea6e4f911555c6f7be1e077ce3d8b4
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit 5e590a6)
This reverts commit 114c4df.

(cherry picked from commit 461c4ed)
The writepage operation is deprecated as it leads to worse performance
under high memory pressure due to folios being written out in LRU order
rather than sequentially within a file.  Use filemap_migrate_folio() to
support dirty folio migration instead of writepage.
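In address_space_operations terms the change amounts to dropping
.writepage and wiring up .migrate_folio instead; a sketch, under the
assumption that the fuse aops otherwise look like current mainline
(remaining fields elided):

    static const struct address_space_operations fuse_file_aops = {
            /* no .writepage: writeback happens only via .writepages,
             * which writes folios in file order instead of LRU order */
            .writepages     = fuse_writepages,
            /* dirty folios can still migrate without being written out */
            .migrate_folio  = filemap_migrate_folio,
            /* other ops elided */
    };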

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit ade0d22)
When writing back pages with writeback caching enabled, the code copied
data into temporary pages to avoid a deadlock during memory reclaim.

This is an adaptation and backport of a patch by Joanne Koong
<joannelkoong@gmail.com>.

Since we use pinned memory with io_uring, we don't need the temporary
copies and we don't use the AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM flag on
the mapping.

Link: https://www.spinics.net/lists/linux-mm/msg407405.html
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit f18c61e)
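A minimal sketch of the decision this change boils down to; the helper
name and the io_uring flag on fuse_conn are assumptions for illustration:

    /* Assumption: fc->io_uring signals that requests are served over
     * fuse-over-io-uring with pinned (registered) buffers. */
    static bool fuse_need_tmp_writeback_copy(struct fuse_conn *fc)
    {
            /* Pinned buffers cannot trigger the reclaim deadlock, so
             * only the classic /dev/fuse path still needs the
             * temporary-page copy. */
            return !fc->io_uring;
    }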
Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
    --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
    --runtime=30s --group_reporting --ioengine=io_uring \
    --direct=1

unpatched
   READ: bw=272MiB/s (285MB/s) ...
patched
   READ: bw=650MiB/s (682MB/s)

The reason is easy to see: the fio process keeps migrating between CPUs
when requests are submitted on the queue of the same core.

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s)
patched
   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s)

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s)

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s)

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s)
patched
   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=366MiB/s (384MB/s)
patched
   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1437MiB/s (1507MB/s)
patched
   READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
submitting requests, so processing IOs should be done on a different core.
Spreading the load across random cores is also not desirable, as such a
core might be frequency-scaled down and/or in a C1 sleep state. Not shown
here, but the differences are much smaller when the system uses the
performance governor instead of schedutil (the ubuntu default) - obviously
at the cost of higher system power consumption, which is not desirable
either.

Results without io-uring heavily depend on the current number of active
threads (the io-uring mode uses fixed libfuse threads per queue). Libfuse
defaults to a maximum of 10 threads, but the actual maximum number of
threads is a parameter. The fuse-without-io-uring results also heavily
depend on whether another workload was already running before, as libfuse
starts these threads dynamically - i.e. the more threads are active, the
worse the performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit c6399ea)
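The core-selection idea of this patch could look roughly like the
following (hypothetical sketch; names other than raw_smp_processor_id()
are assumptions, not the actual patch):

    /* Pick a queue on a neighboring core for background requests, so
     * the submitting core stays free for the application. */
    static unsigned int fuse_uring_bg_qid(unsigned int nr_queues)
    {
            unsigned int cpu = raw_smp_processor_id();

            /* A neighboring core is typically on the same package/NUMA
             * node, but is not the core busy submitting IO. */
            return (cpu + 1) % nr_queues;
    }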
This is to further improve performance.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
    --bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
    --runtime=30s --group_reporting --ioengine=io_uring \
    --direct=1

unpatched
   READ: bw=650MiB/s (682MB/s)
patched:
   READ: bw=995MiB/s (1043MB/s)

with --iodepth=8

unpatched
   READ: bw=641MiB/s (672MB/s)
patched
   READ: bw=966MiB/s (1012MB/s)

The reason is that with --iodepth=x (x > 1) fio submits multiple async
requests, so a single queue might become CPU limited - i.e. spreading
the load helps.

(cherry picked from commit 2e73b0b)
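A sketch of the retry idea; apart from FURING_NEXT_QUEUE_RETRIES
(mentioned in the notes above), all names and the busy test are
assumptions for illustration:

    static struct fuse_ring_queue *
    fuse_uring_pick_queue(struct fuse_ring *ring, unsigned int first_qid)
    {
            unsigned int qid = first_qid;
            int retry;

            for (retry = 0; retry < FURING_NEXT_QUEUE_RETRIES; retry++) {
                    struct fuse_ring_queue *queue = ring->queues[qid];

                    /* take the first NUMA-local queue that is not busy */
                    if (queue && !fuse_uring_queue_busy(queue))
                            return queue;

                    qid = next_numa_local_qid(ring, qid); /* hypothetical */
            }

            return ring->queues[first_qid]; /* fall back to first choice */
    }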
yongzech previously approved these changes Feb 14, 2026
@cding-ddn (Collaborator)

Seems there is a compile error that needs to be fixed. Copied from the log of the kernel-build job:

fs/fuse/file.c:66:6: error: redefinition of 'fuse_open_args_fill'
   66 | void fuse_open_args_fill(struct fuse_args *args, u64 nodeid, int opcode,
      |      ^~~~~~~~~~~~~~~~~~~
fs/fuse/file.c:33:6: note: previous definition of 'fuse_open_args_fill' with type 'void(struct fuse_args *, u64,  int,  struct fuse_open_in *, struct fuse_open_out *)' {aka 'void(struct fuse_args *, long long unsigned int,  int,  struct fuse_open_in *, struct fuse_open_out *)'}
   33 | void fuse_open_args_fill(struct fuse_args *args, u64 nodeid, int opcode,
      |      ^~~~~~~~~~~~~~~~~~~

hbirth and others added 3 commits February 14, 2026 14:31
Simplify fuse_compound_req to hold only the pointers
to the added fuse args and the request housekeeping.

Simplify the open+getattr call by using helper functions
to fill out the fuse request parameters.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit 1607a03)

Note: Empty, as there seem to be compound differences between
branches - the el9.6 branch already has the changes from this
patch. Keeping it will just simplify branch comparison
with https://github.com/bsbernd/compare-git-branches
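The helper's signature is visible in the build log above; a sketch of a
plausible body (the field assignments are assumptions based on the
mainline struct fuse_args layout):

    void fuse_open_args_fill(struct fuse_args *args, u64 nodeid, int opcode,
                             struct fuse_open_in *inarg,
                             struct fuse_open_out *outarg)
    {
            /* one place to wire up an open-style request and its reply */
            args->opcode = opcode;
            args->nodeid = nodeid;
            args->in_numargs = 1;
            args->in_args[0].size = sizeof(*inarg);
            args->in_args[0].value = inarg;
            args->out_numargs = 1;
            args->out_args[0].size = sizeof(*outarg);
            args->out_args[0].value = outarg;
    }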
This fixes xfstests generic/451 (for both O_DIRECT and FOPEN_DIRECT_IO
direct write).

Commit b359af8 ("fuse: Invalidate the page cache after
FOPEN_DIRECT_IO write") tries to fix a similar issue for
FOPEN_DIRECT_IO write, which can be reproduced by xfstests generic/209.
It only fixes the issue for synchronous direct write, while omitting
the case for asynchronous direct write (exactly targeted by
generic/451).

For O_DIRECT write it is somewhat more complicated.  For synchronous
direct write, generic_file_direct_write() will invalidate the page cache
after the write, and thus it can pass generic/209.  For asynchronous
direct write, however, the invalidation in generic_file_direct_write()
is bypassed, since the invalidation shall be done when the asynchronous
IO completes.  This is omitted in FUSE, and thus generic/451 fails.

Fix this by performing the invalidation for both synchronous and
asynchronous write.

- with FOPEN_DIRECT_IO
  - sync write,  invalidate in fuse_send_write()
  - async write, invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO,
		 fuse_send_write() otherwise
- without FOPEN_DIRECT_IO
  - sync write,  invalidate in generic_file_direct_write()
  - async write, invalidate in fuse_aio_complete() with FUSE_ASYNC_DIO,
		 generic_file_direct_write() otherwise

Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
(cherry picked from commit f6de786)
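As a rough sketch of the completion-side invalidation described above
(the helper is hypothetical; invalidate_inode_pages2_range() is the
usual mm interface for this):

    /* Drop the now-stale page cache covering an async DIO write range. */
    static void fuse_dio_invalidate(struct inode *inode, loff_t pos,
                                    size_t len)
    {
            pgoff_t first = pos >> PAGE_SHIFT;
            pgoff_t last = (pos + len - 1) >> PAGE_SHIFT;

            invalidate_inode_pages2_range(inode->i_mapping, first, last);
    }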
The mapping might point to a totally different core due to random
assignment. For performance, using the current core might be
beneficial.

Example (with core binding)

unpatched WRITE: bw=841MiB/s
patched   WRITE: bw=1363MiB/s

With
fio --name=test --ioengine=psync --direct=1 \
    --rw=write --bs=1M --iodepth=1 --numjobs=1 \
    --filename_format=/redfs/testfile.\$jobnum --size=100G \
    --thread --create_on_open=1 --runtime=30s --cpus_allowed=1

In order to get the good numbers, `--cpus_allowed=1` is needed.
This could be improved by a future change that avoids CPU migration
in fuse_request_end() on the wake_up() call.

(cherry picked from commit 32e0073)
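A sketch of the preference described above; ring->qid_map stands in for
the random per-CPU mapping and, like the other names except
raw_smp_processor_id(), is an assumption:

    static unsigned int fuse_uring_pick_qid(struct fuse_ring *ring)
    {
            unsigned int cpu = raw_smp_processor_id();

            /* Prefer a queue on the submitting core over the mapping,
             * which may point to a totally different core. */
            if (cpu < ring->nr_queues)
                    return cpu;

            return ring->qid_map[cpu]; /* hypothetical fallback mapping */
    }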
@bsbernd bsbernd force-pushed the sync-redfs-rhel9_6-570.12.1 branch from 73cb091 to dc9fe28 on February 14, 2026 13:36
@bsbernd (Collaborator, Author) commented Feb 14, 2026

@cding-ddn @hbirth the compilation issue was due to "fuse: simplify compound commands". Actually the el9.6 branch already had these changes, just not as a commit. I have now kept the commit, although empty, to make the branch-comparison script happy.

@bsbernd bsbernd merged commit 140d7e7 into redfs-rhel9_6-570.12.1 Feb 14, 2026
2 checks passed