
libdrgn: aarch64: Rework page table walker to only read one PTE per level #312

Merged · 1 commit merged into osandov:main from pte-per-level on Nov 6, 2023

Conversation

@pcc (Contributor) commented Jun 28, 2023


The current page table walker will on average read around half of the entire page table for each level. This is inefficient, especially when debugging a remote target which may have a low bandwidth connection to the debugger. Address this by only reading one PTE per level.

I've only done the aarch64 page table walker because that's all that I needed, but in principle the other page table walkers could work in a similar way.
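
To illustrate, here is a rough sketch of the walk in Python (not the libdrgn C code): it assumes a 4 KiB granule with 4 levels, ignores block/huge mappings, and read_u64 stands in for whatever transport reads target memory. Exactly one 8-byte descriptor is fetched per level.

PAGE_SHIFT = 12            # 4 KiB granule (assumed)
PGTABLE_SHIFT = 9          # 512 descriptors of 8 bytes per table
ENTRY_MASK = (1 << PGTABLE_SHIFT) - 1
ADDRESS_MASK = 0x0000FFFFFFFFF000
VALID = 1 << 0

def translate(read_u64, pgtable, virt_addr, levels=4):
    # Walk from the top level down, reading one descriptor per level.
    table = pgtable
    for level in range(levels - 1, -1, -1):
        index = (virt_addr >> (PAGE_SHIFT + PGTABLE_SHIFT * level)) & ENTRY_MASK
        entry = read_u64(table + 8 * index)   # the single PTE read for this level
        if not entry & VALID:
            return None                       # not mapped
        table = entry & ADDRESS_MASK
    return table | (virt_addr & ((1 << PAGE_SHIFT) - 1))

The existing walker instead reads from the current entry to the end of each 4 KiB table, which works out to roughly 2 KiB per level on average.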

@pcc force-pushed the pte-per-level branch from 0f180b6 to ff688c8 on June 28, 2023 01:19
@brenns10 (Contributor) commented

I'm a bit confused: how are you debugging a remote target using drgn? Do you mean a vmcore on a remote filesystem?

I only ask because in my experience, /proc/kcore tends to be high latency, but not terribly low bandwidth. Normally performance optimizations involve reading more data at a time, with fewer read requests. If I understand this correctly, you're looking to do the reverse: read less data per request.

So I guess I'm curious what situation leads to this particular set of constraints you're facing?

@pcc (Contributor, Author) commented Jun 28, 2023

I'm using an SWD connection to the target via OpenOCD. I have a drgn patch to add OpenOCD support that I intend to share in the next few days, once I've removed some hardcoding specific to my target. For various reasons it isn't always feasible to remotely mount my target's /proc/kcore, so I didn't consider that route. SWD also enables some additional capabilities compared to /proc/kcore, such as accessing MMIO registers on the target and debugging non-Linux operating systems and bare-metal microcontroller firmware.

SWD connections tend to be fairly low bandwidth. For example, a common maximum adapter frequency is in the 10 MHz range. Taking protocol overhead into account, the theoretical maximum data transfer speed is 10 Mb/s × 32 / (32 + 15) ≈ 6.81 Mb/s ≈ 851 KB/s. But this is only theoretical: the design of the protocol between the host and the adapter can have a big impact. In my experience, data transfer speeds can vary between 10% and 50% of the theoretical maximum depending on the adapter protocol. So with the existing code, the worst case translates to on average (2 KiB × 3 levels) / 85.1 KB/s ≈ 0.07 s to transfer a 3-level page table, not taking round-trip time into account.
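
Restating that arithmetic as a tiny script (the numbers are the assumptions above, not measurements):

adapter_hz = 10_000_000                              # 10 MHz SWD clock
peak_kb_s = adapter_hz * 32 / (32 + 15) / 8 / 1000   # ≈ 851 KB/s theoretical
worst_case_kb_s = 0.10 * peak_kb_s                   # ≈ 85.1 KB/s at 10% efficiency
walk_kb = 2 * 3                                      # ~2 KiB per level, 3 levels
print(walk_kb / worst_case_kb_s)                     # ≈ 0.07 s per page table walk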

With drgn's typical usage model, we have a large number of transfers of small amounts of data which each individually require a page table transfer, so it makes sense to optimize for the case where only a single PTE is required at each level.

@osandov (Owner) commented Jun 29, 2023

As I alluded to in #315 and @brenns10 mentioned, this is done intentionally because /proc/kcore has a horrible per-read overhead (see also #106). linux_kernel_pgtable_iterator_next_x86_64() even has a comment justifying this that I failed to copy over to the AArch64 version. Doing it this way would be a huge regression for the /proc/kcore case, and vice versa, so although it will complicate things, we might need to choose the strategy based on whether the target is local or remote. Alternatively, we could give the page table iterator interface a hint about how much it should "read ahead".

@pcc (Contributor, Author) commented Jun 30, 2023

Thanks for taking a look. I think it would be worth measuring the impact in practice; in my experience drgn typically does a large number of reads of small size (word size or less), which means that the additional PTEs loaded by the existing code are almost never needed.

I don't currently have a setup for debugging the kernel with a local /proc/kcore on aarch64. I will try to set that up and collect some performance numbers.

@osandov (Owner) commented Jun 30, 2023

That's fair; you're right that the vast majority of reads are smaller than a page and thus don't benefit from the extra reads at all. I think the readahead hint idea is the best option across the board, but it could be complicated. I can take a stab at implementing it, and if it's too difficult, I think this patch would be okay.

@pcc (Contributor, Author) commented Jun 30, 2023

I measured the performance on an Apple M1 Mac Mini running an Asahi Linux kernel (commit b5c05cbffb0488c7618106926d522cc3b43d93d5 from the Asahi kernel repo) by running this 10 times and comparing the results without and with this patch.

timeit.timeit(lambda: fs.path_lookup(prog, '/sys/bus/platform/devices'), number=10000)

The results were (x = without patch, + = with patch):

    N           Min           Max        Median           Avg        Stddev
x  10     5.2588345     5.3796673     5.2976387     5.3028653   0.034026569
+  10     5.1722865     5.3094247     5.2717144     5.2574367   0.044795798
Difference at 95.0% confidence
        -0.0454286 +/- 0.0373746
        -0.85668% +/- 0.702597%
        (Student's t, pooled s = 0.0397773)

So it looks like we're typically slightly faster with this patch, even in the local case.

@osandov (Owner) commented Jun 30, 2023

Hm, is the page table iterator being called for this test case? I would expect that all of these memory reads would come from /proc/kcore, which is added here:

err = drgn_program_add_memory_segment(prog, phdr->p_vaddr,

@pcc (Contributor, Author) commented Jun 30, 2023

Right, it isn't being called, as you also pointed out in #310. So the above was just measurement noise.

So what do you reckon would be a good test case for this? Maybe we can read a userspace process's memory?

@osandov (Owner) commented Jun 30, 2023

cmdline() and environ() read via the page table under the hood and are fairly realistic use cases, so those might make for a good test. Something like:

import os
from drgn.helpers.linux.pid import find_task

task = find_task(prog, os.getpid())
# Then use timeit to measure cmdline(task) and environ(task)

@pcc (Contributor, Author) commented Jun 30, 2023

Thanks, confirmed that environ(task) does end up calling the page table iterator. I did:

timeit.timeit(lambda: environ(task), number=1000000)
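
For reference, the full benchmark was roughly the following, run under the drgn CLI (which provides prog); I'm assuming the usual drgn import paths for these helpers:

import os
import timeit

from drgn.helpers.linux.mm import environ
from drgn.helpers.linux.pid import find_task

task = find_task(prog, os.getpid())   # prog is supplied by the drgn CLI
print(timeit.timeit(lambda: environ(task), number=1000000))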

With the PR as uploaded, we were a bit slower with this patch:

    N           Min           Max        Median           Avg        Stddev
x  10      5.034446      5.132914     5.0561243     5.0657224   0.028014023
+  10     5.3784772     5.5665043     5.4984312     5.4838718   0.064820429

This was a bit of a surprising result because linux_kernel_pgtable_iterator_next_aarch64 was only being called once for environ(task). I investigated, and it looks like it was due to the memset on line 470 of my patch clearing the same memory multiple times, which I had previously assumed would be insignificant compared to actually reading the page table. As a test I removed it, and we were faster with this patch.

    N           Min           Max        Median           Avg        Stddev
x  10      5.034446      5.132914     5.0561243     5.0657224   0.028014023
+  10     4.6734759     4.8093641     4.7221596     4.7194108    0.04111319
Difference at 95.0% confidence
        -0.346312 +/- 0.0330537
        -6.83637% +/- 0.638692%
        (Student's t, pooled s = 0.0351787)

Removing the memset isn't correct in general so I'll look at a proper fix.

@pcc force-pushed the pte-per-level branch from ff688c8 to 0531018 on June 30, 2023 21:24
@pcc (Contributor, Author) commented Jun 30, 2023

With the patch I just uploaded the results look like this:

    N           Min           Max        Median           Avg        Stddev
x  10      5.034446      5.132914     5.0561243     5.0657224   0.028014023
+  10     4.1080728     4.2879725     4.1325089     4.1497191   0.053896147
Difference at 95.0% confidence
        -0.916003 +/- 0.0403566
        -18.0824% +/- 0.768284%
        (Student's t, pooled s = 0.042951)

@osandov (Owner) commented Oct 10, 2023

The other case to test, which I should've mentioned, is a large read. I implemented the equivalent of this PR for x86-64:

 libdrgn/arch_x86_64.c | 70 ++++++++++++++++++---------------------------------
 1 file changed, 24 insertions(+), 46 deletions(-)

diff --git a/libdrgn/arch_x86_64.c b/libdrgn/arch_x86_64.c
index 2af98f70..f1c554c8 100644
--- a/libdrgn/arch_x86_64.c
+++ b/libdrgn/arch_x86_64.c
@@ -600,11 +600,10 @@ linux_kernel_pgtable_iterator_next_x86_64(struct drgn_program *prog,
 	static const uint64_t PSE = 0x80; /* a.k.a. huge page */
 	static const uint64_t ADDRESS_MASK = UINT64_C(0xffffffffff000);
 	struct drgn_error *err;
-	bool bswap = drgn_platform_bswap(&prog->platform);
 	struct pgtable_iterator_x86_64 *it =
 		container_of(_it, struct pgtable_iterator_x86_64, it);
 	uint64_t virt_addr = it->it.virt_addr;
-	int levels = prog->vmcoreinfo.pgtable_l5_enabled ? 5 : 4, level;
+	int levels = prog->vmcoreinfo.pgtable_l5_enabled ? 5 : 4;
 
 	uint64_t start_non_canonical =
 		(UINT64_C(1) <<
@@ -619,52 +618,31 @@ linux_kernel_pgtable_iterator_next_x86_64(struct drgn_program *prog,
 		return NULL;
 	}
 
-	/* Find the lowest level with cached entries. */
-	for (level = 0; level < levels; level++) {
-		if (it->index[level] < array_size(it->table[level]))
-			break;
-	}
-	/* For every level below that, refill the cache/return pages. */
-	for (;; level--) {
-		uint64_t table;
-		bool table_physical;
-		uint16_t index;
-		if (level == levels) {
-			table = it->it.pgtable;
-			table_physical = false;
-		} else {
-			uint64_t entry = it->table[level][it->index[level]++];
-			if (bswap)
-				entry = bswap_64(entry);
-			table = entry & ADDRESS_MASK;
-			if (!(entry & PRESENT) || (entry & PSE) || level == 0) {
-				uint64_t mask = (UINT64_C(1) <<
-						 (PAGE_SHIFT +
-						  PGTABLE_SHIFT * level)) - 1;
-				*virt_addr_ret = virt_addr & ~mask;
-				if (entry & PRESENT)
-					*phys_addr_ret = table & ~mask;
-				else
-					*phys_addr_ret = UINT64_MAX;
-				it->it.virt_addr = (virt_addr | mask) + 1;
-				return NULL;
-			}
-			table_physical = true;
-		}
-		index = (virt_addr >>
-			 (PAGE_SHIFT + PGTABLE_SHIFT * (level - 1))) & PGTABLE_MASK;
-		/*
-		 * It's only marginally more expensive to read 4096 bytes than 8
-		 * bytes, so we always read to the end of the table.
-		 */
-		err = drgn_program_read_memory(prog,
-					       &it->table[level - 1][index],
-					       table + 8 * index,
-					       sizeof(it->table[0]) - 8 * index,
-					       table_physical);
+	uint64_t table = it->it.pgtable;
+	bool table_physical = false;
+	for (int level = levels - 1; ; level--) {
+		uint16_t index = ((virt_addr >>
+				  (PAGE_SHIFT + PGTABLE_SHIFT * level))
+				  & PGTABLE_MASK);
+		uint64_t entry;
+		err = drgn_program_read_u64(prog, table + 8 * index,
+					    table_physical, &entry);
 		if (err)
 			return err;
-		it->index[level - 1] = index;
+		if (!(entry & PRESENT) || (entry & PSE) || level == 0) {
+			uint64_t mask = (UINT64_C(1) <<
+					 (PAGE_SHIFT +
+					  PGTABLE_SHIFT * level)) - 1;
+			*virt_addr_ret = virt_addr & ~mask;
+			if (entry & PRESENT)
+				*phys_addr_ret = entry & ADDRESS_MASK & ~mask;
+			else
+				*phys_addr_ret = UINT64_MAX;
+			it->it.virt_addr = (virt_addr | mask) + 1;
+			return NULL;
+		}
+		table = entry & ADDRESS_MASK;
+		table_physical = true;
 	}
 }
 

And timed reading a 512 MB chunk from userspace while attached to /proc/kcore:

import ctypes
import mmap
import os
import time

from drgn.helpers.linux.mm import access_remote_vm
from drgn.helpers.linux.pid import find_task

size = 512 * 1024 * 1024
map = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_POPULATE)
address = ctypes.addressof(ctypes.c_char.from_buffer(map))
mm = find_task(prog, os.getpid()).mm

start = time.monotonic()
access_remote_vm(mm, address, size)
print(time.monotonic() - start)

This took about 170 ms with the original code, and 300 ms with the change to read one PTE at a time.

This isn't super contrived: I've previously needed to do a large read so I could sift through it for certain values more efficiently. So I think there is still value in reading additional PTEs, although your use case makes it clear that we shouldn't read any more than strictly necessary.

I will take a stab at implementing a readahead hint for x86-64 tomorrow, then likely ask for your help for AArch64. If it adds too much complexity, then I'll drop it and go with this approach.

osandov added a commit that referenced this pull request Oct 11, 2023
Peter Collingbourne reported that the over-reading we do in the AArch64
page table iterator uses too much bandwidth for remote targets. His
original proposal in #312 was to change the page table iterator to only
read one entry per level. However, this would regress large reads that
do end up using the additional entries (in particular when the target is
/proc/kcore, which has a high latency per read but also high enough
bandwidth that the over-read is essentially free).

We can get the best of both worlds by informing the page table iterator
how much we expect to need (at the cost of some additional complexity in
this admittedly already pretty complex code). Requiring an accurate end
would limit the flexibility of the page table iterator and be more
error-prone, so let's make it a non-binding hint.

Add the hint and use it in the x86-64 page table iterator to only read
as many entries as necessary. Also extend the test case for large page
table reads to test this better.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
@osandov (Owner) commented Oct 11, 2023

Just pushed the infrastructure for limiting the readahead and the x86-64 implementation: 747e028. It should be straightforward to copy this to AArch64 since that one was modeled after the x86-64 version. Let me know if you'd like to do that. I'd be happy to if not.
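
Roughly, the hint is used to decide how many descriptors to fetch at each level (a simplified Python sketch of the logic with illustrative names, not the actual C):

PAGE_SHIFT = 12
PGTABLE_SHIFT = 9
ENTRIES = 1 << PGTABLE_SHIFT   # 512 descriptors per table

def entries_to_read(virt_addr, hint_end, level):
    # Return (first index, count) of descriptors needed at this level to cover
    # [virt_addr, hint_end), clamped to the end of the current table. The hint
    # is non-binding, so a bad hint only costs extra reads later, not correctness.
    shift = PAGE_SHIFT + PGTABLE_SHIFT * level
    first = (virt_addr >> shift) & (ENTRIES - 1)
    wanted = max(((hint_end - 1) >> shift) - (virt_addr >> shift) + 1, 1)
    return first, min(wanted, ENTRIES - first)

An 8-byte read only needs one descriptor per level, while a 512 MB read wants many level-0 descriptors but still only a handful at the higher levels.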

@pcc (Contributor, Author) commented Oct 18, 2023

I tried your 512 MB read program without and with this change on the M1. The results were as follows:

    N           Min           Max        Median           Avg        Stddev
x  10    0.10680043    0.10772684    0.10705753    0.10717723 0.00033758878
+  10    0.11364428    0.11520051      0.114269    0.11433144 0.00045813209
Difference at 95.0% confidence
        0.00715421 +/- 0.000378093
        6.67512% +/- 0.361236%
        (Student's t, pooled s = 0.0004024)

So it looks like the regression is much smaller on arm64; if that were the only number we had, I don't think I would consider it worth the added complexity of supporting readahead, because we don't typically do such large reads.

I wonder what setup you used to collect your numbers. Was it on bare metal or in a VM? Could there be anything causing syscall entry/exit to run more slowly?

@pcc (Contributor, Author) commented Oct 18, 2023

I just noticed that your x86_64 patch does not cache the intermediate levels. I wonder if that is the explanation for the difference?

@pcc (Contributor, Author) commented Nov 1, 2023

Ping, any thoughts on the above?

@osandov (Owner) commented Nov 1, 2023

You're right. I fixed the PTE-at-a-time x86-64 iterator and got similar results to yours. I will revert the readahead change and merge this once I've taken a closer look at it, likely before Thursday. Thanks for following up.

osandov added a commit that referenced this pull request Nov 1, 2023
This reverts commit 747e028 (except for
the test improvements). Peter Collingbourne noticed that the change I
used to test the performance of reading a single PTE at a time [1]
didn't cache higher level entries. Keeping that caching makes the
regression I was worried about negligible. So, there's no reason to add
the extra complexity of the hint.

1: #312 (comment)

Signed-off-by: Omar Sandoval <osandov@osandov.com>
@osandov (Owner) commented Nov 2, 2023

I reverted my change. I started taking a look at this but got derailed by other stuff, so I'll get back to this tomorrow.

@osandov (Owner) left a review comment

A couple of minor comments, but this looks reasonable, thanks.

3 review comments on libdrgn/arch_aarch64.c (outdated, resolved)
@osandov (Owner) commented Nov 3, 2023

I noticed we could avoid some of the looping and all of the memsets by jumping straight to the first uncached level, which we can derive from the highest set bit in virt_addr ^ cached_virt_addr. And if we initialize cached_virt_addr cleverly, we can avoid a special case for the first iteration, too.
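
The core of the trick is something like this (a simplified Python sketch of the logic rather than the actual C in that branch; level 0 is the leaf PTE level):

PAGE_SHIFT = 12
PGTABLE_SHIFT = 9

def first_stale_level(virt_addr, cached_virt_addr, levels=4):
    # Highest level whose cached descriptor no longer applies (the walk resumes
    # there), or None if the cached translation still covers virt_addr.
    diff = virt_addr ^ cached_virt_addr
    if diff < (1 << PAGE_SHIFT):
        return None                        # still within the same page
    highest_bit = diff.bit_length() - 1    # highest bit that changed
    level = (highest_bit - PAGE_SHIFT) // PGTABLE_SHIFT
    return min(level, levels - 1)          # changed above the top level: redo it all

# e.g. first_stale_level(0xffff000012346000, 0xffff000012345000) == 0: only the
# leaf PTE needs re-reading; the higher-level tables are unchanged. Initializing
# cached_virt_addr to differ from every real address in the top-level index bits
# makes the first call resume from the top with no special case.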

I iterated on top of your PR and pushed it here: https://github.com/osandov/drgn/tree/pte-per-level. It's a bit rough and is missing comments explaining the subtleties, but I'm curious if that's actually any faster. Would you mind benchmarking on your M1?

@pcc (Contributor, Author) commented Nov 4, 2023

That's a neat trick with the xor! However, I benchmarked both implementations using the technique from #312 (comment) and it seems there was no statistically significant difference:

+-------------------------------------------------------------------------------------------------------------------------------------------+
|x        +       xx            x  +    x  + x +   x   +    x+ +         x       x        +                                           +    +|
|                |________|_______________A_______________M________|A________________________________________|                              |
+-------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10     4.1495807     4.1727075     4.1616145     4.1614942  0.0073339332
+  10     4.1522676      4.189699      4.166033     4.1689961   0.012096091
No difference proven at 95.0% confidence
        0.00750187 +/- 0.00939835
        0.180269% +/- 0.22595%
        (Student's t, pooled s = 0.0100025)

Same result for the 512MB read program:

+-------------------------------------------------------------------------------------------------------------------------------------------+
|+                         +          x         x  +      +        xx++  +x+x +       x      xx            +                               x|
|                               |_________________|__________A_______M_____M__A___________|_______________|                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10    0.11364159    0.11529475    0.11425108    0.11430417 0.00045798308
+  10     0.1130408    0.11477116     0.1141585    0.11401902 0.00048005814
No difference proven at 95.0% confidence
        -0.000285147 +/- 0.000440812
        -0.249464% +/- 0.38519%
        (Student's t, pooled s = 0.00046915)

libdrgn: aarch64: Rework page table walker to only read one PTE per level

The current page table walker will on average read around half of the
entire page table for each level. This is inefficient, especially when
debugging a remote target which may have a low bandwidth connection to
the debugger. Address this by only reading one PTE per level.

I've only done the aarch64 page table walker because that's all that I
needed, but in principle the other page table walkers could work in a
similar way.

Signed-off-by: Peter Collingbourne <pcc@google.com>
@osandov (Owner) left a review comment

Bummer, I liked the xor trick :) I fixed a few style nitpicks in your version on my end, and I'll merge it now.

Thanks for your patience here. I owe you a few more PR reviews that I hope to get to in the coming days.

@osandov merged commit e99921d into osandov:main on Nov 6, 2023
6 of 38 checks passed