
libdrgn: aarch64: Rework page table walker to only read one PTE per level #312

Merged · 1 commit merged into osandov:main from pte-per-level on Nov 6, 2023

Conversation

@pcc (Contributor) commented Jun 28, 2023


The current page table walker will on average read around half of the entire page table for each level. This is inefficient, especially when debugging a remote target which may have a low bandwidth connection to the debugger. Address this by only reading one PTE per level.

I've only done the aarch64 page table walker because that's all that I needed, but in principle the other page table walkers could work in a similar way.
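
To illustrate, here is a rough sketch of the walk in Python (not the libdrgn C code): it assumes a 4 KiB granule with 4 levels, ignores block/huge mappings, and read_u64 stands in for whatever transport reads target memory. Exactly one 8-byte descriptor is fetched per level.

PAGE_SHIFT = 12            # 4 KiB granule (assumed)
PGTABLE_SHIFT = 9          # 512 descriptors of 8 bytes per table
ENTRY_MASK = (1 << PGTABLE_SHIFT) - 1
ADDRESS_MASK = 0x0000FFFFFFFFF000
VALID = 1 << 0

def translate(read_u64, pgtable, virt_addr, levels=4):
    # Walk from the top level down, reading one descriptor per level.
    table = pgtable
    for level in range(levels - 1, -1, -1):
        index = (virt_addr >> (PAGE_SHIFT + PGTABLE_SHIFT * level)) & ENTRY_MASK
        entry = read_u64(table + 8 * index)   # the single PTE read for this level
        if not entry & VALID:
            return None                       # not mapped
        table = entry & ADDRESS_MASK
    return table | (virt_addr & ((1 << PAGE_SHIFT) - 1))

The existing walker instead reads from the current entry to the end of each 4 KiB table, which works out to roughly 2 KiB per level on average.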

@pcc force-pushed the pte-per-level branch from 0f180b6 to ff688c8 on June 28, 2023 01:19
@brenns10 (Contributor) commented

I'm a bit confused: how are you debugging a remote target using drgn? Do you mean a vmcore on a remote filesystem?

I only ask because in my experience, /proc/kcore tends to be high latency, but not terribly low bandwidth. Normally performance optimizations involve reading more data at a time, with fewer read requests. If I understand this correctly, you're looking to do the reverse: read less data per request.

So I guess I'm curious what situation leads to this particular set of constraints you're facing?

@pcc (Contributor, Author) commented Jun 28, 2023

I'm using an SWD connection to the target via OpenOCD. I have a drgn patch to add OpenOCD support that I intend to share in the next few days, once I've removed some hardcoding specific to my target. For various reasons it isn't always feasible to remotely mount my target's /proc/kcore, so I didn't consider that route. SWD also enables some additional capabilities compared to /proc/kcore, such as accessing MMIO registers on the target and debugging non-Linux operating systems and bare-metal microcontroller firmware.

SWD connections tend to be fairly low bandwidth. For example, a common maximum adapter frequency is in the 10 MHz range. Taking protocol overhead into account, the theoretical maximum data transfer speed is 10 Mb/s × 32 / (32 + 15) ≈ 6.81 Mb/s ≈ 851 KB/s. But this is only theoretical: the design of the protocol between the host and the adapter can have a big impact. In my experience, data transfer speeds can vary between 10% and 50% of the theoretical maximum depending on the adapter protocol. So with the existing code, the worst case translates to on average (2 KiB × 3 levels) / 85.1 KB/s ≈ 0.07 s to transfer a 3-level page table, not taking round-trip time into account.
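
Restating that arithmetic as a tiny script (the numbers are the assumptions above, not measurements):

adapter_hz = 10_000_000                              # 10 MHz SWD clock
peak_kb_s = adapter_hz * 32 / (32 + 15) / 8 / 1000   # ≈ 851 KB/s theoretical
worst_case_kb_s = 0.10 * peak_kb_s                   # ≈ 85.1 KB/s at 10% efficiency
walk_kb = 2 * 3                                      # ~2 KiB per level, 3 levels
print(walk_kb / worst_case_kb_s)                     # ≈ 0.07 s per page table walk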

With drgn's typical usage model, we have a large number of transfers of small amounts of data which each individually require a page table transfer, so it makes sense to optimize for the case where only a single PTE is required at each level.

@osandov (Owner) commented Jun 29, 2023

As I alluded to in #315 and @brenns10 mentioned, this is done intentionally because /proc/kcore has a horrible per-read overhead (see also #106). linux_kernel_pgtable_iterator_next_x86_64() even has a comment justifying this that I failed to copy over to the AArch64 version. Doing it this way would be a huge regression for the /proc/kcore case, and vice versa, so although it will complicate things, we might need to choose the strategy based on whether the target is local or remote. Alternatively, we could give the page table iterator interface a hint about how much it should "read ahead".

@pcc (Contributor, Author) commented Jun 30, 2023

Thanks for taking a look. I think it would be worth measuring the impact in practice; in my experience drgn typically does a large number of reads of small size (word size or less), which means that the additional PTEs loaded by the existing code are almost never needed.

I don't currently have a setup for debugging the kernel with a local /proc/kcore on aarch64. I will try to set that up and collect some performance numbers.

@osandov (Owner) commented Jun 30, 2023

That's fair; you're right that the vast majority of reads are smaller than a page and thus don't benefit from the extra reads at all. I think the readahead hint idea is the best option across the board, but it could be complicated. I can take a stab at implementing it, and if it's too difficult, I think this patch would be okay.

@pcc (Contributor, Author) commented Jun 30, 2023

I measured the performance on an Apple M1 Mac Mini running an Asahi Linux kernel (commit b5c05cbffb0488c7618106926d522cc3b43d93d5 from the Asahi kernel repo) by running this 10 times and comparing the results without and with this patch.

timeit.timeit(lambda: fs.path_lookup(prog, '/sys/bus/platform/devices'), number=10000)

The results were (x = without patch, + = with patch):

    N           Min           Max        Median           Avg        Stddev
x  10     5.2588345     5.3796673     5.2976387     5.3028653   0.034026569
+  10     5.1722865     5.3094247     5.2717144     5.2574367   0.044795798
Difference at 95.0% confidence
        -0.0454286 +/- 0.0373746
        -0.85668% +/- 0.702597%
        (Student's t, pooled s = 0.0397773)

So it looks like we're typically slightly faster with this patch, even in the local case.

@osandov (Owner) commented Jun 30, 2023

Hm, is the page table iterator being called for this test case? I would expect that all of these memory reads would come from /proc/kcore, which is added here:

err = drgn_program_add_memory_segment(prog, phdr->p_vaddr,

@pcc (Contributor, Author) commented Jun 30, 2023

Right, it isn't being called, as you also pointed out in #310. So the above was just measurement noise.

So what do you reckon would be a good test case for this? Maybe we can read a userspace process's memory?

@osandov (Owner) commented Jun 30, 2023

cmdline() and environ() read via the page table under the hood and are fairly realistic use cases, so those might make for a good test. Something like:

import os
from drgn.helpers.linux.pid import find_task

task = find_task(prog, os.getpid())
# Then use timeit to measure cmdline(task) and environ(task)

@pcc (Contributor, Author) commented Jun 30, 2023

Thanks, confirmed that environ(task) does end up calling the page table iterator. I did:

timeit.timeit(lambda: environ(task), number=1000000)
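
For reference, the full benchmark was roughly the following, run under the drgn CLI (which provides prog); I'm assuming the usual drgn import paths for these helpers:

import os
import timeit

from drgn.helpers.linux.mm import environ
from drgn.helpers.linux.pid import find_task

task = find_task(prog, os.getpid())   # prog is supplied by the drgn CLI
print(timeit.timeit(lambda: environ(task), number=1000000))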

With the PR as uploaded, we were a bit slower with this patch:

    N           Min           Max        Median           Avg        Stddev
x  10      5.034446      5.132914     5.0561243     5.0657224   0.028014023
+  10     5.3784772     5.5665043     5.4984312     5.4838718   0.064820429

This was a bit of a surprising result because linux_kernel_pgtable_iterator_next_aarch64 was only being called once for environ(task). I investigated, and it looks like it was due to the memset on line 470 of my patch clearing the same memory multiple times, which I had previously assumed would be insignificant compared to actually reading the page table. As a test I removed it, and we were faster with this patch.

    N           Min           Max        Median           Avg        Stddev
x  10      5.034446      5.132914     5.0561243     5.0657224   0.028014023
+  10     4.6734759     4.8093641     4.7221596     4.7194108    0.04111319
Difference at 95.0% confidence
        -0.346312 +/- 0.0330537
        -6.83637% +/- 0.638692%
        (Student's t, pooled s = 0.0351787)

Removing the memset isn't correct in general so I'll look at a proper fix.

@pcc force-pushed the pte-per-level branch from ff688c8 to 0531018 on June 30, 2023 21:24
@pcc (Contributor, Author) commented Jun 30, 2023

With the patch I just uploaded the results look like this:

    N           Min           Max        Median           Avg        Stddev
x  10      5.034446      5.132914     5.0561243     5.0657224   0.028014023
+  10     4.1080728     4.2879725     4.1325089     4.1497191   0.053896147
Difference at 95.0% confidence
        -0.916003 +/- 0.0403566
        -18.0824% +/- 0.768284%
        (Student's t, pooled s = 0.042951)

@osandov (Owner) commented Oct 10, 2023

The other case to test, which I should've mentioned, is a large read. I implemented the equivalent of this PR for x86-64:

 libdrgn/arch_x86_64.c | 70 ++++++++++++++++++---------------------------------
 1 file changed, 24 insertions(+), 46 deletions(-)

diff --git a/libdrgn/arch_x86_64.c b/libdrgn/arch_x86_64.c
index 2af98f70..f1c554c8 100644
--- a/libdrgn/arch_x86_64.c
+++ b/libdrgn/arch_x86_64.c
@@ -600,11 +600,10 @@ linux_kernel_pgtable_iterator_next_x86_64(struct drgn_program *prog,
 	static const uint64_t PSE = 0x80; /* a.k.a. huge page */
 	static const uint64_t ADDRESS_MASK = UINT64_C(0xffffffffff000);
 	struct drgn_error *err;
-	bool bswap = drgn_platform_bswap(&prog->platform);
 	struct pgtable_iterator_x86_64 *it =
 		container_of(_it, struct pgtable_iterator_x86_64, it);
 	uint64_t virt_addr = it->it.virt_addr;
-	int levels = prog->vmcoreinfo.pgtable_l5_enabled ? 5 : 4, level;
+	int levels = prog->vmcoreinfo.pgtable_l5_enabled ? 5 : 4;
 
 	uint64_t start_non_canonical =
 		(UINT64_C(1) <<
@@ -619,52 +618,31 @@ linux_kernel_pgtable_iterator_next_x86_64(struct drgn_program *prog,
 		return NULL;
 	}
 
-	/* Find the lowest level with cached entries. */
-	for (level = 0; level < levels; level++) {
-		if (it->index[level] < array_size(it->table[level]))
-			break;
-	}
-	/* For every level below that, refill the cache/return pages. */
-	for (;; level--) {
-		uint64_t table;
-		bool table_physical;
-		uint16_t index;
-		if (level == levels) {
-			table = it->it.pgtable;
-			table_physical = false;
-		} else {
-			uint64_t entry = it->table[level][it->index[level]++];
-			if (bswap)
-				entry = bswap_64(entry);
-			table = entry & ADDRESS_MASK;
-			if (!(entry & PRESENT) || (entry & PSE) || level == 0) {
-				uint64_t mask = (UINT64_C(1) <<
-						 (PAGE_SHIFT +
-						  PGTABLE_SHIFT * level)) - 1;
-				*virt_addr_ret = virt_addr & ~mask;
-				if (entry & PRESENT)
-					*phys_addr_ret = table & ~mask;
-				else
-					*phys_addr_ret = UINT64_MAX;
-				it->it.virt_addr = (virt_addr | mask) + 1;
-				return NULL;
-			}
-			table_physical = true;
-		}
-		index = (virt_addr >>
-			 (PAGE_SHIFT + PGTABLE_SHIFT * (level - 1))) & PGTABLE_MASK;
-		/*
-		 * It's only marginally more expensive to read 4096 bytes than 8
-		 * bytes, so we always read to the end of the table.
-		 */
-		err = drgn_program_read_memory(prog,
-					       &it->table[level - 1][index],
-					       table + 8 * index,
-					       sizeof(it->table[0]) - 8 * index,
-					       table_physical);
+	uint64_t table = it->it.pgtable;
+	bool table_physical = false;
+	for (int level = levels - 1; ; level--) {
+		uint16_t index = ((virt_addr >>
+				  (PAGE_SHIFT + PGTABLE_SHIFT * level))
+				  & PGTABLE_MASK);
+		uint64_t entry;
+		err = drgn_program_read_u64(prog, table + 8 * index,
+					    table_physical, &entry);
 		if (err)
 			return err;
-		it->index[level - 1] = index;
+		if (!(entry & PRESENT) || (entry & PSE) || level == 0) {
+			uint64_t mask = (UINT64_C(1) <<
+					 (PAGE_SHIFT +
+					  PGTABLE_SHIFT * level)) - 1;
+			*virt_addr_ret = virt_addr & ~mask;
+			if (entry & PRESENT)
+				*phys_addr_ret = entry & ADDRESS_MASK & ~mask;
+			else
+				*phys_addr_ret = UINT64_MAX;
+			it->it.virt_addr = (virt_addr | mask) + 1;
+			return NULL;
+		}
+		table = entry & ADDRESS_MASK;
+		table_physical = true;
 	}
 }
 

And timed reading a 512 MB chunk from userspace while attached to /proc/kcore:

import ctypes
import mmap
import os
import time

from drgn.helpers.linux.mm import access_remote_vm
from drgn.helpers.linux.pid import find_task

size = 512 * 1024 * 1024
map = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_POPULATE)
address = ctypes.addressof(ctypes.c_char.from_buffer(map))
mm = find_task(prog, os.getpid()).mm

start = time.monotonic()
access_remote_vm(mm, address, size)
print(time.monotonic() - start)

This took about 170 ms with the original code, and 300 ms with the change to read one PTE at a time.

This isn't super contrived: I've previously needed to do a large read so I could sift through it for certain values more efficiently. So I think there is still value in reading additional PTEs, although your use case makes it clear that we shouldn't read any more than strictly necessary.

I will take a stab at implementing a readahead hint for x86-64 tomorrow, then likely ask for your help for AArch64. If it adds too much complexity, then I'll drop it and go with this approach.

osandov added a commit that referenced this pull request Oct 11, 2023
Peter Collingbourne reported that the over-reading we do in the AArch64
page table iterator uses too much bandwidth for remote targets. His
original proposal in #312 was to change the page table iterator to only
read one entry per level. However, this would regress large reads that
do end up using the additional entries (in particular when the target is
/proc/kcore, which has a high latency per read but also high enough
bandwidth that the over-read is essentially free).

We can get the best of both worlds by informing the page table iterator
how much we expect to need (at the cost of some additional complexity in
this admittedly already pretty complex code). Requiring an accurate end
would limit the flexibility of the page table iterator and be more
error-prone, so let's make it a non-binding hint.

Add the hint and use it in the x86-64 page table iterator to only read
as many entries as necessary. Also extend the test case for large page
table reads to test this better.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
@osandov (Owner) commented Oct 11, 2023

Just pushed the infrastructure for limiting the readahead and the x86-64 implementation: 747e028. It should be straightforward to copy this to AArch64 since that one was modeled after the x86-64 version. Let me know if you'd like to do that. I'd be happy to if not.
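
Roughly, the hint is used to decide how many descriptors to fetch at each level (a simplified Python sketch of the logic with illustrative names, not the actual C):

PAGE_SHIFT = 12
PGTABLE_SHIFT = 9
ENTRIES = 1 << PGTABLE_SHIFT   # 512 descriptors per table

def entries_to_read(virt_addr, hint_end, level):
    # Return (first index, count) of descriptors needed at this level to cover
    # [virt_addr, hint_end), clamped to the end of the current table. The hint
    # is non-binding, so a bad hint only costs extra reads later, not correctness.
    shift = PAGE_SHIFT + PGTABLE_SHIFT * level
    first = (virt_addr >> shift) & (ENTRIES - 1)
    wanted = max(((hint_end - 1) >> shift) - (virt_addr >> shift) + 1, 1)
    return first, min(wanted, ENTRIES - first)

An 8-byte read only needs one descriptor per level, while a 512 MB read wants many level-0 descriptors but still only a handful at the higher levels.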

@pcc (Contributor, Author) commented Oct 18, 2023

I tried your 512 MB read program without and with this change on the M1. The results were as follows:

    N           Min           Max        Median           Avg        Stddev
x  10    0.10680043    0.10772684    0.10705753    0.10717723 0.00033758878
+  10    0.11364428    0.11520051      0.114269    0.11433144 0.00045813209
Difference at 95.0% confidence
        0.00715421 +/- 0.000378093
        6.67512% +/- 0.361236%
        (Student's t, pooled s = 0.0004024)

So it looks like the regression is much smaller on arm64; if that were the only number we had, I don't think I would consider it worth the added complexity of supporting readahead, because we don't typically do such large reads.

I wonder what setup you used to collect your numbers. Was it on bare metal or in a VM? Could there be anything causing syscall entry/exit to run more slowly?

@pcc (Contributor, Author) commented Oct 18, 2023

I just noticed that your x86_64 patch does not cache the intermediate levels. I wonder if that is the explanation for the difference?

@pcc (Contributor, Author) commented Nov 1, 2023

Ping, any thoughts on the above?

@osandov (Owner) commented Nov 1, 2023

You're right. I fixed the PTE-at-a-time x86-64 iterator and got similar results to yours. I will revert the readahead change and merge this once I've taken a closer look at it, likely before Thursday. Thanks for following up.

osandov added a commit that referenced this pull request Nov 1, 2023
This reverts commit 747e028 (except for
the test improvements). Peter Collingbourne noticed that the change I
used to test the performance of reading a single PTE at a time [1]
didn't cache higher level entries. Keeping that caching makes the
regression I was worried about negligible. So, there's no reason to add
the extra complexity of the hint.

1: #312 (comment)

Signed-off-by: Omar Sandoval <osandov@osandov.com>
@osandov (Owner) commented Nov 2, 2023

I reverted my change. I started taking a look at this but got derailed by other stuff, so I'll get back to this tomorrow.

@osandov (Owner) left a review comment

A couple of minor comments, but this looks reasonable, thanks.

3 review comments on libdrgn/arch_aarch64.c (outdated, resolved)
@osandov (Owner) commented Nov 3, 2023

I noticed we could avoid some of the looping and all of the memsets by jumping straight to the first uncached level, which we can derive from the highest set bit in virt_addr ^ cached_virt_addr. And if we initialize cached_virt_addr cleverly, we can avoid a special case for the first iteration, too.
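
The core of the trick is something like this (a simplified Python sketch of the logic rather than the actual C in that branch; level 0 is the leaf PTE level):

PAGE_SHIFT = 12
PGTABLE_SHIFT = 9

def first_stale_level(virt_addr, cached_virt_addr, levels=4):
    # Highest level whose cached descriptor no longer applies (the walk resumes
    # there), or None if the cached translation still covers virt_addr.
    diff = virt_addr ^ cached_virt_addr
    if diff < (1 << PAGE_SHIFT):
        return None                        # still within the same page
    highest_bit = diff.bit_length() - 1    # highest bit that changed
    level = (highest_bit - PAGE_SHIFT) // PGTABLE_SHIFT
    return min(level, levels - 1)          # changed above the top level: redo it all

# e.g. first_stale_level(0xffff000012346000, 0xffff000012345000) == 0: only the
# leaf PTE needs re-reading; the higher-level tables are unchanged. Initializing
# cached_virt_addr to differ from every real address in the top-level index bits
# makes the first call resume from the top with no special case.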

I iterated on top of your PR and pushed it here: https://github.com/osandov/drgn/tree/pte-per-level. It's a bit rough and is missing comments explaining the subtleties, but I'm curious if that's actually any faster. Would you mind benchmarking on your M1?

@pcc (Contributor, Author) commented Nov 4, 2023

That's a neat trick with the xor! However, I benchmarked both implementations using the technique from #312 (comment) and it seems there was no statistically significant difference:

+-------------------------------------------------------------------------------------------------------------------------------------------+
|x        +       xx            x  +    x  + x +   x   +    x+ +         x       x        +                                           +    +|
|                |________|_______________A_______________M________|A________________________________________|                              |
+-------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10     4.1495807     4.1727075     4.1616145     4.1614942  0.0073339332
+  10     4.1522676      4.189699      4.166033     4.1689961   0.012096091
No difference proven at 95.0% confidence
        0.00750187 +/- 0.00939835
        0.180269% +/- 0.22595%
        (Student's t, pooled s = 0.0100025)

Same result for the 512MB read program:

+-------------------------------------------------------------------------------------------------------------------------------------------+
|+                         +          x         x  +      +        xx++  +x+x +       x      xx            +                               x|
|                               |_________________|__________A_______M_____M__A___________|_______________|                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10    0.11364159    0.11529475    0.11425108    0.11430417 0.00045798308
+  10     0.1130408    0.11477116     0.1141585    0.11401902 0.00048005814
No difference proven at 95.0% confidence
        -0.000285147 +/- 0.000440812
        -0.249464% +/- 0.38519%
        (Student's t, pooled s = 0.00046915)

libdrgn: aarch64: Rework page table walker to only read one PTE per level

The current page table walker will on average read around half of the
entire page table for each level. This is inefficient, especially when
debugging a remote target which may have a low bandwidth connection to
the debugger. Address this by only reading one PTE per level.

I've only done the aarch64 page table walker because that's all that I
needed, but in principle the other page table walkers could work in a
similar way.

Signed-off-by: Peter Collingbourne <pcc@google.com>
@osandov (Owner) left a review comment

Bummer, I liked the xor trick :) I fixed a few style nitpicks in your version on my end, and I'll merge it now.

Thanks for your patience here. I owe you a few more PR reviews that I hope to get to in the coming days.

@osandov merged commit e99921d into osandov:main on Nov 6, 2023
6 of 38 checks passed