libdrgn: aarch64: Rework page table walker to only read one PTE per level #312
Conversation
I'm a bit confused: how are you debugging a remote target using drgn? Do you mean a vmcore on a remote filesystem? I ask because I'm curious what situation leads to this particular set of constraints you're facing.
I'm utilizing an SWD connection to the target via OpenOCD. I have a drgn patch to add OpenOCD support that I intend to share in the next few days, once I've removed some hardcoding for my target. For various reasons it isn't always feasible to remote mount my target's filesystem. SWD connections tend to be fairly low bandwidth. For example, a common maximum adapter frequency is in the 10 MHz range. Taking into account the protocol overhead, we have a theoretical maximum data transfer speed of 10 Mb/s * 32 / (32 + 15) = 6.81 Mb/s ≈ 851 KB/s. But this is only theoretical: the design of the protocol between the host and the adapter can have a big impact. In my experience, data transfer speeds can vary between 10% and 50% of the theoretical maximum depending on the adapter protocol. So with the existing code, in the worst case this translates to on average (2 KiB per level * 3 levels) / 85.1 KB/s ≈ 0.07 s to transfer a 3-level page table, not taking into account round-trip time. With drgn's typical usage model, we have a large number of transfers of small amounts of data, each of which individually requires a page table transfer, so it makes sense to optimize for the case where only a single PTE is required at each level.
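As a quick sanity check on those numbers, here is the same arithmetic written out in Python; the adapter frequency, overhead ratio, efficiency figure, and per-level read size are just the ones quoted above, not measurements:

# Back-of-the-envelope numbers from the comment above.
adapter_hz = 10e6                    # common maximum SWD adapter frequency (bits/s)
data_bits, overhead_bits = 32, 15    # per 32-bit SWD transfer, per the 32/(32+15) figure

theoretical_bps = adapter_hz * data_bits / (data_bits + overhead_bits)
theoretical_bytes = theoretical_bps / 8        # ~851 KB/s

worst_case_bytes = 0.10 * theoretical_bytes    # ~10% of theoretical in practice
levels = 3                                     # 3-level page table
avg_read_per_level = 4096 / 2                  # old code reads ~half of a 4 KiB table

print(levels * avg_read_per_level / worst_case_bytes)  # ~0.07 s per page table walk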
As I alluded to in #315 and @brenns10 mentioned, this is done intentionally because it's only marginally more expensive to read 4096 bytes than 8 bytes, so the iterator reads to the end of the table and caches the extra entries for later walks.
Thanks for taking a look. I think it would be worth measuring the impact in practice; in my experience drgn typically does a large number of reads of small size (word size or less), which means that the additional PTEs loaded by the existing code are almost never needed. I don't currently have a setup for debugging the kernel with a local /proc/kcore.
That's fair, you're right that the vast majority of reads are smaller than a page and thus don't benefit from the extra reads at all. I think the readahead hint idea is the best option across the board, but it could be complicated. I can take a stab at implementing it, and if it's too difficult, I think this patch would be okay.
I measured the performance on an Apple M1 Mac Mini running an Asahi Linux kernel (commit b5c05cbffb0488c7618106926d522cc3b43d93d5 from the Asahi kernel repo) by running a small read benchmark 10 times and comparing the results without and with this patch.
The results were:
So it looks like we're typically slightly faster with this patch, even in the local case.
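The benchmark script itself isn't included above, so the following is only a guess at its general shape: a drgn script that times a batch of small reads against the running kernel and repeats that 10 times. The workload (repeated word-sized reads of a kernel variable) and the iteration counts are stand-ins, not the actual script used:

# Hypothetical timing harness; run it via the drgn CLI, which provides `prog`.
import time

def bench(n=100_000):
    start = time.monotonic()
    for _ in range(n):
        prog["jiffies"].value_()   # a typical small, word-sized read
    return time.monotonic() - start

times = [bench() for _ in range(10)]
print(min(times), sum(times) / len(times), max(times))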
Hm, is the page table iterator being called for this test case? I would expect that all of these memory reads would come from the code at line 456 in cc0994a.
Right, it isn't being called, as you also pointed out in #310. So the above was just measurement noise. What do you reckon would be a good test case for this, then? Maybe we can read a userspace process's memory?
Thanks, confirmed that the page table iterator is exercised when reading a userspace process's memory.
With the PR as uploaded, we were a bit slower with this patch:
This was a bit of a surprising result because the extra time turned out to be spent in the memset of the iterator's cache rather than in the reads themselves. Removing the memset isn't correct in general, so I'll look at a proper fix.
With the patch I just uploaded, the results look like this:
The other case to test that I should've mentioned is a large read. I implemented the equivalent to this PR for x86-64:

 libdrgn/arch_x86_64.c | 70 ++++++++++++++++++----------------------------------
1 file changed, 24 insertions(+), 46 deletions(-)
diff --git a/libdrgn/arch_x86_64.c b/libdrgn/arch_x86_64.c
index 2af98f70..f1c554c8 100644
--- a/libdrgn/arch_x86_64.c
+++ b/libdrgn/arch_x86_64.c
@@ -600,11 +600,10 @@ linux_kernel_pgtable_iterator_next_x86_64(struct drgn_program *prog,
         static const uint64_t PSE = 0x80; /* a.k.a. huge page */
         static const uint64_t ADDRESS_MASK = UINT64_C(0xffffffffff000);
         struct drgn_error *err;
-        bool bswap = drgn_platform_bswap(&prog->platform);
         struct pgtable_iterator_x86_64 *it =
                 container_of(_it, struct pgtable_iterator_x86_64, it);
         uint64_t virt_addr = it->it.virt_addr;
-        int levels = prog->vmcoreinfo.pgtable_l5_enabled ? 5 : 4, level;
+        int levels = prog->vmcoreinfo.pgtable_l5_enabled ? 5 : 4;
         uint64_t start_non_canonical =
                 (UINT64_C(1) <<
@@ -619,52 +618,31 @@
                 return NULL;
         }
-        /* Find the lowest level with cached entries. */
-        for (level = 0; level < levels; level++) {
-                if (it->index[level] < array_size(it->table[level]))
-                        break;
-        }
-        /* For every level below that, refill the cache/return pages. */
-        for (;; level--) {
-                uint64_t table;
-                bool table_physical;
-                uint16_t index;
-                if (level == levels) {
-                        table = it->it.pgtable;
-                        table_physical = false;
-                } else {
-                        uint64_t entry = it->table[level][it->index[level]++];
-                        if (bswap)
-                                entry = bswap_64(entry);
-                        table = entry & ADDRESS_MASK;
-                        if (!(entry & PRESENT) || (entry & PSE) || level == 0) {
-                                uint64_t mask = (UINT64_C(1) <<
-                                                 (PAGE_SHIFT +
-                                                  PGTABLE_SHIFT * level)) - 1;
-                                *virt_addr_ret = virt_addr & ~mask;
-                                if (entry & PRESENT)
-                                        *phys_addr_ret = table & ~mask;
-                                else
-                                        *phys_addr_ret = UINT64_MAX;
-                                it->it.virt_addr = (virt_addr | mask) + 1;
-                                return NULL;
-                        }
-                        table_physical = true;
-                }
-                index = (virt_addr >>
-                         (PAGE_SHIFT + PGTABLE_SHIFT * (level - 1))) & PGTABLE_MASK;
-                /*
-                 * It's only marginally more expensive to read 4096 bytes than 8
-                 * bytes, so we always read to the end of the table.
-                 */
-                err = drgn_program_read_memory(prog,
-                                               &it->table[level - 1][index],
-                                               table + 8 * index,
-                                               sizeof(it->table[0]) - 8 * index,
-                                               table_physical);
+        uint64_t table = it->it.pgtable;
+        bool table_physical = false;
+        for (int level = levels - 1; ; level--) {
+                uint16_t index = ((virt_addr >>
+                                   (PAGE_SHIFT + PGTABLE_SHIFT * level))
+                                  & PGTABLE_MASK);
+                uint64_t entry;
+                err = drgn_program_read_u64(prog, table + 8 * index,
+                                            table_physical, &entry);
                 if (err)
                         return err;
-                it->index[level - 1] = index;
+                if (!(entry & PRESENT) || (entry & PSE) || level == 0) {
+                        uint64_t mask = (UINT64_C(1) <<
+                                         (PAGE_SHIFT +
+                                          PGTABLE_SHIFT * level)) - 1;
+                        *virt_addr_ret = virt_addr & ~mask;
+                        if (entry & PRESENT)
+                                *phys_addr_ret = entry & ADDRESS_MASK & ~mask;
+                        else
+                                *phys_addr_ret = UINT64_MAX;
+                        it->it.virt_addr = (virt_addr | mask) + 1;
+                        return NULL;
+                }
+                table = entry & ADDRESS_MASK;
+                table_physical = true;
         }
 }
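For reference, the per-level index computation in the new loop is just a shift-and-mask on the virtual address. Here is the same arithmetic in Python; the numeric values (PAGE_SHIFT = 12, PGTABLE_SHIFT = 9, 512 entries per table) are the standard x86-64 4 KiB-page constants and are assumed here rather than taken from the diff above:

PAGE_SHIFT = 12      # 4 KiB pages
PGTABLE_SHIFT = 9    # 512 eight-byte entries per table
PGTABLE_MASK = (1 << PGTABLE_SHIFT) - 1

def pte_indices(virt_addr, levels=4):
    """Index into the page table at each level of a 4-level x86-64 walk."""
    return [(virt_addr >> (PAGE_SHIFT + PGTABLE_SHIFT * level)) & PGTABLE_MASK
            for level in range(levels - 1, -1, -1)]

# Walking 0x7f1234567000 touches exactly one 8-byte entry per level.
print(pte_indices(0x7f1234567000))  # [254, 72, 418, 359]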
And timed reading a 512 MB chunk from userspace while attached to /proc/kcore:

import ctypes
import mmap
import os
import time
from drgn.helpers.linux.mm import access_remote_vm
from drgn.helpers.linux.pid import find_task
size = 512 * 1024 * 1024
map = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_POPULATE)
address = ctypes.addressof(ctypes.c_char.from_buffer(map))
mm = find_task(prog, os.getpid()).mm
start = time.monotonic()
access_remote_vm(mm, address, size)
print(time.monotonic() - start)

This took about 170 ms with the original code, and 300 ms with the change to read one PTE at a time. This isn't super contrived: I've needed to do a large read so I could sift through it for certain values more efficiently before. So I think there is still value in reading additional PTEs, although your use case makes it clear that we shouldn't read any more than are strictly necessary. I will take a stab at implementing a readahead hint for x86-64 tomorrow, then likely ask for your help for AArch64. If it adds too much complexity, then I'll drop it and go with this approach.
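To see why a large read is the case that suffers, it helps to count the page table accesses involved. The numbers below just restate the 512 MB example above, assuming 4 KiB pages and 512 eight-byte entries per table:

# Page table traffic for one 512 MB read (4 KiB pages, 512 entries per table).
size = 512 * 1024 * 1024
pages = size // 4096                  # 131072 leaf PTEs needed
pte_tables = pages // 512             # 256 full last-level tables

one_pte_at_a_time = pages             # 131072 separate 8-byte reads
whole_table_reads = pte_tables        # 256 reads of 4 KiB each (plus a few upper levels)

print(one_pte_at_a_time, whole_table_reads)  # 131072 256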
Peter Collingbourne reported that the over-reading we do in the AArch64 page table iterator uses too much bandwidth for remote targets. His original proposal in #312 was to change the page table iterator to only read one entry per level. However, this would regress large reads that do end up using the additional entries (in particular when the target is /proc/kcore, which has a high latency per read but also high enough bandwidth that the over-read is essentially free).

We can get the best of both worlds by informing the page table iterator how much we expect to need (at the cost of some additional complexity in this admittedly already pretty complex code). Requiring an accurate end would limit the flexibility of the page table iterator and be more error-prone, so let's make it a non-binding hint.

Add the hint and use it in the x86-64 page table iterator to only read as many entries as necessary. Also extend the test case for large page table reads to test this better.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Just pushed the infrastructure for limiting the readahead and the x86-64 implementation: 747e028. It should be straightforward to copy this to AArch64 since that one was modeled after the x86-64 version. Let me know if you'd like to do that. I'd be happy to if not.
I tried your 512MB read program without/with this change on the M1. The results were as follows:
So it looks like it's a much smaller regression on arm64, and if that were the only number we had, I don't think I would consider it worth the added complexity to support readahead, because we don't typically do such large reads. I wonder what setup you used to collect your numbers. Was it on bare metal or in a VM? Could there be anything causing syscall entry/exit to run more slowly?
I just noticed that your x86_64 patch does not cache the intermediate levels. I wonder if that is the explanation for the difference?
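For context on what caching the intermediate levels buys: with the caches kept, a walk only has to re-read the levels whose table index actually changed since the previous walk, which for sequential access is almost always just the leaf PTE. The rough count below for a sequential 512 MB scan uses the standard 4 KiB-page constants; it illustrates the effect and is not drgn's code:

PAGE_SHIFT, PGTABLE_SHIFT, LEVELS = 12, 9, 4

def levels_to_reread(prev_virt, virt):
    """PTE reads needed for `virt` when the walker caches the tables it
    used for `prev_virt`: one per level whose index changed, at least one."""
    diff = prev_virt ^ virt
    n = 1  # the leaf PTE is always re-read
    for level in range(1, LEVELS):
        if diff >> (PAGE_SHIFT + PGTABLE_SHIFT * level):
            n = level + 1
    return n

# Sequential 512 MB scan, one page at a time, from a 1 GiB-aligned base.
base, pages = 0x7f0000000000, 512 * 1024 * 1024 // 4096
reads_cached = sum(levels_to_reread(base + (i - 1) * 4096, base + i * 4096)
                   for i in range(1, pages)) + LEVELS   # + the first full walk
reads_uncached = pages * LEVELS                          # no caching: 4 reads per page
print(reads_cached, reads_uncached)                      # 131330 524288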
Ping, any thoughts on the above?
You're right, I fixed the pte-at-a-time x86-64 iterator and got similar results to yours. I will revert the readahead change and merge this once I take a closer look at the change, likely before Thursday. Thanks for following up.
This reverts commit 747e028 (except for the test improvements).

Peter Collingbourne noticed that the change I used to test the performance of reading a single PTE at a time [1] didn't cache higher-level entries. Keeping that caching makes the regression I was worried about negligible. So, there's no reason to add the extra complexity of the hint.

1: #312 (comment)

Signed-off-by: Omar Sandoval <osandov@osandov.com>
I reverted my change. I started taking a look at this but got derailed by other stuff, so I'll get back to this tomorrow.
A couple of minor comments, but this looks reasonable, thanks.
I noticed we could avoid some of the looping and all of the memsets by jumping straight to the first uncached level, which we can derive from the highest set bit of the XOR of the cached and requested virtual addresses. I iterated on top of your PR and pushed it here: https://github.com/osandov/drgn/tree/pte-per-level. It's a bit rough and is missing comments explaining the subtleties, but I'm curious if that's actually any faster. Would you mind benchmarking on your M1?
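In outline, the trick being described is that the highest set bit of the XOR tells you the highest level whose index changed, so the walk can resume there without looping over the cached levels. A small Python illustration of the idea, using Python's bit_length() as the "highest set bit" operation; the helper name, constants, and exact expression are illustrative, not drgn's code:

PAGE_SHIFT, PGTABLE_SHIFT = 12, 9

def first_uncached_level(cached_virt, virt_addr, levels=4):
    """Highest level whose table index differs between the two addresses,
    i.e. the level at which a fresh walk has to start."""
    diff = cached_virt ^ virt_addr
    if diff >> PAGE_SHIFT == 0:
        return 0                      # only the page offset changed
    highest_bit = diff.bit_length() - 1
    level = (highest_bit - PAGE_SHIFT) // PGTABLE_SHIFT
    return min(level, levels - 1)     # clamp: a top-level change means a full walk

# Two addresses in the same 2 MiB region share every level above the last one.
print(first_uncached_level(0x7f1234567000, 0x7f1234569000))  # 0
print(first_uncached_level(0x7f1234567000, 0x7f1240000000))  # 2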
That's a neat trick with the xor! However, I benchmarked both implementations using the technique from #312 (comment) and it seems there was no statistically significant difference:
Same result for the 512MB read program:
libdrgn: aarch64: Rework page table walker to only read one PTE per level

The current page table walker will on average read around half of the entire page table for each level. This is inefficient, especially when debugging a remote target which may have a low bandwidth connection to the debugger. Address this by only reading one PTE per level.

I've only done the aarch64 page table walker because that's all that I needed, but in principle the other page table walkers could work in a similar way.

Signed-off-by: Peter Collingbourne <pcc@google.com>

Force-pushed from 1219d3f to 6e132f4
Bummer, I liked the xor trick :) I fixed a few style nitpicks in your version on my end, and I'll merge it now.
Thanks for your patience here. I owe you a few more PR reviews that I hope to get to in the coming days.
libdrgn: aarch64: Rework page table walker to only read one PTE per level
The current page table walker will on average read around half of the entire page table for each level. This is inefficient, especially when debugging a remote target which may have a low bandwidth connection to the debugger. Address this by only reading one PTE per level.
I've only done the aarch64 page table walker because that's all that I needed, but in principle the other page table walkers could work in a similar way.