Provide random-access methods to PBF reader #367
Conversation
Interesting work. I am not sure how suitable this is for Osmium, but it is certainly something that we can discuss. The main question for me is: What use cases does this support, and are they worth the extra effort? The getid use case is rather niche. Anybody who does more than a few queries like this will use some kind of database, like OSMExpress. There are several implementations of specialised databases like this.

What I do like about this approach is that it still works with the PBF file and doesn't need an extra index on disk or so. That puts it into the realm of what could fit into libosmium. But that also limits its usefulness somewhat, because you have to build that internal index every time, which needs time, but from your experiments it seems that isn't a big issue. (I am wondering about that a bit, because PBF blocks are usually gzipped and need to be decoded to look at the contents, which does take time; that isn't something that I'd want to do twice if I can avoid it.)

Osmium already supports reading only the PBF blocks needed based on the object type (node, way, or relation), and that is used often. It doesn't remember where the blocks for the types are in the file though, because doing that didn't fit into the current IO architecture.

Regarding your ideas on possible changes to the PBF file format: In theory this is backwards compatible due to the nature of the protobuf-based file format. In practice though it is not that easy. There is at least one popular tool written in C that can only understand OSM PBF files as they are now and doesn't cope well with changes. Yes, technically it is not doing the right thing, but practically nobody wants to generate PBF files that are incompatible with that tool. I have run into that problem before. (We could still add something and make it optional for use, though.)
The use cases are simple: …

My particular use case that led me to write this is: "Scan all the objects, check some weird properties (that cannot be done through Overpass), and for a handful of objects try to determine any location related to this object." So that's a linear scan, plus a few random lookups. Doing a linear scan for the random lookups is very expensive, and doesn't always answer the question (i.e. relations). I tried doing it with multiple linear passes, or abusing the …

Converting the PBF into a different database format seems silly: It adds an hour and consumes a lot of space. A linear scan plus random accesses should be finished before the database format conversion is done. Why do extra work when the data is already right there?

It seems that your biggest worry is "the extra effort": That's why I focused so much on the 68 ms to index the entire planet. Building the internal index "every time" is only true for program startup – everything in osmium is slower than 68 ms, and that could probably be optimized down further. (In fact, the 68 ms figure is a lie, because I ran the entire "io_test_pbf_randomaccess" test suite, so the 68 ms also include the other tests in the file.)
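To make the intended pattern concrete, here is a minimal sketch: the linear pass uses the existing libosmium reader, while the random-access part is shown only as a comment, since its interface is exactly what this PR is about (the names `RandomAccessReader`/`get_relation` there are hypothetical, and the "weird property" check is a stand-in):

```cpp
#include <iostream>
#include <vector>

#include <osmium/handler.hpp>
#include <osmium/io/pbf_input.hpp>
#include <osmium/osm/entity_bits.hpp>
#include <osmium/osm/relation.hpp>
#include <osmium/osm/types.hpp>
#include <osmium/visitor.hpp>

// Pass 1: the usual linear scan, collecting a handful of interesting IDs.
struct WeirdPropertyHandler : public osmium::handler::Handler {
    std::vector<osmium::object_id_type> interesting;

    void relation(const osmium::Relation& rel) {
        if (rel.id() % 1000000 == 0) { // stand-in for the real "weird property"
            interesting.push_back(rel.id());
        }
    }
};

int main() {
    osmium::io::Reader reader{"planet.osm.pbf", osmium::osm_entity_bits::relation};
    WeirdPropertyHandler handler;
    osmium::apply(reader, handler);
    reader.close();

    // Pass 2: a few random lookups instead of further linear scans.
    // RandomAccessReader and get_relation() are hypothetical names:
    //
    // for (const auto id : handler.interesting) {
    //     const osmium::Relation& rel = random_access.get_relation(id);
    //     ... resolve member locations ...
    // }
    std::cout << "found " << handler.interesting.size() << " candidates\n";
}
```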
You're right, that operation would take a lot of time, so the code simply doesn't do that when building the index. In fact, simply reading the compressed data also takes a significant amount of time; that's why the code doesn't even read the data at all when building the index! And also, that's why I added a seek function to …

(If one were to end up reading every single block of the file, this would indeed be slower than a linear scan. But that's not what random access is for.)
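To illustrate why building the index is cheap, here is my own minimal sketch of the file-format handling (not the code from this PR): an OSM PBF file is a sequence of [4-byte big-endian length][BlobHeader][Blob], and the BlobHeader contains the size of the Blob, so one can hop from block start to block start without ever decompressing anything:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Read a protobuf varint from [p, end).
static uint64_t read_varint(const unsigned char*& p, const unsigned char* end) {
    uint64_t value = 0;
    int shift = 0;
    while (p < end) {
        const unsigned char byte = *p++;
        value |= uint64_t(byte & 0x7fU) << shift;
        if ((byte & 0x80U) == 0) {
            break;
        }
        shift += 7;
    }
    return value;
}

// Extract the datasize field (field 3, varint) from a BlobHeader message.
// BlobHeader only has: type = 1 (string), indexdata = 2 (bytes), datasize = 3.
static uint64_t parse_datasize(const std::vector<unsigned char>& header) {
    const unsigned char* p = header.data();
    const unsigned char* const end = p + header.size();
    while (p < end) {
        const uint64_t key = read_varint(p, end);
        if ((key >> 3) == 3 && (key & 7) == 0) {
            return read_varint(p, end); // found datasize
        }
        if ((key & 7) == 0) {
            read_varint(p, end);        // skip varint field
        } else if ((key & 7) == 2) {
            p += read_varint(p, end);   // skip length-delimited field
        } else {
            break;                      // unexpected wire type in BlobHeader
        }
    }
    return 0;
}

// Hop from block start to block start, storing only byte offsets. Nothing is
// decompressed here. (A real implementation would use 64-bit seeks, e.g.
// fseeko, for planet-sized files.)
std::vector<uint64_t> index_block_starts(std::FILE* file) {
    std::vector<uint64_t> starts;
    uint64_t offset = 0;
    unsigned char len_buf[4];
    while (std::fread(len_buf, 1, 4, file) == 4) {
        const uint32_t header_len = (uint32_t(len_buf[0]) << 24) |
                                    (uint32_t(len_buf[1]) << 16) |
                                    (uint32_t(len_buf[2]) << 8) |
                                     uint32_t(len_buf[3]);
        std::vector<unsigned char> header(header_len);
        if (std::fread(header.data(), 1, header_len, file) != header_len) {
            break;
        }
        const uint64_t blob_len = parse_datasize(header);
        starts.push_back(offset);
        std::fseek(file, static_cast<long>(blob_len), SEEK_CUR);
        offset += 4 + header_len + blob_len;
    }
    return starts;
}
```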
I agree! That's why I mentioned a "caching" wrapper that keeps around the last X decoded blocks.
Ah, that is unfortunate. But that would have been a separate topic anyway; random access doesn't need any new or unusual features. (It does need some form of ordering, which is already the case.)
I don't understand. How can you build an index without knowing the contents of the data? The index does a lookup from ID to the block the object with that ID is in, does it not? But for that I have to know what ID range is in that index, don't I?
The index contains the block starts, i.e. their byte offsets in the file, not necessarily their logical contents. (The first item in a block will be cached if the block is ever loaded, but it is not known initially.) Let's take a look at each entry (simplified for demonstration):
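Something like the following; the exact fields are my guess from this discussion (the real `pbf_block_start` in the prototype may differ), but with padding it comes out at the 24 bytes quoted further down:

```cpp
#include <cstdint>

#include <osmium/osm/item_type.hpp>
#include <osmium/osm/types.hpp>

// Guessed layout of one index entry. Only the byte offset (and block size)
// are known up front; the "first item" fields are filled in lazily the
// first time the block is actually decoded.
struct pbf_block_start {
    uint64_t file_offset;                  // where the block starts in the file
    osmium::object_id_type first_item_id;  // cached on first decode, else unset
    uint32_t datasize;                     // compressed size of the block
    osmium::item_type first_item_type;     // node/way/relation, cached on first decode
};
// With alignment/padding this is 24 bytes per entry, i.e. under 1 MiB for the
// 42175 blocks of the downloaded planet.
```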
Maybe an example helps: Let's say we just opened a file, and it contains 10000 blocks. And now someone looks for node 123456. Because no …

This may seem at first like a lot of effort, but note that: …
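A rough sketch of that lazy binary search, with illustrative types only (`decode_first_id` stands in for "seek to the block, decompress it, look at the first object"):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative index entry: the first ID is unknown (-1) until the block
// has been decoded once. (A real version would compare (type, id) pairs,
// matching Sort.Type_then_ID.)
struct BlockStart {
    uint64_t file_offset;
    int64_t  first_id = -1;
};

// Find the block that may contain `id`, decoding as few blocks as possible.
// For 10000 blocks this decodes at most ~14 of them, even on a cold index.
std::size_t find_block(std::vector<BlockStart>& index, int64_t id,
                       const std::function<int64_t(const BlockStart&)>& decode_first_id) {
    std::size_t lo = 0;             // invariant: target is not before block lo
    std::size_t hi = index.size();  // invariant: target is before block hi
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (index[mid].first_id == -1) {
            index[mid].first_id = decode_first_id(index[mid]); // lazy decode + cache
        }
        if (index[mid].first_id <= id) {
            lo = mid; // first object of mid is not after id: look right
        } else {
            hi = mid; // first object of mid is after id: look left
        }
    }
    return lo;
}
```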
Does this answer your questions?
Okay, I understand you are creating the index lazily. That makes sense.

A small thing: You mentioned the planet having only 42175 blocks. That might be true for the planet you can download from OSM, which is not generated with libosmium. Libosmium will not store more than 8000 objects per block (another thing that could be improved in libosmium but is not easily done), so a planet generated by libosmium, say, when you update a downloaded planet, will have something like 1.2 million blocks. That's nearly 30 times your number. But it doesn't matter that much, because it is still only 30 MB for the index.

I want to come back to the use cases and the "extra effort". I think we misunderstood each other there. I am talking about programming and maintaining effort. Clearly you think the effort is worth it, you wouldn't do this otherwise. But I am thinking about which actual use cases this would help with, what the burden is on me to maintain this, and on the users to understand the different interfaces that would be available and all that.

There are some downsides to your approach: it only works with (sorted) PBF files, it isn't multithreaded as far as I understand it, and it has a new interface, which means the developer has to decide which interface to use for which use case. I want to better understand which actual typical use cases this makes better. And querying a single ID or a few IDs from a file isn't a use case that comes up much. A far more typical use case is for instance: Find all ways tagged with xyz, find all the member nodes, and output all that to a file. (Basically what … does.)

Having said all of that, I do think your approach has merit, even if it turns out that it isn't for everybody and every use case. So let's think about next practical steps here. First: As you mention, we have to get rid of the code duplication. I think it should be possible to factor out low-level code from the current PBF reader into free functions that can then be used by the (updated) current PBF reader and by your code. Whatever we do, we have to do something like this anyway, so we can bring this in in pieces small enough that I can review them. Then your other changes become much smaller. And you already have something you can work with in your code, albeit rather low-level. Then we have to think about the API for the new code, because once that's in a published libosmium version it is hard to change, taking into account possible later additions like caching and so on.
Turns out, `auto_grow::yes` causes significant runtime overhead due to all the memmoves.
I also expect 1.2 million blocks, and doing …
Of course, loading the pages into memory would make this slower; but since we only read the BlockHeaders, they fit in the filesystem cache.
That's a good point, and I'm aware of what a pain "legacy stuff" can be. I'll try to address some random points that first come to mind: …
So I'm not really worried about it becoming a pain to maintain, even if I suddenly vanish. Are there any particular pain points that you would like me to address?
Correct. It seems to me that expecting files to be sorted is not unusual; it is already assumed by default: …
Correct, the prototype isn't multithreaded at all, simply because I first want to get a single-threaded version working. I'm currently looking at timing profiles more closely, and it seems that indeed a lot of time is spent in zlib and simply decoding buffers. Not a big surprise, but I just wanted to see it with my own eyes. I think this approach can be multi-threaded, although it won't scale perfectly. Here's some free-form brainstorming on the topic of multi-threading (I haven't thought about this too deeply yet): …
I hope this convinces you that this is not a big disadvantage. Let's talk about the finer points of multi-threading later; I want to do it single-threaded first, as random access already gives a huge boost in speed for some use cases.
Well, yes, but the developer needs to make some decisions anyway. How about this: "Use the normal, linear-scan based Handler classes first. If this is not fast enough, and you know you are working with ordered PBFs, then consider using random-access."
You're right, a simple linear scan is still the best way to do a job of the type "just go through all objects and check some custom predicate". I don't think a filter for all buildings would benefit at all from random access, unless you already have a list of all the IDs at the start. Maybe it helps if I expand on my example "Scan all the objects, check some weird properties (that cannot be done through Overpass), and then for a handful of objects try to determine any location related to this object." The user currently has these options: …
I hope this finally shows why the "just do multiple passes" approach seems so tedious and wasteful to me in those cases. Of course, beyond a certain percentage of objects (>5%, I would guess), doing multiple passes is probably faster.
No, random access is not "always faster". Is that perhaps the misunderstanding here? I do not claim that random access is strictly better in all cases, just that it is faster (by large factors) for some reasonable use cases.
Splitting this into multiple, easy-to-review PRs sounds great! That's one of the reasons I created this PR in draft mode. I think I'll split it like this: …
The above commits also introduce a mechanism to pass auto-grow behavior through to the reader. Turns out, this isn't a good idea (it causes lots of memmoves, which consume several percent of the runtime), so it won't end up in any PR.
We have too many open "threads" here and it gets really confusing. I'll try to address some of the issues...
The problem is that when you write generic code like in osmium-tool, you don't know all these things beforehand. You have to either let the user decide through some command line option and/or write code to handle each case, adding some magic to figure out which approach is probably going to be faster and all that. The need to do all that limits the usefulness of having several approaches in practice.
You're right, just scrolling over my previous comment took way too long. I'll try to be more succinct.
As I said, the …
Let's just make fast random access available to everyone, and then you can use it to your heart's content. After all, that's the spirit of free software: using, understanding, improving, sharing. That's why I'm trying to get the improvements into libosmium. I'm getting the impression that you feel very negatively about it all; can you help me understand why? Surely you see the benefit of speeding up some types of operations, even if it doesn't affect other types of operations?
As mentioned, the problem is simply that this all means more work for me, and I have to take over the maintenance burden. It costs me a lot of time to understand what you want and to review pull requests. If you could demonstrate that your code will help with things I need, or that I think a lot of people will need, that would make me more interested. But I don't see that (yet).

That being said, I am happy to accommodate your use case if that can be done with limited effort on my part. So let's concentrate on the one thing that seems to get you the most for the least amount of effort on my part: Get the PBF low-level code somewhere else so that your alternative I/O method can use it, and get that into libosmium. And don't get sidetracked by a 1 or 2% performance improvement somewhere else.
Motivation: Why random-access?
Osmium is highly optimized for sequential access. That's awesome! Even for files with hundreds of megabytes, it is usually good enough to simply execute multiple linear scans to collect all the necessary data. That is pretty much what everyone does, from RelationManager to osmium getid. However, beyond a certain input size, and for a middling number of queries, a linear scan becomes impractically slow.
The current solution seems to be to generate (= expensive in time) and keep around huge index files (= expensive in space). This is a great approach when there is a huge number of queries and the PBF file changes comparatively rarely. However, it is not a good option when there are only a few, complex queries, e.g. when walking the OSM object graph in a weird way.
Hence: Random access! It answers a single query much faster than a linear scan (but slower than a fully-indexed file), and is much cheaper in time and space than handling a fully-indexed file (but needs slightly more up-front effort and is less efficient than a linear scan).
Or in short: Random access is Pareto-optimal, and in some number of cases a better choice than a file index or a linear scan. Please enjoy this terrible drawing of a Pareto graph: …
How does this work?
Random access exploits two separate properties:

- The block structure of PBF files: block starts can be found without decompressing the blocks themselves (see the indexing discussion above).
- The sort order `Sort.Type_then_ID`, which is an "optional feature". This way, one can look at a decoded block and immediately know whether a desired OSMObject should be expected in an earlier block or a later block.

Together, this enables in-memory binary search, O(log n). I'm sure I don't need to tell you, but for large n this is faster than O(n), and slower than O(1). With 72 GiB of data, this certainly makes a difference!
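For completeness, a sketch of checking the sort order up front from libosmium. Note the header option key is my assumption about how the PBF reader exposes the optional feature; verify it against your libosmium version:

```cpp
#include <iostream>

#include <osmium/io/header.hpp>
#include <osmium/io/pbf_input.hpp>

int main() {
    osmium::io::Reader reader{"planet.osm.pbf"};
    const osmium::io::Header header = reader.header();

    // Assumption: the PBF optional feature "Sort.Type_then_ID" is exposed
    // as the header option "sorting"; adjust the key if your version differs.
    if (header.get("sorting") == "Type_then_ID") {
        std::cout << "Sorted by type, then ID: random access is possible.\n";
    } else {
        std::cout << "Sort order not declared: fall back to a linear scan.\n";
    }

    reader.close();
}
```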
In theory, there exist even more sophisticated approaches, like:
… but I don't want to go there, at least not now. Osmium can already be sped up a lot, right now, with the existing and already-popular file features.
Is it really faster?
I haven't implemented everything I want to implement (and need for a different project), but the rough first data is very clear:

- The `pbf_block_start` struct is currently 24 bytes.
- `cachestats` reports `pages in cache: 43118/18629897 (0.2%)`, so that makes sense.
- `osmium getid planet-231002.osm.pbf n5301526002 -f opl` takes roughly 2 minutes on first and second run, and has a MaxRSS of about 800 MiB.

EDIT: I got the numbers for the "theoretical memory consumption" wrong: It's 24 bytes per entry, not 16 bytes. So the total is 989 KiB, not 659 KiB, for the entire planet.

This comparison isn't entirely fair, because the binary search isn't hooked up yet, and thus technically doesn't do the same job as `getid`. However, keep in mind that binary search will only need to read 16 blocks, and not 42175. Even if it ends up "only" 100 times faster, that would be a huge win.

Who?
I do not claim to be the first person to have this idea. See for example this (abandoned?) repository: https://github.com/peermaps/random-access-osm-pbf
I couldn't find any other implementation though; this seems to be a rare thing to do. I believe that many people might benefit if libosmium had this feature, especially users of `getid`.

What's next?
This is my first contribution to this repository, and it goes against one central philosophy that libosmium has: sequential access. I tried my best to follow the desired style, but I'm sure there is room for improvement. Let's talk about it!
That is why I created this PR in draft mode.
Plus, there are many things I want to change before making this "public" as in "publish to stable":

- … an `osmium-getid` clone from that. Just to show off the speed difference properly.

And in the longer future: …