use vtzero instead of libprotobuf #624

Closed · wants to merge 91 commits

Conversation
@cldellow (Contributor) commented Dec 27, 2023

Replaced by #625

This is meant to be merged after #604. Until then, you can see the vtzero-specific diffs here: https://github.com/cldellow/tilemaker/compare/lua-interop...cldellow:vtzero?expand=1 (I can rebase this against something other than lua-interop, if needed.)

Switch from libprotobuf to vtzero, and fix a few places where we copied the geometries during writing.

Compared to the lua-interop branch, this branch is about 7% faster at processing GB: 1m43s vs 1m51s.

I don't use --merge, so my testing of the --merge mode was somewhat artificial: I built an mbtiles file, then built it again with --merge and checked that features appeared to be duplicated.

Currently, Tilemaker uses member functions for interop:

```lua
function node_function(node)
    node:Layer(...)
end
```

This PR changes Tilemaker to use global functions:

```lua
function node_function()
    Layer(...)
end
```

The chief rationale is performance. Every member function call needs to
push an extra pointer onto the stack when crossing the Lua/C++ boundary.

Kaguya serializes this pointer as a Lua userdata. That means every
call into Lua has to malloc some memory, and every call back from Lua
has to dereference through this pointer.

And there are a lot of calls! For OMT on the GB extract, I counted
~1.4B calls from Lua into C++.

A secondary rationale is that a global function is a bit more honest.
A user might believe that this is currently permissible:

```lua
last_node = nil
function node_function(node)
    if last_node ~= nil then
        -- do something with last_node
    end

    -- save the current node for later, for some reason
    last_node = node
end
```

But in reality, the OSM objects we pass into Lua don't behave quite
like Lua objects. They're backed by OsmLuaProcessing, which will move
on, invalidating whatever reference the user thinks they've got.

This PR gives a noticeable decrease in reading time for me, measured
with the OMT profile on GB, on a 16-core computer:

Before:
```
real	1m28.230s
user	19m30.281s
sys	0m29.610s
```

After:
```
real	1m21.728s
user	17m27.150s
sys	0m32.668s
```

The tradeoffs:

- anyone with a custom Lua profile will need to update it, although the
  changes are fairly mechanical

- Tilemaker now reserves several functions in the global namespace,
  causing the potential for conflicts.

Building a std::map for tags is somewhat expensive, especially when
we know that the number of tags is usually quite small.

Instead, use a custom structure that does a crappy-but-fast hash
to put the keys/values in one of 16 buckets, then linear search
the bucket.
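
A minimal sketch of the idea (the type and hash below are illustrative, not
the PR's actual code): a deliberately cheap hash picks one of 16 buckets, and
lookups linear-scan within the bucket, which is cheap because objects rarely
carry more than a handful of tags.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical structure, not the PR's actual type.
struct TagMap {
	std::vector<std::pair<std::string, std::string>> buckets[16];

	// Crappy-but-fast hash: mix the length with the first and last bytes.
	// Deliberately avoids touching every character of the key.
	static std::size_t bucketOf(const std::string& key) {
		std::size_t h = key.size();
		if (!key.empty())
			h = h * 31 + (unsigned char)key.front() * 7 + (unsigned char)key.back();
		return h & 15;
	}

	void add(std::string key, std::string value) {
		buckets[bucketOf(key)].emplace_back(std::move(key), std::move(value));
	}

	const std::string* get(const std::string& key) const {
		for (const auto& kv : buckets[bucketOf(key)])
			if (kv.first == key)
				return &kv.second;
		return nullptr;
	}
};
```
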

For GB, before:
```
real 1m11.507s
user 16m49.604s
sys 0m17.381s
```

After:
```
real	1m9.557s
user	16m28.826s
sys	0m17.937s
```

Saving 2 seconds of wall clock and 20 seconds of user time doesn't
seem like much, but (a) it's not nothing and (b) having the tags
in this format will enable us to thwart some of Lua's defensive
copies in a subsequent commit.

A note about the hash function: hashing each letter of the string
using boost::hash_combine eliminated the time savings.

We (ab?)use kaguya's parameter serialization machinery. Rather than
take a `std::string`, we take a `KnownTagKey` and teach Lua how to
convert a Lua string into a `KnownTagKey`.

This avoids the need to do a defensive copy of the string when coming
from Lua.
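
If I understand kaguya correctly, the customization point for this is a
`lua_type_traits` specialization. A hedged sketch, where `findKnownKey` is a
hypothetical stand-in for the real interning lookup and the `KnownTagKey`
internals are guessed:

```cpp
#include <cstddef>
#include <cstdint>
#include "kaguya/kaguya.hpp" // path as in tilemaker's vendored copy

// Hypothetical: resolve a (pointer, length) view against the table of
// tag keys we've already interned.
uint16_t findKnownKey(const char* data, size_t size);

struct KnownTagKey {
	uint16_t index; // which known key this Lua string resolved to
};

namespace kaguya {
template<> struct lua_type_traits<KnownTagKey> {
	typedef KnownTagKey get_type;

	static bool strictCheckType(lua_State* l, int index) {
		return lua_type(l, index) == LUA_TSTRING;
	}
	static bool checkType(lua_State* l, int index) {
		return lua_isstring(l, index) != 0;
	}
	static get_type get(lua_State* l, int index) {
		// Read the Lua string in place; no std::string is constructed,
		// so there's no defensive copy and no malloc.
		size_t size = 0;
		const char* buffer = lua_tolstring(l, index, &size);
		return KnownTagKey{ findKnownKey(buffer, size) };
	}
};
}
```
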

It provides a modest boost:

```
real 1m8.859s
user 16m13.292s
sys 0m18.104s
```

Most keys are short enough to fit in the small-string optimization, so
this doesn't help us avoid mallocs. An exception is `addr:housenumber`,
which, at 16 bytes, exceeds g++'s limit of 15 bytes.
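
A quick way to see the g++ threshold (illustrative, not part of the PR):

```cpp
#include <iostream>
#include <string>

int main() {
	// libstdc++'s std::string keeps up to 15 bytes inline; capacity() of
	// an empty string reveals the size of that inline buffer.
	std::string s;
	std::cout << s.capacity() << "\n";                          // 15 with g++
	std::cout << std::string("addr:housenumber").size() << "\n"; // 16: one too many
}
```
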

It should be possible to also apply a similar trick to the `Attribute(...)`
functions, to avoid defensive copies of strings that we've seen as keys
or values.

After:

```
real	1m8.124s
user	16m6.620s
sys	0m16.808s
```

Looks like we're solidly into diminishing returns at this point.

For the planet, we need 1.3B output objects at 12 bytes each, so ~15GB
of RAM. For GB, ~0.3% of objects are visible at low zooms.

I noticed in previous planet runs that fetching the objects for tiles in
the low zooms was quite slow - I think it's because we're scanning 1.3B
objects each time, only to discard most of them. Now we'll only be
scanning ~4M objects per tile, which is still an absurd number, but
should mitigate most of the speed issue without having to properly
index things.
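
A minimal sketch of the approach, with hypothetical names: keep a small
duplicate list of the low-zoom-visible objects and consult it when rendering
low-zoom tiles, instead of scanning every object.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the real 12-byte packed entry.
struct OutputObject { uint64_t packed; };

struct ObjectLists {
	std::vector<OutputObject> all;      // every output object
	std::vector<OutputObject> lowZoom;  // small duplicated subset (~0.3%)

	void add(const OutputObject& o, unsigned minZoom) {
		all.push_back(o);
		if (minZoom <= 5)
			lowZoom.push_back(o);
	}

	// When rendering, pick the small list for low-zoom tiles.
	const std::vector<OutputObject>& candidates(unsigned zoom) const {
		return zoom <= 5 ? lowZoom : all;
	}
};
```
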

This will also help us maintain performance for memory-constrained
users, as we won't be scanning all 15GB of data on disk, just a smaller
~45MB chunk.

For Points stored via Layer(...) calls, store the node ID in the
OSM store, unless `--materialize-geometries` is present.

This saves ~200MB of RAM for North America, so perhaps 1 GB for the
planet, if the planet has similar characteristics to NA.

Also fix the OSM_ID(...) macro - it was lopping off many more bits
than needed, due to some previous experiments. Now that we want to track
nodes, we need at least 34 bits.
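
For concreteness, the arithmetic, with the mask spelled out (the real
OSM_ID macro may differ):

```cpp
#include <cstdint>

// Node IDs had passed 2^33 (~8.6B) by late 2023, so 34 bits is the
// minimum for a lossless ID; masking to fewer bits silently collides
// distinct nodes.
constexpr uint64_t OSM_ID_BITS = 34;
constexpr uint64_t OSM_ID_MASK = (1ULL << OSM_ID_BITS) - 1;

constexpr uint64_t packNodeId(uint64_t osmId) {
	return osmId & OSM_ID_MASK; // keep the low 34 bits
}
```
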

This may pose a problem down the road when we try to address thrashing.
The mechanism I hoped to use was to divide the OSM stores into multiple
stores covering different low zoom tiles. Ideally, we'd be able to
recall which store to look in -- but we only have 36 bits, we need 34
to store the Node ID, so that leaves us with 1.5 bits => can divide into
3 stores.

Since the node store for the planet is 44GB, dividing into 3 stores
doesn't give us very much headroom on a 32 GB box. Ah well, we can
sort this out later.

On g++, this reduces the size from 48 bytes to 34 bytes.

There aren't _that_ many attribute pairs, even on the planet scale, but
this plus a better encoding of string attributes might save us ~2GB at
the planet level, which is meaningful for a 32GB box.

Not used by anything yet. Given Tilemaker's limited needs, we can get
away with a stripped-down string class that is less flexible than
std::string, in exchange for memory savings.

The key benefit: 16 bytes, not 32 bytes (g++) or 24 bytes (clang).

When it does allocate (for strings longer than 15 bytes), it allocates
from a pool so there's less per-allocation overhead.
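
A sketch of one possible 16-byte layout, with `allocateFromPool` as a
hypothetical pool helper (the PR's actual class may differ):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical: copy the bytes into a shared arena, return an offset.
uint32_t allocateFromPool(const std::string& s);

// Layout (16 bytes total):
//   short string:  byte 0 = length (<= 15), bytes 1..15 = characters
//   long string:   byte 0 = 0xFF, bytes 4..7 = pool offset,
//                  bytes 8..11 = length
class PooledString {
	uint8_t storage[16];
public:
	explicit PooledString(const std::string& s) {
		if (s.size() <= 15) {
			storage[0] = (uint8_t)s.size();
			std::memcpy(storage + 1, s.data(), s.size());
		} else {
			storage[0] = 0xFF;
			uint32_t offset = allocateFromPool(s);
			uint32_t length = (uint32_t)s.size();
			std::memcpy(storage + 4, &offset, 4);
			std::memcpy(storage + 8, &length, 4);
		}
	}
	size_t size() const {
		if (storage[0] != 0xFF)
			return storage[0];
		uint32_t length;
		std::memcpy(&length, storage + 8, 4);
		return length;
	}
};
static_assert(sizeof(PooledString) == 16, "must stay 16 bytes");
```
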
...I'm going to replace the string implementation, so let's have some
backstop to make sure I don't break things.

Break dependency on AttributePair; just work on std::string.

...this will be useful for doing map lookups when testing if an
AttributePair has already been created with the given value.

AttributePair has now been trimmed from 48 bytes to 18 bytes. There are
40M AttributeSets for the planet. That suggests there's probably ~30M AttributePairs,
so hopefully this is a savings of ~900MB at the planet level.

Runtime doesn't seem affected.

There's a further opportunity for savings if we can make more strings
qualify for the short string optimization. Only about 40% of strings
fit in the 15-byte short string optimization.

Of the remaining 60%, many are Latin-alphabet title-cased strings like
`Wellington Avenue` -- this could be encoded using 5 bits per letter,
saving us an allocation.

Even in the most optimistic case where:

- there are 30M AttributePairs
- of these, 90% are strings (= 27M)
- of these, 60% don't fit in SSO (= 16M)
- of these, we can make 100% fit in SSO

...we only save about 256MB at the planet level, but at some significant
complexity cost. So probably not worth pursuing at the moment.

When doing the planet, especially on a box with limited memory, there
are long periods with no output. Show some output so the user doesn't
think things are hung.

This also might be useful in detecting perf regressions more granularly.

When using --store, deque is nice because growing doesn't require
invalidating the old storage and copying it to a new location.

However, it's also bad, because deque allocates in 512-byte chunks,
which causes each 4KB OS page to have data from different z6 tiles.

Instead, use our own container that tries to get the best of both worlds.

Writing a random access iterator is new for me, so I don't trust this
code that much. The saving grace is that the container is very limited,
so errors in the iterator implementation may not get exercised in
practice.
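
A condensed sketch of such a container (chunk size and API are illustrative;
the random access iterator is omitted):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Like deque, growth allocates a new chunk and never moves old elements,
// so references stay valid; unlike libstdc++'s 512-byte deque blocks,
// chunks here are large, so consecutive writes for one z6 tile land on
// contiguous OS pages.
template <class T, std::size_t CHUNK_BYTES = 4096 * 256>
class ChunkedVector {
	static constexpr std::size_t PER_CHUNK = CHUNK_BYTES / sizeof(T);
	std::vector<std::unique_ptr<T[]>> chunks;
	std::size_t count = 0;

public:
	void push_back(const T& value) {
		if (count == chunks.size() * PER_CHUNK)
			chunks.push_back(std::unique_ptr<T[]>(new T[PER_CHUNK]));
		chunks[count / PER_CHUNK][count % PER_CHUNK] = value;
		++count;
	}
	T& operator[](std::size_t i) {
		return chunks[i / PER_CHUNK][i % PER_CHUNK];
	}
	std::size_t size() const { return count; }
	// A conforming random access iterator would wrap (chunk, offset);
	// omitted here for brevity.
};
```
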
This adds three methods to the stores:

- `shard()` returns which shard you are
- `shards()` returns how many shards total
- `contains(shard, id)` returns whether or not shard N has an item with
  id X

SortedNodeStore/SortedWayStore are not implemented yet; that'll come in
a future commit.

This will allow us to create a `ShardedNodeStore` and `ShardedWayStore`
that contain N stores. We will try to ensure that each store has data
that is geographically close to each other.

Then, when reading, we'll do multiple passes of the PBF to populate each store.
This should let us reduce the working set used to populate the stores,
at the cost of additional linear scans of the PBF. Linear scans of disk
are much less painful than random scans, so that should be a good trade.
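
Illustrative signatures for the three methods (not the PR's exact
declarations):

```cpp
#include <cstddef>
#include <cstdint>

using NodeID = uint64_t; // assumption: IDs are 64-bit

class Store {
public:
	virtual std::size_t shard() const = 0;  // which shard this store is
	virtual std::size_t shards() const = 0; // how many shards in total
	// does shard N contain an item with this id?
	virtual bool contains(std::size_t shard, NodeID id) const = 0;
	virtual ~Store() {}
};

// During reading, pass P over the PBF only inserts items whose ID maps
// to shard P, e.g.:
//   for (size_t pass = 0; pass < store.shards(); pass++)
//     ... skip a node unless store.contains(pass, id) ...
```
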
I'm going to rejig the innards of this class, so let's have some tests.

In order to shard the stores, we need to have multiple instances
of the class.

Two things currently block this: file-level atomics and thread-locals.

Moving the atomics to the class is easy.

Making the thread-locals per-class will require an approach similar
to that adopted in
https://github.com/systemed/tilemaker/blob/52b62dfbd5b6f8e4feb6cad4e3de86ba27874b3a/include/leased_store.h#L48,
where we have a container that tracks the per-class data.
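
A simplified sketch of the per-instance pattern (the linked leased_store.h
approach is more sophisticated; names here are hypothetical):

```cpp
#include <map>
#include <vector>

// A bare `thread_local` member is shared by every instance of the class,
// which breaks once we have N sharded stores. Instead, each thread keeps
// a map from store instance to that thread's state.
class SortedNodeStore {
	struct ThreadState {
		std::vector<char> arena; // per-thread scratch buffers, etc.
	};

	ThreadState& threadState() {
		// One map per thread: store instance -> this thread's state.
		thread_local std::map<const SortedNodeStore*, ThreadState> states;
		return states[this];
	}
};
```
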
Still only supports 1 class, but this is a step along the path.

D'oh, this "worked" due to two bugs cancelling each other:

(a) the code to find things in the low zoom list never found anything,
    because it assumed a base z6 tile of 0/0

(b) we weren't returning early, so the normal code still ran

Rejigged to actually do what I was intending.

This has no performance impact, as we never put anything in the 7th
shard, and so we skip the 7th pass in the ReadPhase::Ways and
ReadPhase::Relations phases.

The benefit is only to avoid emitting a noisy log about how the 7th store
has 0 entries in it.

Timings with 6 shards on Vultr's 16-core machine here: https://gist.github.com/cldellow/77991eb4074f6a0f31766cf901659efb

The new peak memory is ~12.2GB.

I am a little perplexed -- the runtime on a 16-core server was
previously:

```
$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real	195m7.819s
user	2473m52.322s
sys	73m13.116s
```

But with the most recent commits on this branch, it was:

```
real	118m50.098s
user	1531m13.026s
sys	34m7.252s
```

This is incredibly suspicious. I also tried re-running commit
bbf0957, and got:

```
real	123m15.534s
user	1546m25.196s
sys	38m17.093s
```

...so I can't explain why the earlier runs took 195 min.

Ideas:

- the planet changed between runs, and a horribly broken geometry was
  fixed

- Vultr gives quite different machines for the same class of server

- perhaps most likely: I failed to click "CPU-optimized" when picking
  the earlier server, and got a slow machine the first time, and a fast
  machine the second time. I'm pretty sure I paid the same $, so I'm
  not sure I believe this.

I don't think I really believe that a 33% reduction in runtime is
explained by any of those, though. Anyway, just another thing to
be befuddled by.

On a 48-core machine, I still see lots of lock contention.
AttributeStore::add is one place.

Add a thread-local cache that can be consulted without taking the shared
lock. The intuition here is that there are 1.3B objects, and 40M
attribute sets. Thus, on average, an attribute set is reused 32 times.

However, the average is probably misleading -- the distribution is likely
not uniform: the median attribute set is probably reused 1-2 times, while
some exceptional attribute sets (e.g. `natural=tree`) are reused thousands
of times.

For GB on a 16-core machine, this avoids 27M of 36M locks.
On a 48-core machine, this phase currently achieves only 400% CPU usage,
I think due to these locks.
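
A sketch of the shape of the cache (types simplified; `AttributeSet` below
stands in for the real type):

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

using AttributeSet = std::string; // stand-in for the real attribute-set type

// Each thread first checks its own map of recently-seen sets; since the
// average set is reused ~32 times, most calls return a cached index
// without touching the shared lock.
class AttributeStore {
	std::mutex mutex;
	std::map<AttributeSet, uint32_t> sets; // canonical set -> index

public:
	uint32_t add(const AttributeSet& set) {
		thread_local std::map<AttributeSet, uint32_t> cache;
		auto cached = cache.find(set);
		if (cached != cache.end())
			return cached->second; // hot path: no lock taken

		std::lock_guard<std::mutex> lock(mutex);
		auto result = sets.emplace(set, (uint32_t)sets.size());
		cache[set] = result.first->second; // remember for next time
		return result.first->second;
	}
};
```
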
This reverts commit e872073.

This didn't seem to be a win - less system time, but more overall CPU
time.

Let's fix the bigger contention issues, and consider revisiting this.

In fact, AttributePairStore::getPair is only called by
removePairWithKey. We could rejig OsmLuaProcessing to do this filtering
prior to creating an AttributeSet - then there's no need for locks at
all.

I did some experiments on a Hetzner 48-core box with 192GB of RAM:

--store, materialize geometries:
real 65m34.327s
user 2297m50.204s
sys 65m0.901s

The process often failed to use 100% of CPU -- if you naively divide
(user+sys)/real you get ~36, whereas the ideal would be ~48.

Looking at stack traces, it seemed to coincide with calls to Boost's
rbtree_best_fit allocator.

Maybe:

- we're doing disk I/O, and it's just slower than recomputing the geometries
- we're using the Boost mmap library suboptimally -- maybe there's
  some other allocator we could be using. I think we use the mmap
  allocator like a simple bump allocator, so I don't know why we'd need
  a red-black tree

--store, lazy geometries:
real 55m33.979s
user 2386m27.294s
sys 23m58.973s

Faster, but still some overhead ((user+sys)/real => ~43).

no --store, materialize geometries: OOM

no --store, lazy geometries (used 175GB):
real 51m27.779s
user 2306m25.309s
sys 16m34.289s

This was almost 100% CPU ((user+sys)/real => ~45).

From this, I infer:

- `--store` should always default to lazy geometries in order to
  minimize the I/O burden

- `--materialize-geometries` is a good default for non-store usage,
  but it's still useful to be able to override and use lazy geometries,
  if it then means you can fit the data entirely in memory.

Switch from libprotobuf to vtzero, and fix a few places where we copied
the geometries during writing.

GB, lua-interop:

real	1m51.572s
user	26m45.201s
sys	0m16.575s

GB, this branch:

real	1m43.007s
user	25m0.658s
sys	0m15.263s

I haven't looked too closely at the output tiles yet, but they seem
correct.

Still todo: revive support for --merge
@cldellow marked this pull request as draft December 27, 2023
@systemed mentioned this pull request Dec 28, 2023
@cldellow (Contributor, Author)
Bah, git isn't very happy about merging master's squash commits. I'm going to re-open this against master with the relevant commits cherry-picked.
