be able to render the planet with 32GB of RAM #618

Merged

systemed merged 50 commits into systemed:master on Dec 28, 2023

Conversation

cldellow
Contributor

This PR lets Tilemaker build the planet on smaller machines.

On a Vultr 16-core, 32GB, 500GB SSD machine:

$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real	195m7.819s
user	2473m52.322s
sys	73m13.116s

Runtime for non-memory-constrained boxes isn't affected, e.g. on a Hetzner 48-core, 192 GB machine:

$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles
real	65m20.082s
user	2570m33.530s
sys	41m15.420s

On a $ basis, if you're renting a machine to do the work, it's cheaper to use a bigger box. But for folks who need to use what they already have, this may be a useful PR.

The changes are a mix of using less memory, spilling more things to disk, and thrashing less when things are backed by disk.

Using less memory:

  • ~1GB: extend --materialize-geometries to points -- points from Layer(...) can be looked up in the NodeStore. LayerAsCentroid(...) still needs the point store
  • ~1.5GB: rejig AttributePair
    • eliminate padding
    • use a union for the string and float values
    • replace std::string with PooledString
  • ~4GB: use a custom container (AppendVector) rather than a vector of vectors for storing OutputObjects
    • vector's grow-by-doubling behaviour results in some wasted memory. I initially tried to replace it with a deque, but deque's 512-byte allocation size results in poor locality on the disk

Spill more things to disk:

  • ~12GB: the OutputObjects now spill to disk when --store is used

Thrash less:

  • materialize the list of low-zoom objects, so that we only scan the list of 1.3B output objects a single time, not 1,365 times
  • compute the set of tiles with objects simultaneously for all zooms, so that we only scan the list of 1.3B output objects a single time, not 15 times
  • when --shard-stores is set, split the NodeStore and WayStore into 7 stores that cover different parts of the globe
    • the idea is to have roughly equal-sized splits in terms of nodes/ways/relations; I started with a best guess, then iterated a couple of times based on memory usage reported by the stores: https://geojson.io/#id=gist:cldellow/00d9d9d627494c522c31fc5a63909749
    • in this mode, ReadPhase::Ways will run 7 times, populating a single WayStore on each pass. Only those ways whose first node is in the corresponding NodeStore get populated. Because nodes in ways are generally geographically near each other, we'll mostly be accessing a single NodeStore to process the way. That NodeStore fits into memory for the duration of the pass, avoiding disk I/O. (See the sketch after this list.)
    • ReadPhase::Relations behaves similarly, using the ID of the first way to decide whether to process the relation.
    • when writing, since we group by z6 tile, we'll have long runs that use the same stores, which means we'll only need to do new disk I/O when the writer starts a new region
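
To make the per-pass filtering concrete, here is a minimal sketch. `WayInput`, `NodeStore` and `readWaysSharded` are illustrative stand-ins, not tilemaker's actual types; only the `contains(shard, id)` call reflects the store API added in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using NodeID = uint64_t;

struct WayInput { std::vector<NodeID> nodeRefs; };   // a way as decoded from the PBF

struct NodeStore {
    // contains(shard, id): does shard `shard` hold node `id`? (stubbed here)
    bool contains(size_t shard, NodeID id) const { (void)shard; (void)id; return true; }
};

// ReadPhase::Ways runs once per shard; a way is handled in the pass whose
// NodeStore owns the way's first node, so each pass mostly touches a single
// store that fits in memory.
void readWaysSharded(const std::vector<WayInput>& pbfWays,
                     const std::vector<NodeStore*>& nodeStores) {
    for (size_t pass = 0; pass < nodeStores.size(); pass++) {
        for (const WayInput& way : pbfWays) {
            if (way.nodeRefs.empty()) continue;
            if (!nodeStores[pass]->contains(pass, way.nodeRefs.front()))
                continue;   // a different pass will pick this way up
            // ... resolve node coordinates from nodeStores[pass] and store the way ...
        }
    }
}
```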

Potential future improvements:

  • We still need ~14GB of RAM to read everything. It might be worthwhile to try to account for all of it; 14GB feels excessive. Possible culprits: the protobuf reader (~160MB/core, I think), the attribute store and friends, and the r-tree index for large items.
  • The sharding is tuned for the planet on a 32GB box. Being able to dynamically pick the shards based on the bounding box and actual memory available could be useful.
  • The runtime benefit of multiple passes for relations is thwarted a bit by straggler relations that take an abnormally long time to process (Antarctica, Hudson Bay, etc). If we could cheaply identify the blocks that have such relations, we could start processing them earlier, in the hopes that they'd be done by the time we were done with the other relations.
    • ...actually, maybe it's the boost thread pool more generally? You see a similar effect when reading ways. I don't know how it works under the covers - maybe threads grab a batch of tasks at once, and you end up with a single thread hoarding some work items while the rest of the threads starve. If that's the case, a task-stealing approach might get better utilization towards the end of the work queue.

These are mostly smaller issues that can happily be ignored forever; I just wanted to write them down so I can forget about them.

For the planet, we need 1.3B output objects, 12 bytes per, so ~15GB
of RAM.
For GB, ~0.3% of objects are visible at low zooms.

I noticed in previous planet runs that fetching the objects for tiles in
the low zooms was quite slow - I think it's because we're scanning 1.3B
objects each time, only to discard most of them. Now we'll only be
scanning ~4M objects per tile, which is still an absurd number, but
should mitigate most of the speed issue without having to properly
index things.

This will also help us maintain performance for memory-constrained
users, as we won't be scanning all 15GB of data on disk, just a smaller
~45MB chunk.
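
As a rough sketch of what "materialize the list" means here (the field and function names below are made up for illustration, not the real OutputObject):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct OutputObject { uint8_t minZoom; /* ... */ };   // illustrative, not the real layout

// One linear scan over all output objects; low-zoom tiles then only consult
// the much smaller materialized list instead of rescanning everything.
std::vector<size_t> materializeLowZoomIndexes(const std::vector<OutputObject>& objects,
                                              uint8_t lowZoomCutoff /* e.g. z6 */) {
    std::vector<size_t> lowZoom;
    for (size_t i = 0; i < objects.size(); i++)
        if (objects[i].minZoom <= lowZoomCutoff)
            lowZoom.push_back(i);
    return lowZoom;   // ~0.3% of objects for GB, per the numbers above
}
```
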
For Points stored via Layer(...) calls, store the node ID in the
OSM store, unless `--materialize-geometries` is present.

This saves ~200MB of RAM for North America, so perhaps 1 GB for the
planet if NA has similar characteristics to the planet.

Also fix the OSM_ID(...) macro - it was lopping off many more bits
than needed, due to some previous experiments. Now that we want to track
nodes, we need at least 34 bits.

This may pose a problem down the road when we try to address thrashing.
The mechanism I hoped to use was to divide the OSM stores into multiple
stores covering different low zoom tiles. Ideally, we'd be able to
recall which store to look in -- but we only have 36 bits, we need 34
to store the Node ID, so that leaves us with 1.5 bits => can divide into
3 stores.

Since the node store for the planet is 44GB, dividing into 3 stores
doesn't give us very much headroom on a 32 GB box. Ah well, we can
sort this out later.
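
For reference, the 34-bit arithmetic looks like this; a sketch only, not the actual OSM_ID macro:

```cpp
#include <cstdint>

// Node IDs currently exceed 2^33, so at least 34 bits are needed to store
// them; masking keeps only the low 34 of the available 36 bits.
constexpr uint64_t OSM_ID_BITS = 34;
constexpr uint64_t OSM_ID_MASK = (1ULL << OSM_ID_BITS) - 1;   // 0x3FFFFFFFF

constexpr uint64_t osmIdLow34(uint64_t id) { return id & OSM_ID_MASK; }

static_assert(osmIdLow34(0xFFFFFFFFFFULL) == 0x3FFFFFFFFULL, "keeps only 34 bits");
```
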
On g++, this reduces the size from 48 bytes to 34 bytes.

There aren't _that_ many attribute pairs, even on the planet scale, but
this plus a better encoding of string attributes might save us ~2GB at
the planet level, which is meaningful for a 32GB box.
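
A sketch of the kind of layout change this refers to. The field names, the union members and the resulting size are guesses to illustrate the padding-elimination + union idea; the real AttributePair (48 -> 34 bytes here, later 18 bytes) differs in detail.

```cpp
#include <cstdint>

struct PooledString { char storage[16]; };   // stand-in for the real class

struct AttributePairSketch {
    union {
        PooledString stringValue;   // used when the pair holds a string
        float floatValue;           // used when it holds a float
        bool boolValue;             // used when it holds a bool
    };
    uint16_t keyIndex;              // index into a shared key table
    uint8_t valueType;              // discriminates the union
    uint8_t minZoom;                // illustrative extra field
};
// 16 + 2 + 1 + 1 = 20 bytes with this ordering and no padding; the exact
// layout in the PR differs, but the idea is the same.
static_assert(sizeof(AttributePairSketch) == 20, "no padding with this ordering");
```
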
Not used by anything yet. Given Tilemaker's limited needs, we can get
away with a stripped-down string class that is less flexible than
std::string, in exchange for memory savings.

The key benefit: 16 bytes, not 32 bytes (g++) or 24 bytes (clang).

When it does allocate (for strings longer than 15 bytes), it allocates
from a pool so there's less per-allocation overhead.
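
A sketch of what a 16-byte pooled string can look like. The layout and markers here are guesses to illustrate the idea, not the actual PooledString:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// 16 bytes total. Short strings (<= 15 bytes) live inline; longer strings
// would record an offset/length into a shared pool, so there is no
// per-string malloc header.
class PooledStringSketch {
    uint8_t data[16];
public:
    explicit PooledStringSketch(const std::string& s) {
        if (s.size() <= 15) {
            data[0] = static_cast<uint8_t>(s.size());      // length marker
            std::memcpy(data + 1, s.data(), s.size());     // inline payload
        } else {
            data[0] = 0xFF;                                 // "in the pool" marker
            // ... append s to a shared pool and record offset/length in data[1..] ...
        }
    }
    bool isShort() const { return data[0] != 0xFF; }
};
static_assert(sizeof(PooledStringSketch) == 16, "fits the 16-byte budget");
```
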
...I'm going to replace the string implementation, so let's have some
backstop to make sure I don't break things.

Break dependency on AttributePair, just work on std::string.

...this will be useful for doing map lookups when testing if an
AttributePair has already been created with the given value.
AttributePair has now been trimmed from 48 bytes to 18 bytes. There are
40M AttributeSets for the planet. That suggests there are probably ~30M AttributePairs,
so hopefully this is a savings of ~900MB at the planet level.

Runtime doesn't seem affected.

There's a further opportunity for savings if we can make more strings
qualify for the short string optimization. Only about 40% of strings
fit in the 15 byte short string optimization.

Of the remaining 60%, many are Latin-alphabet title cased strings like
`Wellington Avenue` -- this could be encoded using 5 bits per letter,
saving us an allocation.

Even in the most optimistic case where:

- there are 30M AttributePairs
- of these, 90% are strings (= 27M)
- of these, 60% don't fit in SSO (= 16M)
- of these, we can make 100% fit in SSO

...we only save about 256MB at the planet level, but at some significant
complexity cost. So probably not worth pursuing at the moment.

When doing the planet, especially on a box with limited memory, there
are long periods with no output. Show some output so the user doesn't
think things are hung.

This also might be useful in detecting perf regressions more granularly.

When using --store, deque is nice because growing doesn't require
invalidating the old storage and copying it to a new location.

However, it's also bad, because deque allocates in 512-byte chunks,
which causes each 4KB OS page to have data from different z6 tiles.

Instead, use our own container that tries to get the best of both worlds.

Writing a random access iterator is new for me, so I don't trust this
code that much. The saving grace is that the container is very limited,
so errors in the iterator implementation may not get exercised in
practice.
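
The shape of the container, as a sketch. This is illustrative only: the real AppendVector allocates its chunks from the --store-backed allocator and carries a full random-access iterator, which is omitted here.

```cpp
#include <cstddef>
#include <vector>

// A vector of large fixed-size chunks. Like deque, existing elements never
// move when we grow; unlike deque's 512-byte blocks, each chunk spans many
// OS pages for typical element sizes, so neighbouring items (sorted by z6
// tile) stay together on disk. The chunk size here is arbitrary.
template <typename T, size_t ChunkSize = 8192>
class AppendVectorSketch {
    std::vector<std::vector<T>> chunks;
    size_t count = 0;
public:
    void push_back(const T& value) {
        if (chunks.empty() || chunks.back().size() == ChunkSize) {
            chunks.emplace_back();
            chunks.back().reserve(ChunkSize);   // one allocation per chunk
        }
        chunks.back().push_back(value);
        count++;
    }
    T& operator[](size_t i) { return chunks[i / ChunkSize][i % ChunkSize]; }
    size_t size() const { return count; }
};
```
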
This adds three methods to the stores:

- `shard()` returns which shard you are
- `shards()` returns how many shards total
- `contains(shard, id)` returns whether or not shard N has an item with
  id X

SortedNodeStore/SortedWayStore are not implemented yet, that'll come in
a future commit.

This will allow us to create a `ShardedNodeStore` and `ShardedWayStore`
that contain N stores. We will try to ensure that each store has data
that is geographically close to each other.

Then, when reading, we'll do multiple passes of the PBF to populate each store.
This should let us reduce the working set used to populate the stores,
at the cost of additional linear scans of the PBF. Linear scans of disk
are much less painful than random scans, so that should be a good trade.
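
In interface terms, the addition is roughly this. A sketch only; the real stores are concrete classes rather than an abstract base:

```cpp
#include <cstddef>
#include <cstdint>

using NodeID = uint64_t;

class ShardedStoreInterface {
public:
    virtual ~ShardedStoreInterface() = default;
    virtual size_t shard() const = 0;                           // which shard this store is
    virtual size_t shards() const = 0;                          // how many shards exist in total
    virtual bool contains(size_t shard, NodeID id) const = 0;   // does shard N hold item id X?
};

// A single, unsharded store can satisfy the same contract trivially:
class UnshardedStoreSketch : public ShardedStoreInterface {
public:
    size_t shard() const override { return 0; }
    size_t shards() const override { return 1; }
    bool contains(size_t shard, NodeID) const override { return shard == 0; }
};
```
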
I'm going to rejig the innards of this class, so let's have some tests.

In order to shard the stores, we need to have multiple instances
of the class.

Two things block this currently: atomics at file-level, and
thread-locals.

Moving the atomics to the class is easy.

Making the thread-locals per-class will require an approach similar
to that adopted in
https://github.com/systemed/tilemaker/blob/52b62dfbd5b6f8e4feb6cad4e3de86ba27874b3a/include/leased_store.h#L48,
where we have a container that tracks the per-class data.

Still only supports 1 class, but this is a step along the path.

D'oh, this "worked" due to two bugs cancelling each other:

(a) the code to find things in the low zoom list never found anything,
    because it assumed a base z6 tile of 0/0

(b) we weren't returning early, so the normal code still ran

Rejigged to actually do what I was intending.

Do a single pass, rather than one pass per zoom.

This distributes nodes into one of 8 shards, trying to roughly group
parts of the globe by complexity.

This should help with locality when writing tiles.

A future commit will add a ShardedWayStore and teach read_pbf to read in
a locality-aware manner, which should help when reading ways.
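
A sketch of how geographic shard selection can work: map a node's lon/lat to its z6 tile, then look the tile up in a hand-tuned table of shard numbers. The table contents and helper names here are placeholders; the actual split is the one in the geojson gist linked above.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

constexpr int Z6_TILES = 1 << 6;   // 64 x 64 tiles at zoom 6
constexpr double PI = 3.14159265358979323846;

int clampTile(int v) { return v < 0 ? 0 : (v >= Z6_TILES ? Z6_TILES - 1 : v); }
int z6TileX(double lon) { return clampTile(static_cast<int>((lon + 180.0) / 360.0 * Z6_TILES)); }
int z6TileY(double lat) {
    if (lat > 85.05) lat = 85.05;      // clamp to Web Mercator's usable range
    if (lat < -85.05) lat = -85.05;
    double y = (1.0 - std::asinh(std::tan(lat * PI / 180.0)) / PI) / 2.0;   // Web Mercator
    return clampTile(static_cast<int>(y * Z6_TILES));
}

struct ShardMapSketch {
    std::array<uint8_t, Z6_TILES * Z6_TILES> shardForTile{};   // filled from tuning runs
    uint8_t shardFor(double lon, double lat) const {
        return shardForTile[z6TileY(lat) * Z6_TILES + z6TileX(lon)];
    }
};
```
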
@systemed
Owner

Using the old (mid-2021) planet I've run previous tests with, and including shapefiles, memory consumption was 18.2GB - which is amazing. Total time 5hr39. (Before this PR it was 5hr12 and 40.2GB.)

Comparing with Europe, that suggests a very rough estimated RAM requirement of one-third the .osm.pbf size.

@systemed
Owner

Played with this a bit more today and still impressed. Also thanks for the copious comments which help me to understand what's going on!

I think the only suggestion I'd make is that we now have a fairly broad array of performance options (--no-compress-nodes, --no-compress-ways, --materialize-geometries, --shard-stores, plus of course --store and --compact have performance implications). I suspect most users won't understand which to pick.

I guess there are three common scenarios:

  1. Small extract (do everything in memory)
  2. Planet or large extract on expansive hardware (use store and optimise for run-time)
  3. Constrained hardware (use store and optimise for RAM consumption)

These could perhaps be represented by the following run-time options:

  1. (no flags specified)
  2. --store /path/to/ssd --fast (equivalent of --materialize-geometries on, --shard-stores off)
  3. --store /path/to/ssd (equivalent of --materialize-geometries off, --shard-stores on)

We can then simply tell people "if you have lots of memory and are working with a big extract, use the --fast option".

We can still retain the granular controls, but maybe put them in a separate "performance tuning" option group.

@systemed mentioned this pull request on Dec 21, 2023
It turns out that about 20% of LayerAsCentroid calls are for nodes,
which this branch could already do.

The remaining calls are predominantly ways, e.g. housenumbers.

We always materialize relation centroids, as they're expensive to
compute.

In GB, this saves about 6.4M points, ~102MB. Scaled to the planet, it's
perhaps a 4.5GB savings, which should let us use a more aggressive shard
strategy.

It seems to add 3-4 seconds to the time to process GB.
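
A sketch of the decision this commit implements; `CentroidRef` and `centroidFor` are hypothetical names used only to illustrate it:

```cpp
#include <cstdint>

enum class SourceType { Node, Way, Relation };
struct LatLon { double lat = 0, lon = 0; };

// Store just the OSM ID for nodes and ways (the centroid is recomputed from
// the stores later), but always materialize relation centroids because
// computing them is expensive.
struct CentroidRef {
    bool materialized = false;
    uint64_t osmId = 0;   // valid when !materialized
    LatLon point;         // valid when materialized
};

CentroidRef centroidFor(SourceType type, uint64_t osmId, bool materializeGeometries) {
    CentroidRef ref;
    if (type == SourceType::Relation || materializeGeometries) {
        ref.materialized = true;
        ref.point = LatLon{};   // = computeCentroid(type, osmId) -- hypothetical, costly for relations
    } else {
        ref.osmId = osmId;      // resolve lazily from the node/way store later
    }
    return ref;
}
```
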
@cldellow
Contributor Author

Yes, good call on the flags and de-emphasizing the individual knobs. I'll make that change.

This implements the idea in systemed#622 (comment)

Rather than storing a `deque<T>` and a `flat_map<T*, uint32_t>`,
store a `deque<T>` and `vector<uint32_t>`, to save 8 bytes per
AttributePair and AttributeSet.
Seems to save ~1.5 seconds on GB.

Shard 1 (North America) is ~4.8GB of nodes, shard 4 (some of Europe) is
3.7GB. Even ignoring the memory savings in the recent commits, these
could be merged.

We'd like to have different defaults based on whether `--store` is
present. Now that option parsing will have some more complex logic,
let's pull it into its own class so it can be more easily tested.

This has no performance impact as we never put anything in the 7th
shard, and so we skip doing the 7th pass in the ReadPhase::Ways and
ReadPhase::Relations phases.

The benefit is only to avoid emitting a noisy log about how the 7th store
has 0 entries in it.

Timings with 6 shards on Vultr's 16-core machine here: https://gist.github.com/cldellow/77991eb4074f6a0f31766cf901659efb

The new peak memory is ~12.2GB.

I am a little perplexed -- the runtime on a 16-core server was
previously:

```
$ time tilemaker --store /tmp/store --input planet-latest.osm.pbf --output tiles.mbtiles --shard-stores
real	195m7.819s
user	2473m52.322s
sys	73m13.116s
```

But with the most recent commits on this branch, it was:

```
real	118m50.098s
user	1531m13.026s
sys	34m7.252s
```

This is incredibly suspicious. I also tried re-running commit
bbf0957, and got:

```
real	123m15.534s
user	1546m25.196s
sys	38m17.093s
```

...so I can't explain why the earlier runs took 195 min.

Ideas:

- the planet changed between runs, and a horribly broken geometry was
  fixed

- Vultr gives quite different machines for the same class of server

- perhaps most likely: I failed to click "CPU-optimized" when picking
  the earlier server, and got a slow machine the first time, and a fast
  machine the second time. I'm pretty sure I paid the same $, so I'm
  not sure I believe this.

I don't think I really believe that a 33% reduction in runtime is
explained by any of those, though. Anyway, just another thing to
be befuddled by.
I did some experiments on a Hetzner 48-core box with 192GB of RAM:

--store, materialize geometries:
real 65m34.327s
user 2297m50.204s
sys 65m0.901s

The process often failed to use 100% of CPU -- if you naively divide
(user+sys)/real you get ~36, whereas the ideal would be ~48.

Looking at stack traces, it seemed to coincide with calls to Boost's
rbtree_best_fit allocator.

Maybe:

- we're doing disk I/O, and it's just slower than recomputing the geometries
- we're using the Boost mmap library suboptimally -- maybe there's
  some other allocator we could be using. I think we use the mmap
  allocator like a simple bump allocator, so I don't know why we'd need
  a red-black tree

--store, lazy geometries:
real 55m33.979s
user 2386m27.294s
sys 23m58.973s

Faster, but still some overhead ((user+sys)/real ≈ 43)

no --store, materialize geometries: OOM

no --store, lazy geometries (used 175GB):
real 51m27.779s
user 2306m25.309s
sys 16m34.289s

This was almost 100% CPU ((user+sys)/real ≈ 45)

From this, I infer:

- `--store` should always default to lazy geometries in order to
  minimize the I/O burden

- `--materialize-geometries` is a good default for non-store usage,
  but it's still useful to be able to override and use lazy geometries,
  if it then means you can fit the data entirely in memory
@cldellow
Contributor Author

Hopefully you ignored the noise of my commits during Christmas! :) Please don't feel any urgency to do anything with this or the other PRs I'll open this week -- this is just my version of tinkering with trains in the basement over the holidays.

Since my last comment:

I did some benchmarking [1] and observed that the logic should maybe be:

  • default to everything in memory, materialized geometries
    • but let a user override with --lazy-geometries, e.g. in the case where lazy geometries are enough to let you avoid needing --store
  • if --store is passed, default to lazy geometries
    • but let a user override with --materialize-geometries if they have really, really fast SSDs
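
A sketch of that default resolution (the flag names are real, the function and struct are illustrative, not the actual option-parsing class):

```cpp
#include <optional>

struct PerformanceOptions {
    bool useStore = false;                       // --store was given
    std::optional<bool> materializeGeometries;   // set by --materialize-geometries / --lazy-geometries
    bool shardStores = false;
};

// In-memory runs materialize geometries, --store runs default to lazy
// geometries, and an explicit flag always wins.
bool shouldMaterializeGeometries(const PerformanceOptions& opts) {
    if (opts.materializeGeometries.has_value())
        return *opts.materializeGeometries;      // user override
    return !opts.useStore;                       // default: materialize iff fully in memory
}
```
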

The --help after this commit:

tilemaker v2.4.0
Convert OpenStreetMap .pbf files into vector tiles

Available options:
  --help                       show help message
  --input arg                  source .osm.pbf file
  --output arg                 target directory or .mbtiles/.pmtiles file
  --bbox arg                   bounding box to use if input file does not have 
                               a bbox header set, example: 
                               minlon,minlat,maxlon,maxlat
  --merge                      merge with existing .mbtiles (overwrites 
                               otherwise)
  --config arg (=config.json)  config JSON file
  --process arg (=process.lua) tag-processing Lua file
  --verbose                    verbose error output
  --skip-integrity             don't enforce way/node integrity
  --log-tile-timings           log how long each tile takes

Performance options:
  --store arg                  temporary storage for node/ways/relations data
  --fast                       prefer speed at the expense of memory
  --compact                    use faster data structure for node lookups
                               NOTE: This requires the input to be renumbered 
                               (osmium renumber)
  --no-compress-nodes          store nodes uncompressed
  --no-compress-ways           store ways uncompressed
  --lazy-geometries            generate geometries from the OSM stores; uses 
                               less memory
  --materialize-geometries     materialize geometries; uses more memory
  --shard-stores               use an alternate reading/writing strategy for 
                               low-memory machines
  --threads arg (=0)           number of threads (automatically detected if 0)

[1]: Details in 657da1a - it wasn't quite this branch, it was this branch + protobuf + lua-interop

@cldellow mentioned this pull request on Dec 26, 2023
@systemed
Owner

All working really well! Ready to merge, do you think?

Running this PR with Great Britain on my usual box:

/usr/bin/time -v tilemaker --input /media/data1/planet/great-britain-latest.osm.pbf --output ~/tm_debug/gb5.mbtiles
	Elapsed (wall clock) time (h:mm:ss or m:ss): 4:59.99
	Maximum resident set size (kbytes): 12275684

/usr/bin/time -v tilemaker --input /media/data1/planet/great-britain-latest.osm.pbf --output ~/tm_debug/gb4.mbtiles --lazy-geometries
	Elapsed (wall clock) time (h:mm:ss or m:ss): 5:16.00
	Maximum resident set size (kbytes): 9155756

It's a big memory saving (25%) for a small time penalty (5%) - so maybe we should default to --lazy-geometries, both for in-memory and --store. But I realise one could probably bikeshed this all day. :)

@cldellow
Contributor Author

Yup, merge away.

I have no strong views on the defaults -- let me know if you'd like them changed.

@systemed merged commit d62c480 into systemed:master on Dec 28, 2023
5 checks passed
@systemed
Owner

Merged. Thank you again - this is going to make a massive difference to users.

I'll do some experimenting with the defaults before we release 3.0 but it's not crazily urgent.

cldellow added a commit to cldellow/tilemaker that referenced this pull request Sep 21, 2024
This fixes two issues:

- use an unsigned type, so we can use the whole 9 bits and have 512
  keys, not 256
- fix the bounds check in AttributeKeyStore to reflect the lower
  threshold that was introduced in systemed#618

Hat tip @oobayly for reporting this.

Fixes systemed#750.
systemed pushed a commit that referenced this pull request Sep 21, 2024