Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some memory and concurrency improvements #612

Merged
merged 24 commits into from
Dec 15, 2023

Conversation

cldellow
Copy link
Contributor

@cldellow cldellow commented Dec 14, 2023

This PR has a few things:

  • housekeeping -- tidy up where some files live, extract ClipCache to its own file
  • reduce lock contention
  • use ~33% less memory, as benchmarked on GB (or, use the same amount of memory, and be ~7% faster)

The memory reduction comes from leaving most ways in the WayStore and generating the linestring or polygon on-demand at writing time. There are a few tricks to mitigate the cost of rebuilding the geometry -- we only generate on-demand those ways that didn't need to be corrected, and each thread keeps a thread-local cache of linestrings that they recently constructed.

I explored extending this to relations as well, but it seemed like there was much less opportunity with relations: each individual relation is bigger, so reconstructing it is more costly; they're more often in need of correction; and in absolute terms, they're not as big of a memory hog as all the ways. My bag of tricks wasn't able to mitigate the runtime impact. There might be something there, but I'm going to admit defeat for now.

GB, master:

9,682MB RAM

SortedWayStore: 17036 groups, 116656 chunks, 381137 ways, 16588528 nodes, 49565026 bytes
Generated points: 6477007, lines: 7443584, polygons: 15168555

real	2m19.527s
user	31m44.574s
sys	0m25.943s

GB, this PR:

6,484MB RAM

SortedWayStore: 18614 groups, 1636695 chunks, 17554284 ways, 160743637 nodes, 650662740 bytes
Generated points: 6477007, lines: 1846, polygons: 5344375

real 2m20.764s
user 33m1.989s
sys 0m19.444s

GB, this PR with --materialize-geometries:

9,600MB RAM

SortedWayStore: 17036 groups, 116656 chunks, 381137 ways, 16588528 nodes, 49565026 bytes
Generated points: 6477007, lines: 7443584, polygons: 15168555

real	2m11.685s
user	31m1.558s
sys	0m18.525s

planet-latest, this PR on a 48-core Hetzner CCX63:

SortedNodeStore: 173473 groups, 41167239 chunks, 8773657450 nodes, 44932630766 bytes
SortedWayStore: 18746 groups, 4571385 chunks, 714744586 ways, 7912084618 nodes, 21576308504 bytes
Generated points: 204334361, lines: 56549, polygons: 229203193

real	67m51.201s
user	2717m54.920s
sys	36m53.642s

tilemaker used 42GB of RAM and 249GB of virtual memory to generate the planet -- so still a bit out of reach for people on smaller machines.

Some housekeeping: extract clip_cache.cpp
This provides a very small benefit. I think the reason is two-fold:
there aren't many multilinestrings (relative to multipolygons), and
clipping them is less expensive.

Still, it did seem to provide a small boost, so leaving it in.
This isn't super useful to end users, but is useful for developers.

If it's not OK to leave it in, let me know & I'll revert it.

You can then process the log:

```bash
$ for x in {0..14}; do echo -n "z$x "; cat log-write-node-attributes.txt  | grep ' took ' | sort -nk3  | grep z$x/ | awk 'BEGIN { min = 999999; max = 0;  }; { n += 1; t += $3; if ($3 > max) { max = $3; max_id = $1; } } END { print n, t, t/n, max " (" max_id ")" }'; done

z0 1 7.04769 7.04769 7.047685 (z0/0/0)
z1 1 9.76067 9.76067 9.760671 (z1/0/0)
z2 1 9.98514 9.98514 9.985141 (z2/1/1)
z3 1 9.98514 9.98514 9.985141 (z3/2/2)
z4 2 14.4699 7.23493 8.610035 (z4/5/5)
z5 2 20.828 10.414 13.956526 (z5/10/11)
z6 5 6464.05 1292.81 3206.252711 (z6/20/23)
z7 13 11306.4 869.727 3275.475707 (z7/40/46)
z8 35 15787.1 451.061 2857.506681 (z8/81/92)
z9 86 20723.8 240.974 1605.788985 (z9/162/186)
z10 277 25456.8 91.9018 778.311785 (z10/331/369)
z11 960 28851.3 30.0534 627.351078 (z11/657/735)
z12 3477 24031.6 6.91158 451.122972 (z12/1315/1471)
z13 13005 13763.7 1.05834 156.074701 (z13/2631/2943)
z14 50512 24214.7 0.479385 106.358450 (z14/5297/5916)
```

This shows each zoom's # of tiles, total time, average time, worst case
time (and the tile that caused it).

In general, lower zooms are slower than higher zooms. This seems
intuitively reasonable, as the lower zoom often contains all of
the objects in the higher zoom.

I would have guessed that a lower zoom would cost 4x the next higher
zoom on a per-tile basis. That's sort of the case for `z12->z13`,
`z11->z12`, `z10->z11`, and `z9->z10`. But not so for other zooms,
where it's more like a 2x cost.

Looking at `z5->z6`, we see a big jump from 10ms/tile to 1,292ms/tile.
This is probably because `water` has a minzoom of 6.

This all makes me think that the next big gain will be from re-using
simplifications.

This is sort of the mirror image of the clip cache:

- the clip cache avoids redundant clipping, and needs to be computed
  from lower zooms to higher zooms

- a simplification cache could make simplifying cheaper, but needs to
  be computed from higher zooms to lower zooms

The simplification cache also has two other wrinkles:

1. Is it even valid? e.g. is `simplify(object, 4)` the same as
   `simplify(simplify(object, 2), 2)` ? Maybe it doesn't have to be the
   same, because users are already accepting that we're losing accuracy
   when we simplify.

2. Rendering an object at `z - 1`, needds to (potentially) stitch together
   that object from 4 tiles at `z`. If those have each been simplified,
   we may introduce odd seams where the terminal points don't line up.
This saves a very little bit of time, but more importantly, tees up
lazily evaluating the nodes in a way.
Rather than locking on each store call, threads lease a range of the
ID space for points/lines/multilines/polygons. When the thread ends,
it return the lease.

This has some implications:

- the IDs are no longer guaranteed to be contiguous

- shapefiles are a bit weird, as they're loaded on the main
  thread -- so their lease won't be returned until program
  end. This is fine, just pointing it out.

This didn't actually seem to affect runtime that much on my 16 core
machine, but I think it'd help on machines with more cores.
When scaling to 32+ cores, this shows up as an issue. Try a really
blunt hammer fix.
`std::cout` has some internal locks -- instead, let's synchronize
explicitly outside of it so we control the contention.

If a worker fails to get the lock, just skip that worker's update.
If a worker can't get the lock, just skip their update.
This reverts commit 3e7b9b6.

This commit came about from some experiments that I had done
pre-SortedNodeStore.

In that world, lazily evaluating the nodes of a way provided a
meaningful savings if the way was ultimately discarded by the Lua
code.

Post-SortedNodeStore, it doesn't seem to matter as much. Which is great,
as it means the store is much faster, but also means this commit is
just noise.

You can see the POC code in https://github.com/cldellow/tilemaker/tree/lazy-way-nodes
Tilemaker previously stored the 2D geometries it produced from ways.

This commit makes Tilemaker use the OSM way store to generate linestrings
and polygons that originated with an OSM way. You can get the old
behaviour with `--materialize-geometries`, which is a sensible choice if
you are not memory constrained.

For GB:

before (available via `--materialize-geometries`): 2m11s, 9600MB
this commit:  2m20s, 6400MB

So ~8% slower, but 33% less memory.

I think it's probably reasonable for this to be the default, which has
nice symmetry with compressed nodes and compressed ways being the
default.

Building NA with --store still seems OK - 36min. I was concerned that
the increased node store lookups could be vulnerable to thrashing.
I do see some stuttering during tile writing, but I think the decreased
read iops from a smaller geometry store balance out the increased
read iops from looking up nodes. A future investigation might be to
have SortedWayStore store latplons rather than node IDs -- a bit
more memory, but should be less CPU and less vulnerable to thrashing.
Before writing, we compute the set of tiles to be written.

There were two opportunities for improvement here:

- determining which tiles were covered by our objects: we previously
  used a `std::set`, which has poor memory and runtime behaviour.
  Instead, use a fixed size `std::vector<bool>` -- this takes 64MB
  at z14, but gives much faster insert/query times

- determining if tiles were covered by clipping box: we used
  boost's intersect algorithm before, which required constructing
  a TileBbox and was a bit slow. In the case where the tile is
  contained in a z6 tile that is wholly covered by the clipping
  box, we can short-circuit

This has the most impact when the set of objects or tiles is very
large--e.g. Antarctica, North America or bigger.
On a 48-core server, I noticed lock contention on the mmap allocator.

So let's just always use pools of memory, and pick a bigger pool size.

This means we'll sometimes allocate memory that we don't use.

In the extreme case of Monaco, we only need like 200KB, but we'll
allocate several megs.

As you scale to larger PBFs, the waste trends to 0%, so this should
be fine in practice.
D'oh, clock_gettime is Linux-ish. `std::chrono` may have a
cross-platform option, but it's not clear.

For now, just omit this functionality on Windows. If we want to expose
it, we can explore something in std::chrono or make a wrapper that
calls QueryPerformanceCounter on Windows.
@systemed
Copy link
Owner

That's superb!

With North America I get an even bigger memory saving - down from 20.1GB to 13.2GB (using --store and with shapefiles). I think about 4GB of that is ClipCache so there may be some opportunity to optimise further, e.g. by using an LRU cache as you mention in the comments.

@systemed
Copy link
Owner

On my usual machine, 40.2GB (with --store) for the full planet including the poles, 5hr12m22. Really happy with that.

It was sitting on one thread for about 10/15 minutes at the end, which I'm guessing was crunching through a particularly complex set of geometries (very possibly at the poles).

@cldellow
Copy link
Contributor Author

I think about 4GB of that is ClipCache

Oh, wow. I hadn't thought too critically about this, but I didn't expect it to use quite that much memory.

The cache size does increase with more cores, so perhaps this is expected - for 24 cores, 4 GB is "only" 160MB/core.

I don't often test with the shapefiles, I should probably start doing that, especially as I think the runtime overhead is much smaller now (and especially after #614 lands).

so there may be some opportunity to optimise further, e.g. by using an LRU cache as you mention in the comments.

I imagined the LRU cache would help with runtime, but not with memory usage. The benefit I was hoping for is that we'd more often have relevant things in the cache, but I wasn't planning on shrinking the cache.

The cache might still be oversized, though. I didn't really experiment at all to find where the point of diminishing returns started, I just picked a number, observed an improvement and went with it.

It was sitting on one thread for about 10/15 minutes at the end

I also noticed this, and I noticed that GB had a similar straggler on the Hetzner box (Ubuntu 22.04).

Interestingly, the GB straggler doesn't appear when I run locally (Ubuntu 20.04) or when I ran on the Hetzner box prior to the #611 change.

So I think I concluded that something in #611 causes the slowdown, but only on more modern Boost versions than the one in Ubuntu 20.04.

According to https://packages.ubuntu.com/search?keywords=libboost-, 20.04 has 1.71 and 22.04 has 1.74.

Nothing in Boost's changelog jumps out at me, though.

@systemed
Copy link
Owner

The cache might still be oversized, though. I didn't really experiment at all to find where the point of diminishing returns started, I just picked a number, observed an improvement and went with it.

I did have a very brief attempt at downsizing the cache size and being a bit pickier about what we put in it (e.g. no simple polygons with <20 vertices) but it removed the performance gains. But there may still be possibilities along those lines.

So I think I concluded that something in #611 causes the slowdown, but only on more modern Boost versions than the one in Ubuntu 20.04.

In order of likelihood:

  • adding geom::correct before simplify_combine (geom.cpp) will have some overhead. It's possible that this is pathological in the case of particularly large polygons and we should call our own make_valid (aka geometry::correct) in these cases instead, but we'd have to be careful not to reintroduce the invalid geometries this was intended to fix.
  • we moved to the newer Boost.Geometry BOOST_GEOMETRY_NO_ROBUSTNESS behaviour
  • possible that something in RemovePartsBelowSize is taking a while (this should be easy to test - just remove filter_below and filter_area from the JSON config)

@systemed
Copy link
Owner

This looks good to merge to me - ok with that?

@cldellow
Copy link
Contributor Author

Yup!

@systemed systemed merged commit 52b62df into systemed:master Dec 15, 2023
5 checks passed
@systemed
Copy link
Owner

Excellent - thank you again!

@systemed
Copy link
Owner

Quick issue you might be able to help me with...

I've been playing around with a branch that can load GeoJSON files in the same way that we use shapefiles. (GeoJSON is easy to generate, and I quite often find myself writing little Ruby scripts that turn some data or other into GeoJSON.) At the same time, I've moved the shapefile code into its own class. It's at master...geojson

On shutdown, after the mbtiles/pmtiles file has been written, it terminates (on Mac) with:

Filled the tileset with good things at /Users/richard/Maps/planet/oxfordshire.pmtiles
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument

I've run valgrind over it and that shows:

==57979== Warning: set address range perms: large range [0x11fc54028, 0x12fc54058) (noaccess)
==57979== Invalid read of size 8
==57979==    at 0x100F41137: pthread_mutex_lock (in /usr/lib/system/libsystem_pthread.dylib)
==57979==    by 0x100237AAF: LeasedStore<std::__1::vector<boost::geometry::model::multi_linestring<boost::geometry::model::linestring<boost::geometry::model::d2::point_xy<double, boost::geometry::cs::cartesian>, std::__1::vector, mmap_allocator>, std::__1::vector, mmap_allocator>, std::__1::allocator<boost::geometry::model::multi_linestring<boost::geometry::model::linestring<boost::geometry::model::d2::point_xy<double, boost::geometry::cs::cartesian>, std::__1::vector, mmap_allocator>, std::__1::vector, mmap_allocator> > > >::~LeasedStore() (in /usr/local/bin/tilemaker)
==57979==    by 0x100C580D1: tlv_finalize (in /usr/lib/system/libdyld.dylib)
==57979==    by 0x100D3B6AB: exit (in /usr/lib/system/libsystem_c.dylib)
==57979==    by 0x100C583DB: start (in /usr/lib/system/libdyld.dylib)
==57979==    by 0x2: ???
==57979==    by 0x104F018FE: ???
==57979==    by 0x104F01908: ???
==57979==    by 0x104F0193E: ???
==57979==  Address 0x104ef0e18 is on thread 1's stack
==57979==  67808 bytes below stack pointer
[same error two more times]

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
==57979== 
==57979== Process terminating with default action of signal 6 (SIGABRT)
==57979==    at 0x100EF0D46: __pthread_sigmask (in /usr/lib/system/libsystem_kernel.dylib)
[etc.]

So I think it's something to do with what you've described here:

* remove locks from geometry stores

Rather than locking on each store call, threads lease a range of the
ID space for points/lines/multilines/polygons. When the thread ends,
it return the lease.

This has some implications:

- the IDs are no longer guaranteed to be contiguous

- shapefiles are a bit weird, as they're loaded on the main
  thread -- so their lease won't be returned until program
  end. This is fine, just pointing it out.

...and that I've missed something obvious. I can't see what that is (but then it is 1.30am). Any thoughts?

@cldellow
Copy link
Contributor Author

I gave that a quick whirl and didn't repro a crash (granted, on Linux). If it's easy to replicate the command you're running, I can try that.

Speculation: the LeasedStore destructor is running after the ShpMemTiles destructor.

In the OSM case, the LeasedStore destructor runs when the threads for the various ReadPhases are terminated.

In this case, if the reading of GeoJSON and SHP is happening on the main tilemaker thread, it'll run when the thread ends -- which might be after ShpMemTiles has been destroyed.

The overall design is a bit fragile/footgun-y at the moment.

If that's the case, possible fixes could include:

  • ensure reading happens on threads that terminate before the main thread
  • add a destructor to TileDataSource that ensures that the LeasedStore on that thread, if any, has released its leases

@cldellow
Copy link
Contributor Author

Ah, I repro a different error if I try that branch with some random geojson file:

Reading GeoJSON geojson
Reading shapefile urban_areas
terminate called after throwing an instance of 'std::runtime_error'
  what():  fatal: no available stores to lease

If I then wrap systemed:ae1981b...systemed:a1976d2 in the world's most pointless thread pool:

void GeoJSONProcessor::read(class LayerDef &layer, uint layerNum) {
	boost::asio::thread_pool pool(1);
	boost::asio::post(pool, [&]() {
		// Parse the JSON file into a RapidJSON document
		...
	});
	pool.join();
}

it runs as expected.

I think the precise behaviour you'd get would depend on the order your shapefiles or geojson files were processed (which I guess depends on your layer orders).

So I think that confirms that the leases either need to be manually relinquished (if running on the same thread), or the reading needs to happen in a separate thread to ensure the leases get returned.

@systemed
Copy link
Owner

Thanks! That fixes it - I've used the same thread pool approach as with shapefiles and it's fine now. Will do a bit more testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants