Fix: Linestrings that follow the same path but where one contains extra redundant points are not deduplicated #192
Conversation
Interesting approach. So, ...
For the test file, processing time increases from 14.5 s to 16.5 s.
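To make the underlying issue concrete, here is a minimal illustration (hypothetical, not code from this PR) of why two linestrings that trace the same path are not seen as duplicates when one carries an extra collinear vertex, and how `simplify(0)` normalizes them, assuming the `simplify(0)` discussed here refers to shapely's `simplify`:

```python
from shapely.geometry import LineString

# Two linestrings that follow the same path; the second has a redundant
# collinear vertex at (1, 1).
a = LineString([(0, 0), (2, 2)])
b = LineString([(0, 0), (1, 1), (2, 2)])

print(a.equals(b))                       # True: geometrically the same path
print(list(a.coords) == list(b.coords))  # False: the coordinate sequences differ,
                                         # so they are not treated as one shared arc

# simplify(0) drops vertices that deviate zero distance from the line between
# their neighbours, so both collapse to the same coordinate sequence.
print(list(b.simplify(0).coords) == list(a.coords))  # True
```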
I had a quick further look at point 3: it doesn't seem possible to avoid the conversions without rewriting quite some code, and then the question is whether it would really be faster... So it doesn't seem like "short term, low-hanging fruit".
I'm not sure what this test is testing... so it's difficult to judge whether the asserts should be this strict?
I implemented a numpy function to remove the collinear points. It has similar performance to simplify(0), but the extra conversions are avoided, so the test file can now be processed in about 15.2 s. I also had a look at the simplify function in ops, but:
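As a rough idea of what such a helper could look like, here is a minimal sketch (the name, signature, and details are assumptions, not the code in this PR) that walks over the interior vertices of one linestring and drops the ones that are exactly collinear with their neighbours:

```python
import numpy as np

def remove_collinear_points(coords):
    """Sketch: drop interior vertices of a single linestring that lie exactly
    on the line between the previously kept vertex and the next vertex."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) <= 2:
        return coords
    kept = [coords[0]]
    for cur, nxt in zip(coords[1:-1], coords[2:]):
        prev = kept[-1]
        # z-component of the 2D cross product of (cur - prev) and (nxt - cur);
        # zero means the three points are collinear.
        cross = (cur[0] - prev[0]) * (nxt[1] - cur[1]) - (cur[1] - prev[1]) * (nxt[0] - cur[0])
        if cross != 0:
            kept.append(cur)
    kept.append(coords[-1])
    return np.array(kept)

# e.g. remove_collinear_points([(0, 0), (1, 1), (2, 2), (2, 3)])
# -> array([[0., 0.], [2., 2.], [2., 3.]])
```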
Yes, I think a numpy approach would work as well, but from what I understand now it's still being done line by line.
I suppose it should be possible, but my search on the subject didn't bring up a lot of ideas for vectorized solutions...
One way (as you said before, there is more than one way to Rome), ...
I don't quite know why I didn't think of it sooner, but the current remove_collinear_points was actually quite easy to vectorize... The profiler says remove_collinear_points now takes 0.28 seconds for the test file, versus 0.382 seconds with the unvectorized version... The total timing didn't really change, but that's probably because there is some fluctuation in the performance of my development machine (it's on shared infrastructure).
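For reference, a vectorized per-linestring variant along those lines could compute all the cross products with array operations in one go (again only a sketch under the same assumptions, not necessarily the exact code in this PR):

```python
import numpy as np

def remove_collinear_points_vectorized(coords):
    """Sketch: keep only the vertices of one linestring where the path
    actually changes direction, using a single pass of array arithmetic."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) <= 2:
        return coords
    prev_vec = coords[1:-1] - coords[:-2]  # segment entering each interior vertex
    next_vec = coords[2:] - coords[1:-1]   # segment leaving each interior vertex
    # z-component of the 2D cross product; zero where three consecutive points are collinear
    cross = prev_vec[:, 0] * next_vec[:, 1] - prev_vec[:, 1] * next_vec[:, 0]
    keep = np.ones(len(coords), dtype=bool)
    keep[1:-1] = cross != 0
    return coords[keep]
```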
No problem! Once you get used to thinking in terms of adding another dimension it becomes more fun ;). I set up another branch to test integrating continuous benchmarking into GitHub Actions, inspired by your other PR. This would also give some timings for different types of files as part of each PR.
Clarification: it is vectorized per linestring, not for all lines in one go. That would be possible as well, but it would need quite ugly code with sparse arrays...
Sounds quite cool... I wonder how stable the results will be performance-wise on the GitHub infrastructure, but if it's reasonable it would be great. Sadly I won't be able to "borrow" the idea for geofileops, because that project is focused on parallelizing spatial operations, so you need a reasonable number of CPUs to benchmark it properly...
Ragged arrays are indeed not great. https://github.com/mattijn/topojson/blob/master/topojson/ops.py#L513 Like this I can get the most out of numpy. When I find some time I can have a look as well.
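As an illustration of processing all linestrings in one go, one option is to pad the ragged set of coordinate arrays into a single rectangular NaN-filled array and run the cross-product test over the whole batch. This is a hypothetical sketch (not necessarily what the linked `ops.py` code does); the padding and the pack/unpack loops are exactly where the memory and overhead costs come from:

```python
import numpy as np

def remove_collinear_all(linestrings):
    """Sketch: apply the collinearity test to all linestrings at once by
    padding them into one (n_lines, max_len, 2) array filled with NaN."""
    n_max = max(len(ls) for ls in linestrings)
    padded = np.full((len(linestrings), n_max, 2), np.nan)
    for i, ls in enumerate(linestrings):  # packing still loops in Python
        padded[i, : len(ls)] = ls

    # Cross products for every interior vertex of every linestring in one go.
    prev_vec = padded[:, 1:-1] - padded[:, :-2]
    next_vec = padded[:, 2:] - padded[:, 1:-1]
    cross = prev_vec[..., 0] * next_vec[..., 1] - prev_vec[..., 1] * next_vec[..., 0]

    result = []
    for i, ls in enumerate(linestrings):  # unpacking loops again
        ls = np.asarray(ls, dtype=float)
        keep = np.ones(len(ls), dtype=bool)
        keep[1:-1] = cross[i, : len(ls) - 2] != 0
        result.append(ls[keep])
    return result
```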
Yeah... certainly memory-footprint-wise this will be quite heavy for larger files... so it doesn't sound very scalable. Performance-wise it also creates quite some overhead... but possibly the looping is so bad that it still compensates...
Feel free!
I'm fine with this as is. Thanks again!
Generated benchmark image:
Fixes #191