Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent overlong tag strings #26

Merged
merged 15 commits into from
Dec 5, 2022
Merged

Prevent overlong tag strings #26

merged 15 commits into from
Dec 5, 2022

Conversation

SlowMo24
Copy link

@SlowMo24 SlowMo24 commented Dec 2, 2022

When merging tags of duplicate elements, the tag string may exceed the maximum allowed number of 255 characters in OSM. I guess this could also happen, if the original source has more than 255 character attributes?

This PR prevents such tags by replacing overlong tags with .... I'm not sure if the static vars (PLACEHOLDER and MAX_TAG_LENGTH are stored at a reasonable location. Feel free to request any changes!

This PR additionally add info on how to run tests. I have run the test and they succeed.

This PR originates from this tool, where I ran into the overlong string issue here.

PS: Very nice tool you (all) created!

@SlowMo24 SlowMo24 mentioned this pull request Dec 2, 2022
@roelderickx
Copy link
Owner

This is a problem I encountered myself while testing the implementation of #7 and I have seen it being a problem for a few users before. Most of these users use a translation file to avoid merging, which results in invalid OSM files but it may be acceptable if you understand what you're doing.
pnorman's version solved it by keeping the tags of the first geometry and ignoring all others. That's of course not an ideal solution either - and it doesn't guarantee the strings are less than 255 characters.

I agree we need a solution for this, but I wasn't really sure what road to take. There are several solutions:

  • Hardcode the maximum length of 255 characters as you did
  • Introduce a parameter to set the maximum length (default = 255, 0 = unlimited)
  • Require users to write a translation file, or to use one which is already provided

For the last solution I was considering providing a number of predefined translation files in the ogr2osm-translations project for some general use-cases, limiting the length could be one of them. The downside is that the translation files are difficult to use, it requires users to be programmers.

So maybe I favour the second solution, and I'd add it in TranslationBase.merge_tags to make sure any new DataWriter outputs the correct value. In that case the performance is to be verified though.

@SlowMo24
Copy link
Author

SlowMo24 commented Dec 2, 2022

Most of these users use a translation file to avoid merging

Thats also what I did but the result not only becomes invalid, it becomes broken (closed lines will no longer be closed). osm2pgsql still accepts it, but half of the polygons will be lines and you wonder why. So I adapted the translation file to handle the cut-off: https://gitlab.gistools.geog.uni-heidelberg.de/giscience/deleted_map/-/blob/main/translations/processing.py (attention, ugly code ;-) ). Yet, I feel that this is not the right location to do so, because it is more cumbersome than the proposed fix in this PR.

Thats why I would also go for option 2

Introduce a parameter to set the maximum length (default = 255, 0 = unlimited)

If I can support you, feel free to ask but you know the code much better than me so...

When implementing this I also suggest considering #28 (which can also be considered a breaking change).

@roelderickx
Copy link
Owner

So I adapted the translation file to handle the cut-off: https://gitlab.gistools.geog.uni-heidelberg.de/giscience/deleted_map/-/blob/main/translations/processing.py (attention, ugly code ;-) ). Yet, I feel that this is not the right location to do so, because it is more cumbersome than the proposed fix in this PR.

It is indeed cumbersome. I was hoping to gain some performance once the string is too long, there would be no need to reconstruct the whole tags dictionary then. But on the other hand there is a lot of overhead while testing for the length. I'll have to dig somewhat deeper and test a few things out first, it will take some time. Since it does not alter the functionality compared to how it works now in your PR, it is better to do this later.

Meanwhile I added a new parameter --max-tag-value-length in your branch, the test file you provided works with the default value of 255. I didn't test any other value yet 🙂 The documentation should still be updated for this parameter, both in the standalone and in the library section.

I'll think about the location of the PLACEHOLDER constant. On one hand it is fine, but I don't really like the import of DataWriterBase in osm_geometries.py. In the PbfDataWriter class there are some constants defined as class variables, but fair enough, that's not how it should be done either.

@SlowMo24
Copy link
Author

SlowMo24 commented Dec 2, 2022

hey, thanks for taking this on. These commits look great.

there would be no need to reconstruct the whole tags dictionary then

An implementation after the full tag has been constructed (as it is in this PR) makes it a bit easier to cut the string at an arbitrary location (instead of on a per-value basis).

I tried to update the documentation. Feel free to comment/adapt.

0 disables the limit

I added that functionality (think it was missing before?).

In that case the performance is to be verified though

I collected some anecdotal evidence with a Geopackage of roughly 5mio elements and I saw no significant time difference. I think its only a single computation against all tags if length(tag)>max, which should not have so much impact. But I could be wrong.

#/usr/bin/time -a -o time.txt python -m ogr2osm --pbf -f deleted_old.gpkg

# current main
# new file
150.24user 3.38system 2:33.89elapsed 99%CPU (0avgtext+0avgdata 4997012maxresident)k
0inputs+69624outputs (0major+1282542minor)pagefaults 0swaps

# overwrite existing file
151.03user 3.42system 2:34.77elapsed 99%CPU (0avgtext+0avgdata 4998060maxresident)k
0inputs+69536outputs (0major+1282517minor)pagefaults 0swaps

# this PR
# new file (0.4% slower)
151.22user 3.29system 2:34.63elapsed 99%CPU (0avgtext+0avgdata 4991952maxresident)k
0inputs+68352outputs (0major+1281076minor)pagefaults 0swaps

# overwrite existing file (0.2% slower)
151.21user 3.54system 2:35.15elapsed 99%CPU (0avgtext+0avgdata 4991900maxresident)k
0inputs+68264outputs (0major+1280698minor)pagefaults 0swaps

I didn't test any other value ye

I did, it works:

10 returns v="Descrip..." and 1000 and -3both return uncropped tags.

I don't really like the import of DataWriterBase in osm_geometries.py

I had the same feeling. Could this be a global constant? Another thing is the duplicate code between the PBF and the XML-Writer.

@roelderickx
Copy link
Owner

Thanks for your modifications. I realized too late I didn't implement the unlimited length case and the same is true for the issues with PbfDataWriter, I was clearly too tired to continue.
I thought about it today and decided to simplify the parameter name to --max-tag-length, I think it is clear enough and somewhat shorter. The import of DataWriterBase in osm_geometries.py is omitted now, at the cost of an extra parameter. It is cleaner if the class is reused elsewhere, which is great in theory but of course never happens 🙂 . I did some cleanup as well.
The tests pass, if you don't see any further issues then I'll merge the pull request.

@SlowMo24
Copy link
Author

SlowMo24 commented Dec 4, 2022

LGTM, thanks for the friendly collaboration!

@roelderickx roelderickx merged commit e9efa1a into roelderickx:main Dec 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants