Spellchecking warnings for certain tag names (to catch typos by users) #7754

Open
peternewman opened this issue Jun 28, 2020 · 3 comments
Labels
considering Not Actionable - still considering if this is something we want

Comments

@peternewman
Contributor

So I, and previously 90 other people, misspelt pavilion with two L's while tagging in OSM (on my part because it didn't show in the drop-down of the building key; yes, I probably should have used a preset or noticed it didn't flag one).

So I've created https://github.com/openstreetmap/iD/pull/7749/files which I think will fix the outstanding ones.

I've also done a codespell ( https://github.com/codespell-project/codespell/ ) run across the repo and fixed the obvious issues there in #7752 .

However it seems to me there are options to possibly improve the user tagging experience (and stop typo tags gaining traction), by effectively spellchecking a subset of the tags, probably simply based upon an approved list of possible typos.

For example take the codespell dictionaries:
https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/

Then do a reverse lookup against the presets for known typos of their keys or values, and use only those entries (which avoids incorrectly flagging a new tag that merely looks like a typo). Obviously don't check Name/Brand/Operator etc. (or any of the address bits). Perhaps, as a safer option, only check tags which exist in the presets. So for example, because there is a preset of building=pavilion, you'd search codespell for typos of building and find:

buiding->building
buidling->building
bulding->building
buliding->building

And search for pavilion and find:
pavillion->pavilion

Therefore, if I type any of the words on the left, iD flags an issue saying that I probably mean the word on the right-hand side.
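To make the idea concrete, here is a minimal sketch of the check in Python (not iD's actual code; iD itself is JavaScript). The typo table is copied from the examples above. The names `KNOWN_TYPOS`, `FREEFORM_KEYS`, `FREEFORM_PREFIXES` and `check_tag` are hypothetical, and the list of free-form keys to skip is an assumption based on the "Name/Brand/Operator and address bits" remark.

```python
# Typo table taken from the examples in this comment (typo -> correction).
KNOWN_TYPOS = {
    "buiding": "building",
    "buidling": "building",
    "bulding": "building",
    "buliding": "building",
    "pavillion": "pavilion",
}

# Assumed free-form keys that should never be spellchecked (hypothetical list).
FREEFORM_KEYS = {"name", "brand", "operator"}
FREEFORM_PREFIXES = ("addr:",)


def check_tag(key, value):
    """Return (part, typed, suggestion) tuples for likely typos in one tag."""
    # Free-form tags are skipped entirely, so names are never flagged.
    if key in FREEFORM_KEYS or key.startswith(FREEFORM_PREFIXES):
        return []
    issues = []
    if key in KNOWN_TYPOS:
        issues.append(("key", key, KNOWN_TYPOS[key]))
    if value in KNOWN_TYPOS:
        issues.append(("value", value, KNOWN_TYPOS[value]))
    return issues
```

With this, `check_tag("building", "pavillion")` reports the value as a probable typo of `pavilion`, while `check_tag("name", "pavillion")` reports nothing because the key is treated as free-form.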

@quincylvania
Collaborator

@peternewman This sounds cool in theory. I'm not sure if it'd be useful in practice—it might be! I'm a bit worried about the size of shipping a spelling dictionary in iD or the complexity of calling out to an API.

Generally I imagine that mappers use either iD's presets or TagInfo suggestions for selecting tags, so in either case the spelling is handled for them. Do you have any sense of how widespread spelling problems are? A few hundred misspelled tags per year would be regrettable, but probably not a major issue.

#4579 is about flagging tags that don't appear in the OSM database yet… perhaps that would be sufficient?

quincylvania added the "considering" label (Not Actionable - still considering if this is something we want) on Oct 26, 2020
@peternewman
Contributor Author

> @peternewman This sounds cool in theory. I'm not sure if it'd be useful in practice - it might be! I'm a bit worried about the size of shipping a spelling dictionary in iD or the complexity of calling out to an API.

As mentioned, if you only pick "relevant" words and filter the dictionary the resulting file shouldn't end up too large. I'd agree calling out to an API is probably more hassle than it's worth.
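As a rough sketch of that filtering step, assuming the dictionary is a list of `typo->correction` lines in the shape shown earlier in this thread (real codespell files may carry extra syntax this does not handle), something like the following would keep only the entries whose correction is a preset key or value. `parse_dictionary` and `filter_for_presets` are hypothetical names, and the `teh->the` line is only a made-up unrelated entry for illustration.

```python
def parse_dictionary(lines):
    """Parse 'typo->correction' lines into a dict, skipping anything else."""
    typos = {}
    for line in lines:
        typo, sep, corrections = line.strip().partition("->")
        if not sep:
            continue  # not a typo entry
        typo = typo.strip()
        # If several comma-separated suggestions are listed, keep the first.
        first = corrections.split(",")[0].strip()
        if typo and first:
            typos[typo] = first
    return typos


def filter_for_presets(typos, preset_tags):
    """Keep only entries whose correction is a key or value used by a preset."""
    relevant = set()
    for key, value in preset_tags:
        relevant.add(key)
        relevant.add(value)
    return {typo: fix for typo, fix in typos.items() if fix in relevant}


sample_lines = ["buiding->building", "pavillion->pavilion", "teh->the"]
preset_tags = [("building", "pavilion")]
filtered = filter_for_presets(parse_dictionary(sample_lines), preset_tags)
```

Here `filtered` keeps the two entries that correct to `building` or `pavilion` and drops the unrelated one, so the shipped file grows with the number of preset words rather than with the size of the full dictionary.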

> Generally I imagine that mappers use either iD's presets or TagInfo suggestions for selecting tags, so in either case the spelling is handled for them.

I'd agree iD presets should be fine (or can be fixed and typos handled at source). You have to use TagInfo for stuff like my ramp=separate, as there aren't presets for it, and I don't think presets would really make sense there. The catch is that TagInfo is itself the source of the problem:
https://taginfo.openstreetmap.org/keys/?key=ramp#values

I'm not sure it's possible to pick a level where typos like sepErate are ignored without filtering out some possible but infrequent suggestions too. E.g. the value you've picked in #7203 won't work for this case. There's a chance it might have if it had been in place from the beginning, but that doesn't cover all the other editors.

> Do you have any sense of how widespread spelling problems are? A few hundred misspelled tags per year would be regrettable, but probably not a major issue.

I'm not sure off-hand. If there's an easy way to dump the key-value pairs from OSM it would be pretty easy to generate some general stats on it.
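For what it's worth, the counting itself could be as simple as the sketch below, assuming the key-value pairs are already available as an iterable of `(key, value)` tuples (how to get that dump out of OSM is left open here) and that `typos` is a typo table like the one above. `count_typos` is a hypothetical name and the sample pairs are made up purely to show the shape of the input.

```python
from collections import Counter


def count_typos(pairs, typos):
    """Count how often each known typo appears as a key or as a value."""
    counts = Counter()
    for key, value in pairs:
        for word in (key, value):
            if word in typos:
                counts[word] += 1
    return counts


# Made-up sample pairs, only to illustrate the input shape.
sample_pairs = [
    ("building", "pavillion"),
    ("building", "pavilion"),
    ("buidling", "yes"),
    ("building", "pavillion"),
]
typos = {"pavillion": "pavilion", "buidling": "building"}
stats = count_typos(sample_pairs, typos)
```

`stats.most_common()` would then list the typos by frequency, which is the kind of number needed to judge how widespread the problem is.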

> #4579 is about flagging tags that don't appear in the OSM database yet… perhaps that would be sufficient?

That looks like it's primarily about keys, not values?

@1ec5
Collaborator

1ec5 commented Oct 29, 2020

> > #4579 is about flagging tags that don't appear in the OSM database yet… perhaps that would be sufficient?

> That looks like it's primarily about keys, not values?

I proposed starting out with keys, but in principle the same mechanism could be extended to values if we can reliably distinguish enumerated keys from freeform keys: #4579 (comment).
