Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676

Closed
cfi-gb opened this issue Sep 25, 2018 · 17 comments · Fixed by #1656
Closed

Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676

cfi-gb opened this issue Sep 25, 2018 · 17 comments · Fixed by #1656

Comments

@cfi-gb
Copy link
Contributor

cfi-gb commented Sep 25, 2018

Running codespell 1.13.0 against a file containing e.g. a comment like:

# Please see http://example.com/cas/ for more info

is showing a spelling mistake for "cas". Codespell could ignore such URLs as (from my experience) most of these are false positives.

@rgetz
Copy link

rgetz commented Apr 29, 2020

Yeah, I see the same with version 1.16.0

/* Many of these debug printf include a Flawfinder: ignore, this is because,
 * according to https://cwe.mitre.org/data/definitions/134.html which describes
 * functions that accepts a format string as an argument, but the format
 * ... */

produces

./debug.h:49: mitre ==> miter

which would obviously break the url.

@zygoloid
Copy link

zygoloid commented Jul 7, 2020

Similarly this fires for URLs in Markdown sources. For example:

- [Type inference and type error diagnosis for Hindley/Milner with extensions (Wazny, J. R., 2006)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.5050&rep=rep1&type=pdf)

... reports that "ist" is a spelling error. It seems like it should be straightforward to ignore URLs with common protocols (eg, skip from http[s]:// to the next likely word break); would a patch to do that be accepted?

@peternewman
Copy link
Collaborator

peternewman commented Jul 15, 2020

Thanks to @jonmeow for his work on this in #1592 .

Initially I personally thought this was really neat and a nice way to fix it, but then I pondered a bit more, it only took a few tries with our list of Microsoft typos to find some on grep.app that were genuine typos (as opposed to all the fishing attempts), for example this one:
https://grep.app/search?q=Micrsoft.com

Wouldn't it be better if they were caught in your documentation?

I wonder if the better solution is a URI/email specific skip list, so you can tell it to ignore ist or cas in those places, but still find the other places where you meant its or case?

./debug.h:49: mitre ==> miter

This specific example the typo has been correctly moved to the GB to US dictionary.

@rgetz
Copy link

rgetz commented Jul 15, 2020

./debug.h:49: mitre ==> miter

This specific example the typo has been correctly moved to the GB to US dictionary.

but that's not the fix - "mitre" is the name of a company. it's not an "re" to "er" issue.

https://www.mitre.org/about/corporate-overview

How about adding an option to check urls with the urllib ?

That is the purpose of codespell - checking for correctness - spelling is the main things, but for urls spelling doesn't matter - Few web sites have proper spelling in their urls - it's the response code (do you get a 404 or something else?)

Just a thought.

It eliminate the possibility of offline checking, but who is offline anymore?

@peternewman
Copy link
Collaborator

This specific example the typo has been correctly moved to the GB to US dictionary.

but that's not the fix - "mitre" is the name of a company. it's not an "re" to "er" issue.

I know, my point was codespell was over-eager there, it isn't any more.

I'd imagine in most cases you're probably referring to your own company or organisation, so it can be allowed throughout documentation.

How about adding an option to check urls with the urllib ?

I'd be amazed if that doesn't exist already, if not, to me at least, it should be another project.

That is the purpose of codespell - checking for correctness - spelling is the main things, but for urls spelling doesn't matter

I'd actually disagree there, while I was trying to find some examples, I kept stumbling across other examples of typos:
https://grep.app/search?q=micosoft.com

Looking at them, I suspect a good chunk of them are for nefarious purposes on a few of those repos, so I'm sure they go to a lot of effort to ensure you get a 200 regardless of whatever URL you try to browse to at their website and then serve up something nasty, urllib won't catch that.

Also on the practical front, I'm involved in two projects which check URLs as part of their validation, you'll get false positives half the time:
OpenLightingProject/open-fixture-library#999 (you can see the intermittentness in the ticks and crosses)
https://github.com/OpenLightingProject/rdm-app/blob/master/data/manufacturer_test.py#L82-L111 (half of these still flake out half the time)

Few web sites have proper spelling in their urls

https://www.mitre.org/about/corporate-overview
#676

Your example, and the URL of this page, beg to differ!

It eliminate the possibility of offline checking, but who is offline anymore?

It could always be an option and fail back gracefully.

@zygoloid
Copy link

I'd imagine in most cases you're probably referring to your own company or organisation, so it can be allowed throughout documentation.

This is not a correct assumption in general. My example was a link that has nothing to do with my company or organization, and that involves a "word" that is unlikely to appear anywhere else in the rest of our repository, and is a plausible misspelling outside of URLs.

Few web sites have proper spelling in their urls

https://www.mitre.org/about/corporate-overview
#676

Your example, and the URL of this page, beg to differ!

In the URL of this page, I see https, www, github, com, and codespell, all of which are not words.

@peternewman
Copy link
Collaborator

I'd imagine in most cases you're probably referring to your own company or organisation, so it can be allowed throughout documentation.

This is not a correct assumption in general. My example was a link that has nothing to do with my company or organization, and that involves a "word" that is unlikely to appear anywhere else in the rest of our repository, and is a plausible misspelling outside of URLs.

Okay, but an ignore list specific to URLs would still cover this use case wouldn't it? With the benefit it would actually trap genuine typos within URLs, or worse mistakes which mean links point to dubious sites which are typo-squatting.

In the URL of this page, I see https, www, github, com, and codespell, all of which are not words.

Although they aren't words, they also aren't typos though (remembering #1535).

@zygoloid
Copy link

Outside of URLs, "com" is probably a typo for "come".

@peternewman
Copy link
Collaborator

Outside of URLs, "com" is probably a typo for "come".

I had a look at quite a few pages of grep.app and didn't find that typo, most usages were COM in terms of communications port. We don't have any typos in the dictionary for it either.

@jonmeow
Copy link
Contributor

jonmeow commented Jul 22, 2020

Would it make sense if I switched my PR around to have a dedicated regexp for emails/urls, and a separate flag like -L to only ignore certain misspellings in the email/uri? Maybe with support like "--ignore-in-uri=*" to ignore everything if a user wants?

@peternewman
Copy link
Collaborator

Would it make sense if I switched my PR around to have a dedicated regexp for emails/urls,

Yeah that would be my preference @jonmeow ; actually why not pivot, strip the URI stuff out of your existing PR and re-purpose it as a generic ignore regex option, for example I guess someone may want to ignore MACs or UUIDs or something, as they may get misinterpreted as words (e.g. ba->by, be,), rather than throwing away some perfectly good code.

and a separate flag like -L to only ignore certain misspellings in the email/uri? Maybe with support like "--ignore-in-uri=*" to ignore everything if a user wants?

I suspect it would make sense to just prefix the new options with uri, perhaps just start off with --uri-ignore-words-list, but it gives space for someone to implement --uri-ignore-words if people want it in a file in future, etc as well as any other options in a logical way.

@peternewman
Copy link
Collaborator

I suspect it would make sense to just prefix the new options with uri, perhaps just start off with --uri-ignore-words-list, but it gives space for someone to implement --uri-ignore-words if people want it in a file in future, etc as well as any other options in a logical way.

The dictionary option is also potentially likely to be up there as being a useful extra.

@jonmeow
Copy link
Contributor

jonmeow commented Jul 27, 2020

Stemming from comments on #1592, I want to double-check my plans:

  • Put in a regex for URIs. It'll only change behavior if a relevant flag is set.
  • Add --uri-ignore-words-list with semantics mirroring --ignore-words-list.

Intended matches will be, roughly:

  • regular URIs:
    • http://foo
    • http://foo.com
    • http://foo.com/path/to/resource?args=vals
    • ipv4 and ipv6 domains for http
    • https, https, ftp, tftp, file, git, smb
  • emails:
    • foo@bar.com
    • foo@bar -- I still have some hesitance about this, since I've never seen it and feel I've seen @ used other ways, but will include per request.
    • Ensure it works with an optional mailto

If you have a preference for what I do as far as a regex, please let me know -- I may see if something really loose, like \W((?:https?|t?ftp|file|git|smb)://\S*), gives reasonable results since I think this doesn't need to be a URI validator.

@peternewman
Copy link
Collaborator

Given #1656 is now in thanks to @jonmeow 's efforts, I think this issue can now be closed.

@LennyPhoenix
Copy link

LennyPhoenix commented Apr 23, 2024

Just installed fresh and I'm still getting spelling lints in URLs, even though this is closed?

image

(This is a markdown file)

@jonmeow
Copy link
Contributor

jonmeow commented Apr 23, 2024

@LennyPhoenix This is not a default behavior, it is something you must opt into. For example, --uri-ignore-words-list=* disables all spelling suggestions in URIs. See codespell -h for flag documentation.

@LennyPhoenix
Copy link

Understood, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants