Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676

cfi-gb · 2018-09-25T06:48:39Z

Running codespell 1.13.0 against a file containing e.g. a comment like:

# Please see http://example.com/cas/ for more info

is showing a spelling mistake for "cas". Codespell could ignore such URLs as (from my experience) most of these are false positives.

The text was updated successfully, but these errors were encountered:

rgetz · 2020-04-29T18:21:37Z

Yeah, I see the same with version 1.16.0

/* Many of these debug printf include a Flawfinder: ignore, this is because,
 * according to https://cwe.mitre.org/data/definitions/134.html which describes
 * functions that accepts a format string as an argument, but the format
 * ... */

produces

./debug.h:49: mitre ==> miter

which would obviously break the url.

zygoloid · 2020-07-07T00:39:45Z

Similarly this fires for URLs in Markdown sources. For example:

- [Type inference and type error diagnosis for Hindley/Milner with extensions (Wazny, J. R., 2006)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.5050&rep=rep1&type=pdf)

... reports that "ist" is a spelling error. It seems like it should be straightforward to ignore URLs with common protocols (eg, skip from http[s]:// to the next likely word break); would a patch to do that be accepted?

peternewman · 2020-07-15T15:46:15Z

Thanks to @jonmeow for his work on this in #1592 .

Initially I personally thought this was really neat and a nice way to fix it, but then I pondered a bit more, it only took a few tries with our list of Microsoft typos to find some on grep.app that were genuine typos (as opposed to all the fishing attempts), for example this one:
https://grep.app/search?q=Micrsoft.com

Wouldn't it be better if they were caught in your documentation?

I wonder if the better solution is a URI/email specific skip list, so you can tell it to ignore ist or cas in those places, but still find the other places where you meant its or case?

according to https://cwe.mitre.org/data/definitions/134.html which describes

./debug.h:49: mitre ==> miter

This specific example the typo has been correctly moved to the GB to US dictionary.

rgetz · 2020-07-15T18:01:24Z

according to https://cwe.mitre.org/data/definitions/134.html which describes

./debug.h:49: mitre ==> miter

This specific example the typo has been correctly moved to the GB to US dictionary.

but that's not the fix - "mitre" is the name of a company. it's not an "re" to "er" issue.

https://www.mitre.org/about/corporate-overview

How about adding an option to check urls with the urllib ?

That is the purpose of codespell - checking for correctness - spelling is the main things, but for urls spelling doesn't matter - Few web sites have proper spelling in their urls - it's the response code (do you get a 404 or something else?)

Just a thought.

It eliminate the possibility of offline checking, but who is offline anymore?

peternewman · 2020-07-15T22:27:39Z

This specific example the typo has been correctly moved to the GB to US dictionary.

but that's not the fix - "mitre" is the name of a company. it's not an "re" to "er" issue.

I know, my point was codespell was over-eager there, it isn't any more.

I'd imagine in most cases you're probably referring to your own company or organisation, so it can be allowed throughout documentation.

How about adding an option to check urls with the urllib ?

I'd be amazed if that doesn't exist already, if not, to me at least, it should be another project.

That is the purpose of codespell - checking for correctness - spelling is the main things, but for urls spelling doesn't matter

I'd actually disagree there, while I was trying to find some examples, I kept stumbling across other examples of typos:
https://grep.app/search?q=micosoft.com

Looking at them, I suspect a good chunk of them are for nefarious purposes on a few of those repos, so I'm sure they go to a lot of effort to ensure you get a 200 regardless of whatever URL you try to browse to at their website and then serve up something nasty, urllib won't catch that.

Also on the practical front, I'm involved in two projects which check URLs as part of their validation, you'll get false positives half the time:
OpenLightingProject/open-fixture-library#999 (you can see the intermittentness in the ticks and crosses)
https://github.com/OpenLightingProject/rdm-app/blob/master/data/manufacturer_test.py#L82-L111 (half of these still flake out half the time)

Few web sites have proper spelling in their urls

https://www.mitre.org/about/corporate-overview
#676

Your example, and the URL of this page, beg to differ!

It eliminate the possibility of offline checking, but who is offline anymore?

It could always be an option and fail back gracefully.

zygoloid · 2020-07-15T22:46:26Z

I'd imagine in most cases you're probably referring to your own company or organisation, so it can be allowed throughout documentation.

This is not a correct assumption in general. My example was a link that has nothing to do with my company or organization, and that involves a "word" that is unlikely to appear anywhere else in the rest of our repository, and is a plausible misspelling outside of URLs.

Few web sites have proper spelling in their urls

https://www.mitre.org/about/corporate-overview
#676

Your example, and the URL of this page, beg to differ!

In the URL of this page, I see https, www, github, com, and codespell, all of which are not words.

peternewman · 2020-07-16T01:33:18Z

I'd imagine in most cases you're probably referring to your own company or organisation, so it can be allowed throughout documentation.

This is not a correct assumption in general. My example was a link that has nothing to do with my company or organization, and that involves a "word" that is unlikely to appear anywhere else in the rest of our repository, and is a plausible misspelling outside of URLs.

Okay, but an ignore list specific to URLs would still cover this use case wouldn't it? With the benefit it would actually trap genuine typos within URLs, or worse mistakes which mean links point to dubious sites which are typo-squatting.

In the URL of this page, I see https, www, github, com, and codespell, all of which are not words.

Although they aren't words, they also aren't typos though (remembering #1535).

zygoloid · 2020-07-16T16:01:38Z

Outside of URLs, "com" is probably a typo for "come".

peternewman · 2020-07-17T15:47:39Z

Outside of URLs, "com" is probably a typo for "come".

I had a look at quite a few pages of grep.app and didn't find that typo, most usages were COM in terms of communications port. We don't have any typos in the dictionary for it either.

jonmeow · 2020-07-22T16:52:55Z

Would it make sense if I switched my PR around to have a dedicated regexp for emails/urls, and a separate flag like -L to only ignore certain misspellings in the email/uri? Maybe with support like "--ignore-in-uri=*" to ignore everything if a user wants?

peternewman · 2020-07-22T22:06:51Z

Would it make sense if I switched my PR around to have a dedicated regexp for emails/urls,

Yeah that would be my preference @jonmeow ; actually why not pivot, strip the URI stuff out of your existing PR and re-purpose it as a generic ignore regex option, for example I guess someone may want to ignore MACs or UUIDs or something, as they may get misinterpreted as words (e.g. ba->by, be,), rather than throwing away some perfectly good code.

and a separate flag like -L to only ignore certain misspellings in the email/uri? Maybe with support like "--ignore-in-uri=*" to ignore everything if a user wants?

I suspect it would make sense to just prefix the new options with uri, perhaps just start off with --uri-ignore-words-list, but it gives space for someone to implement --uri-ignore-words if people want it in a file in future, etc as well as any other options in a logical way.

peternewman · 2020-07-24T12:27:59Z

I suspect it would make sense to just prefix the new options with uri, perhaps just start off with --uri-ignore-words-list, but it gives space for someone to implement --uri-ignore-words if people want it in a file in future, etc as well as any other options in a logical way.

The dictionary option is also potentially likely to be up there as being a useful extra.

jonmeow · 2020-07-27T18:51:34Z

Stemming from comments on #1592, I want to double-check my plans:

Put in a regex for URIs. It'll only change behavior if a relevant flag is set.
Add --uri-ignore-words-list with semantics mirroring --ignore-words-list.

Intended matches will be, roughly:

regular URIs:
- http://foo
- http://foo.com
- http://foo.com/path/to/resource?args=vals
- ipv4 and ipv6 domains for http
- https, https, ftp, tftp, file, git, smb
emails:
- foo@bar.com
- foo@bar -- I still have some hesitance about this, since I've never seen it and feel I've seen @ used other ways, but will include per request.
- Ensure it works with an optional mailto

If you have a preference for what I do as far as a regex, please let me know -- I may see if something really loose, like \W((?:https?|t?ftp|file|git|smb)://\S*), gives reasonable results since I think this doesn't need to be a URI validator.

peternewman · 2021-06-01T23:23:07Z

Given #1656 is now in thanks to @jonmeow 's efforts, I think this issue can now be closed.

LennyPhoenix · 2024-04-23T16:57:20Z

Just installed fresh and I'm still getting spelling lints in URLs, even though this is closed?

(This is a markdown file)

jonmeow · 2024-04-23T17:07:55Z

@LennyPhoenix This is not a default behavior, it is something you must opt into. For example, --uri-ignore-words-list=* disables all spelling suggestions in URIs. See codespell -h for flag documentation.

LennyPhoenix · 2024-04-23T17:08:55Z

Understood, thanks!

geor-g mentioned this issue Oct 15, 2018

False positive on email addresses #708

Open

jonmeow mentioned this issue Jul 8, 2020

Add --ignore-regex #1592

Merged

peternewman mentioned this issue Jul 15, 2020

Typo in examples Azure-Samples/AzureStack-AdminPowerShell-OMSIntegration#19

Closed

jonmeow mentioned this issue Aug 28, 2020

Add support for ignoring spelling mistakes in URIs specifically. #1656

Merged

peternewman linked a pull request Jun 1, 2021 that will close this issue

Add support for ignoring spelling mistakes in URIs specifically. #1656

Merged

peternewman closed this as completed Jun 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676

Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676

cfi-gb commented Sep 25, 2018 •

edited

Loading

rgetz commented Apr 29, 2020

zygoloid commented Jul 7, 2020

peternewman commented Jul 15, 2020 •

edited

Loading

rgetz commented Jul 15, 2020

peternewman commented Jul 15, 2020

zygoloid commented Jul 15, 2020

peternewman commented Jul 16, 2020

zygoloid commented Jul 16, 2020

peternewman commented Jul 17, 2020

jonmeow commented Jul 22, 2020

peternewman commented Jul 22, 2020

peternewman commented Jul 24, 2020

jonmeow commented Jul 27, 2020 •

edited

Loading

peternewman commented Jun 1, 2021

LennyPhoenix commented Apr 23, 2024 •

edited

Loading

jonmeow commented Apr 23, 2024

LennyPhoenix commented Apr 23, 2024

Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676

Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676

Comments

cfi-gb commented Sep 25, 2018 • edited Loading

rgetz commented Apr 29, 2020

zygoloid commented Jul 7, 2020

peternewman commented Jul 15, 2020 • edited Loading

rgetz commented Jul 15, 2020

peternewman commented Jul 15, 2020

zygoloid commented Jul 15, 2020

peternewman commented Jul 16, 2020

zygoloid commented Jul 16, 2020

peternewman commented Jul 17, 2020

jonmeow commented Jul 22, 2020

peternewman commented Jul 22, 2020

peternewman commented Jul 24, 2020

jonmeow commented Jul 27, 2020 • edited Loading

peternewman commented Jun 1, 2021

LennyPhoenix commented Apr 23, 2024 • edited Loading

jonmeow commented Apr 23, 2024

LennyPhoenix commented Apr 23, 2024

cfi-gb commented Sep 25, 2018 •

edited

Loading

peternewman commented Jul 15, 2020 •

edited

Loading

jonmeow commented Jul 27, 2020 •

edited

Loading

LennyPhoenix commented Apr 23, 2024 •

edited

Loading