Don't report spelling mistakes in HTTP URLs (e.g. in comments) #676
Comments
Yeah, I see the same with version 1.16.0
produces
which would obviously break the URL.
Similarly this fires for URLs in Markdown sources. For example:
... reports that "ist" is a spelling error. It seems like it should be straightforward to ignore URLs with common protocols (eg, skip from |
Thanks to @jonmeow for his work on this in #1592. Initially I thought this was really neat and a nice way to fix it, but after pondering a bit more, it only took a few tries with our list of Microsoft typos to find some on grep.app that were genuine typos (as opposed to all the phishing attempts), for example this one: Wouldn't it be better if they were caught in your documentation? I wonder if the better solution is a URI/email-specific skip list, so you can tell it to ignore "ist" or "cas" in those places, but still find the other places where you meant "its" or "case"?
In this specific example, the typo has since been correctly moved to the GB to US dictionary.
But that's not the fix - "mitre" is the name of a company. It's not an "re" to "er" issue. https://www.mitre.org/about/corporate-overview How about adding an option to check URLs with urllib? That is the purpose of codespell - checking for correctness. Spelling is the main thing, but for URLs spelling doesn't matter - few web sites have proper spelling in their URLs - it's the response code that matters (do you get a 404 or something else?). Just a thought. It eliminates the possibility of offline checking, but who is offline anymore?
I know, my point was codespell was over-eager there, it isn't any more. I'd imagine in most cases you're probably referring to your own company or organisation, so it can be allowed throughout documentation.
I'd be amazed if that doesn't exist already, if not, to me at least, it should be another project.
I'd actually disagree there. While I was trying to find some examples, I kept stumbling across other examples of typos: Looking at them, I suspect a good chunk of them on a few of those repos are for nefarious purposes, so I'm sure the people behind them go to a lot of effort to ensure you get a 200 regardless of whatever URL you try to browse to on their website, and then serve up something nasty. urllib won't catch that. Also, on the practical front, I'm involved in two projects which check URLs as part of their validation, and you'll get false positives half the time:
https://www.mitre.org/about/corporate-overview Your example, and the URL of this page, beg to differ!
It could always be an option and fall back gracefully.
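A minimal sketch of what such an optional check with a graceful fallback might look like, using only `urllib` from the standard library. The function name `url_status` and its `opener` parameter are hypothetical and not part of codespell; the `opener` hook exists only so the network call can be swapped out when offline.

```python
import urllib.error
import urllib.request


def url_status(url, opener=urllib.request.urlopen, timeout=5):
    """Return the HTTP status code for `url`, or None if it cannot be checked.

    Hypothetical helper for illustration only; codespell has no such function.
    """
    try:
        # A HEAD request avoids downloading the body.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # Offline, DNS failure, timeout, or malformed URL: skip the check.
        return None
```

As the reply above points out, a 200 response does not show that a link is the one the author intended, so a check like this would complement spell checking rather than replace it.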
This is not a correct assumption in general. My example was a link that has nothing to do with my company or organization, and that involves a "word" that is unlikely to appear anywhere else in the rest of our repository, and is a plausible misspelling outside of URLs.
In the URL of this page, I see |
Okay, but an ignore list specific to URLs would still cover this use case wouldn't it? With the benefit it would actually trap genuine typos within URLs, or worse mistakes which mean links point to dubious sites which are typo-squatting.
They aren't words, but they also aren't typos (remembering #1535).
Outside of URLs, "com" is probably a typo for "come".
I had a look at quite a few pages of grep.app and didn't find that typo; most usages were COM in the sense of communications port. We don't have any typos in the dictionary for it either.
Would it make sense if I switched my PR around to have a dedicated regexp for emails/URLs, and a separate flag like -L to only ignore certain misspellings in the email/URI? Maybe with support like "--ignore-in-uri=*" to ignore everything if a user wants?
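The proposal above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `URI_RE`, `MISSPELLINGS`, and `find_typos` are hypothetical names, the pattern is deliberately loose, and the misspelling table is a toy stand-in for codespell's dictionaries.

```python
import re

# Assumption: a loose pattern for http/https/ftp URLs and email addresses.
URI_RE = re.compile(
    r"\b(?:https?|ftp)://[^\s<>)\]]+|\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    re.IGNORECASE,
)
WORD_RE = re.compile(r"[A-Za-z]+")

# Toy misspelling table, for illustration only.
MISSPELLINGS = {"ist": "its", "cas": "case", "teh": "the"}


def find_typos(line, uri_ignore=frozenset()):
    """Return misspelled words in `line`.

    Words listed in `uri_ignore` are skipped only when they occur inside
    a URI or email address; "*" skips every word inside a URI.
    """
    uri_spans = [m.span() for m in URI_RE.finditer(line)]
    hits = []
    for m in WORD_RE.finditer(line):
        word = m.group().lower()
        if word not in MISSPELLINGS:
            continue
        in_uri = any(s <= m.start() and m.end() <= e for s, e in uri_spans)
        if in_uri and ("*" in uri_ignore or word in uri_ignore):
            continue
        hits.append(word)
    return hits


line = "teh docs: https://example.com/cas/teh and cas here"
print(find_typos(line, uri_ignore={"cas"}))  # ['teh', 'teh', 'cas']
print(find_typos(line, uri_ignore={"*"}))    # ['teh', 'cas']
```

With a per-word list, "cas" is tolerated inside the URL while "teh" in the URL path and "cas" in the prose are still reported, which is the behaviour argued for earlier in the thread.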
Yeah, that would be my preference, @jonmeow. Actually, why not pivot: strip the URI stuff out of your existing PR and re-purpose it as a generic ignore regex option? For example, I guess someone may want to ignore MACs or UUIDs or something, as they may get misinterpreted as words (e.g.
I suspect it would make sense to just prefix the new options with uri, perhaps just start off with |
The dictionary option would probably also be a useful extra.
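The generic ignore-regex idea suggested above could look something like this sketch. The names `check_line` and `MAC_RE` and the misspelling table are assumptions made for illustration; nothing here is taken from codespell's implementation.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")

# Assumption: a simple pattern for colon-separated MAC addresses.
MAC_RE = r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}"


def check_line(line, misspellings, ignore_regex=None):
    """Return misspelled words, after blanking out text matching `ignore_regex`."""
    if ignore_regex is not None:
        # Replace each match with a space so neighbouring words stay separated.
        line = re.sub(ignore_regex, " ", line)
    return [w for w in WORD_RE.findall(line) if w.lower() in misspellings]


toy = {"ba": "by", "teh": "the"}  # toy table, for illustration only
line = "teh device 00:1b:ba:dd:ee:ff is up"
print(check_line(line, toy))                       # ['teh', 'ba']
print(check_line(line, toy, ignore_regex=MAC_RE))  # ['teh']
```

Because the pattern is supplied by the user, the same mechanism would cover UUIDs, hashes, or URIs without the checker needing to know about any of them.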
Stemming from comments on #1592, I want to double-check my plans:
Intended matches will be, roughly:
If you have a preference for what I do as far as a regex, please let me know -- I may see if something really loose, like |
@LennyPhoenix This is not a default behavior, it is something you must opt into. For example, |
Understood, thanks!
Running codespell 1.13.0 against a file containing e.g. a comment like:
is showing a spelling mistake for "cas". Codespell could ignore such URLs as (from my experience) most of these are false positives.
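As a rough sketch of the request, URLs can be blanked out before a line is split into words. The pattern and the helper name `words_outside_urls` are assumptions for illustration, and the URL in the example is made up, since the original comment's URL is not shown here.

```python
import re

# Assumption: a deliberately loose pattern for http/https/ftp URLs.
URL_RE = re.compile(r"\b(?:https?|ftp)://[^\s<>)\]]+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z']+")


def words_outside_urls(line):
    """Return the words of `line`, skipping anything inside a URL."""
    # Overwrite each URL with spaces so column positions are preserved.
    masked = URL_RE.sub(lambda m: " " * len(m.group()), line)
    return WORD_RE.findall(masked)


print(words_outside_urls("# see https://example.com/cas/login for details"))
# ['see', 'for', 'details']
```

Only the remaining words would then be looked up in the misspelling dictionary, so "cas" inside the URL is never reported.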