Links to sites that have DDoS protections are marked as failed #109

NicolasMassart · 2020-09-24T07:45:45Z

More and more sites have DDoS or anti spam protections like Cloudflare DDoS Attack Protection or Godaddy that makes link checker fail on these links (server returns 502 code for instance if you are not an actual browser)

Only way for the moment to me is to exclude the sites.

But in the future we will have more and more site using these tools and link check will be only able to check internal links.
We have to figure out if we have any technical solution to this issue and see how to implement it.

All suggestions are welcome in the comments.
Thanks.

NicolasMassart · 2020-09-25T12:33:28Z

I've been suggested to try using a real browser in these specific failure cases and https://github.com/puppeteer/puppeteer may help. To investigate.

NicolasMassart · 2020-11-02T14:27:13Z

Cloudflare for instance displays a captcha (click to say you are not a robot) to solve as a security before the page is reached.

So even with a puppeteer controlled browser it would ne help solving it. I would be too easy ;)

And we can't implement all the security methods of all web providers. Here it's a CloudFlare one but it will evolve and others will create new ways to secure, it's an endless race.

cadeef · 2020-11-20T23:57:34Z

It might be advantageous to extend MLC enable/disable via annotation to include outcomes that mark as success/fail. I could see an annotation along the lines of markdown-link-check-status: ddos as a means by which to describe an expected behavior; it would permit all other statuses to be checked. This extends to abnormal status codes too-- it would be nice to expect a link to return say a 403 since it isn't authenticated.

NicolasMassart · 2020-11-23T08:17:58Z

Is expecting a link to return a 403 a way to make the site stronger, not sure. The goal of MLC is to help having non broken links. If we hide the issue behind checking an error code that could be anything else including a real issue, that may not help. it can be a 403 because of ddos protection at first but then really become an access denied one and we won't notice.
I wonder if MLC outputting a list of links to manually check at the end would be ok. Like an exclude comment, it would be a manual check comment on a link and it would make listing of links to check possible for MLC but also would document that in the MD code. It's pretty similar to what you suggest with your annotation but will not trigger any real link checking and will never make the tests fail. It's exclude but still keep it in the test list... Not sure what I wrote is very clear though 😄

cadeef · 2020-11-23T11:31:15Z

I’d say that providing a means to define a “good” link is in the spirit of MLC. There are occasions where it is necessary to redefine good and extending the exclude framework to permit that makes the tool more flexible. I chose 403 specifically, it is a scenario in which the destination exists but MLC is unauthenticated. I actually have a use case for it personally on my family intranet site. I wish to verify that the remote document would succeed if authenticated but not add credentials to the source. There are potentially additional scenarios where defining an alternate status code might be beneficial to the user— HTTP status is a suggestion not the law.

NicolasMassart · 2020-11-23T18:14:28Z

You totally had my approval at "my family intranet site" 🤯 😉
Extending the exclusion system may be fine but then we have to make sure users understand that this could generate false negatives (instead of currently have false positives). The output in the log have to be clear that the link was not completely verified and that they should check manually to make sure.

cadeef · 2020-12-18T02:36:41Z

It might be worthwhile investigating the headers that MLC sends with each request. I had a link report "bad" (403) that definitely was not, turns out it was denied by my mod_security setup on a generic rule for not sending an Accept header (log entry sanitized):

[Thu Dec 17 19:48:52.040963 2020] [:error] [pid 11065:tid 139646180726528] [client 127.0.0.1:56384] [client 127.0.0.1] ModSecurity: Access denied with code 403 (phase 2). Operator EQ matched 0 at REQUEST_HEADERS. [file "/etc/httpd/modsecurity.d/activated_rules/modsecurity_crs_21_protocol_anomalies.conf"] [line "47"] [id "960015"] [rev "1"] [msg "Request Missing an Accept Header"] [severity "NOTICE"] [ver "OWASP_CRS/2.2.9"] [maturity "9"] [accuracy "9"] [tag "OWASP_CRS/PROTOCOL_VIOLATION/MISSING_HEADER_ACCEPT"] [tag "WASCTC/WASC-21"] [tag "OWASP_TOP_10/A7"] [tag "PCI/6.5.10"] [hostname "example.com"] [uri "/do-not-track.txt"] [unique_id "X9wKhFxlGZj7qGQp7uBz5wAAAJU"]

NicolasMassart · 2020-12-18T10:39:19Z

@cadeef could you post what your modsec would have accepted?

cadeef · 2020-12-18T22:07:38Z

The issue was the Accept header was missing from the request completely. A standard Accept: */* from MLC would likely do the trick. MDN Header Reference

NicolasMassart · 2021-01-04T09:30:34Z

Should we modify link-check to add this as a default or only document it here for users to add the header in options if needed? (I tend to prefer adding the default in link-check)

NicolasMassart · 2021-01-04T12:08:38Z

But anyway it doesn't seem to be enough (would be too easy) for Cloudflare DDoS protection se for instance on this page https://metamask.zendesk.com/hc/en-us/articles/360015488991-Sending-Ether-New-UI-
There's no easy way obviously...

We need to keep improving Web fabric as broken links are bad for the Web, it makes its original purpose fail. Having a free interconnected network where everyone is equally able to access data... it's less and less true every day, but I keep thinking we have to help finding ways to prevent total failure to happen.

Here are some thoughts about things we might do here (or in Link-check):

When the link returns a 403, try to fallback to a centralised big company API https://developers.google.com/webmaster-tools/search-console-api/v1/getting-started#curl-usage
add a warning level to the logs. Will not fail the build but will display a list of links to manually check.

cadeef · 2021-01-07T16:56:00Z

Should have been more clear in my initial comment, I was doubtful the change would make a difference with Cloudflare. Projects of MLC's nature will continually be on the losing end of the request inspection cat and mouse.

Shoring up the request profile for MLC would likely eliminate some existing edge cases (like mine with standard mod_sec rules), but it will always be a chase. WAFs evolve constantly by design.

I'd argue it's worthwhile, at least in the short term, to patch low hanging fruit that maintains the standard user experience without additional annotation.

An exception framework, with toggleable handling, would be welcome as I've eluded before, but it creates additional complexity. Care in implementation would be necessary to ensure that simplicity of the primary cause (dead simple, bolt-on markdown link checking) is maintained.

I like the idea of optionally punting to a headless third-party service in the event of failure, but it opens up a can of worms (availability, privacy, support) that would need to be carefully navigated.

NicolasMassart · 2021-01-07T17:50:57Z

I agree with your comments. DDoS protection on websites is something I have more and more issues with though... Maybe it's specific to the domain I work in (blockchain) but I think it's, like spam, something we will have to live with... and it will break the basic fabric of the Web. Is it the role of MLC to fix this? Absolutely not, of course, but can we provide tools to help users deal with that? I think so.
What tools? I'm not sure.
Third party api use could for instance be optional and disabled by default. It may be an addon or something that require to accept the privacy impact and provide an api key anyway (so accepting the API terms of use).

I think MLC has to evolve anyway. If we want it to remain a small script, we will end with something that will be useless at some point. If we build a proper software, with extension capabilities by design, we will be able to improve and follow users needs.
I created a discussion about roadmap #147
I can't force anyone to participate, but my opinion is that if we don't discuss the goals for MLC, we will only be able to deal with bug issues and some very use case specific improvements.
We need to lead this as a real product and don't be afraid to refactor. We can try to keep backward compatibility, but that's not a requirement for me if we provide at least the same features and even better ones. If the config file format has to change for that, I think it's not a blocker for many people.

sanmai-NL · 2022-02-09T17:06:10Z

Resolving #111 would help.

Kurt-von-Laven · 2022-04-14T19:13:09Z

As an example of what you all long ago predicted, all links to GitHub Docs (https://docs.github.com/*) started failing for me within the past 13 hours with 403s. I am running markdown-link-check 3.10.0, and the issue is not related to a change in the version of markdown-link-check, so I assume GitHub enhanced their DDoS protections. I am a fan of the proposal to support expecting a 403 but an even bigger fan of the idea of escalation to a headless browser (and, if warranted, additionally escalating from a headless browser to a full browser). I don't think anyone would argue that Puppeteer is a panacea to the inherent arms race markdown-link-check finds itself faced with, but it has far more resources to keep up with changing expectations. I may have missed some pertinent issues, but these were the only two hits for DDoS in their issue tracker:

As suggested in the latter, puppeteer-extra-plugin-stealth can be used to dodge most DDoS protections. As an overall strategy, consolidating around a single (or small number of) large project(s) seems a solid one for the open-source community since we are much stronger together than separate.

Kurt-von-Laven · 2022-08-11T23:03:32Z

I was wrong. The issue with links to GitHub Docs had nothing to do with DDoS, but rather GitHub evidently started requiring that clients accept compression. This .markdown-link-check.json resolved the issue for us.

Adding the "https://www.intel.com/content/www/us/en/developer/articles/news/llama2.html" link to the ignore pattern for the link check file as a workaround for this particular link returning a 403 error. Not an elegant solution but was suggested here: tcort/markdown-link-check#109 Will keep investigating to see if there is a better work around

bobrik · 2024-10-01T18:04:55Z

I hit this with an origin backed by Netlify. They return 403 for User-agent: link-check/5.3.0 header:

$ curl -s -H "user-agent: link-check/5.3.0" -I https://ebpf.io/what-is-ebpf/ | head -n1
HTTP/2 403

The following in the config fixed it:

    "httpHeaders": [
        {
            "urls": [
                "https://"
            ],
            "headers": {
                "user-agent": "pls let me in"
            }
        }
    ],

NicolasMassart self-assigned this Sep 24, 2020

NicolasMassart added enhancement New feature request or implementation help needed I need someone to have a look at this, review, provide inputs, help to fix,... question Issue is a user support request labels Sep 24, 2020

This was referenced Sep 26, 2020

link-check triggers Error from cloudfront #25

Closed

link-check triggers GoDaddy DoS protection #23

Closed

NicolasMassart pinned this issue Oct 1, 2020

NicolasMassart removed the question Issue is a user support request label Oct 2, 2020

NicolasMassart added the hacktoberfest label Oct 12, 2020

NicolasMassart removed the hacktoberfest label Nov 2, 2020

NicolasMassart mentioned this issue Nov 2, 2020

URL in front matters yaml/toml/json are wrongly picked as markdown URLs #128

Closed

NicolasMassart mentioned this issue Nov 2, 2020

Remove Metamask links for excluded list hyperledger/besu-docs#537

Closed

nicholaswilde mentioned this issue Apr 16, 2021

[lint] markdown-link-check Amazon Link Fails nicholaswilde/home-cluster#17

Open

anantshri mentioned this issue Oct 31, 2021

Update mlc_config.json OWASP/owasp-mastg#1958

Merged

effigies mentioned this issue Jan 18, 2022

add SAMRI, update BEP001 link bids-standard/bids-website#222

Merged

brian926 mentioned this issue Nov 29, 2022

Ignore Opensoure.org from Link Check vhive-serverless/vSwarm#381

Merged

aofarrel mentioned this issue Feb 23, 2023

Add cromwell version 84 to table for 1.14 dockstore/dockstore-documentation#234

Merged

wermos mentioned this issue Dec 7, 2023

link check returns 403 for Intel websites #285

Open

CheneyYin mentioned this issue Dec 15, 2023

[Bug] [CI] Dead Links checker failure apache/seatunnel#6015

Closed

3 tasks

humblejay mentioned this issue Apr 29, 2024

[Question/Feedback]: Code Review - Linting & Link Checks / Markdown Link Check (pull_request) Failing Azure/azure-monitor-baseline-alerts#204

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Links to sites that have DDoS protections are marked as failed #109

Links to sites that have DDoS protections are marked as failed #109

NicolasMassart commented Sep 24, 2020 •

edited

Loading

NicolasMassart commented Sep 25, 2020

NicolasMassart commented Nov 2, 2020 •

edited

Loading

cadeef commented Nov 20, 2020 •

edited

Loading

NicolasMassart commented Nov 23, 2020

cadeef commented Nov 23, 2020 •

edited

Loading

NicolasMassart commented Nov 23, 2020

cadeef commented Dec 18, 2020

NicolasMassart commented Dec 18, 2020

cadeef commented Dec 18, 2020

NicolasMassart commented Jan 4, 2021

NicolasMassart commented Jan 4, 2021

cadeef commented Jan 7, 2021

NicolasMassart commented Jan 7, 2021 •

edited

Loading

sanmai-NL commented Feb 9, 2022

Kurt-von-Laven commented Apr 14, 2022

Kurt-von-Laven commented Aug 11, 2022

bobrik commented Oct 1, 2024

Links to sites that have DDoS protections are marked as failed #109

Links to sites that have DDoS protections are marked as failed #109

Comments

NicolasMassart commented Sep 24, 2020 • edited Loading

NicolasMassart commented Sep 25, 2020

NicolasMassart commented Nov 2, 2020 • edited Loading

cadeef commented Nov 20, 2020 • edited Loading

NicolasMassart commented Nov 23, 2020

cadeef commented Nov 23, 2020 • edited Loading

NicolasMassart commented Nov 23, 2020

cadeef commented Dec 18, 2020

NicolasMassart commented Dec 18, 2020

cadeef commented Dec 18, 2020

NicolasMassart commented Jan 4, 2021

NicolasMassart commented Jan 4, 2021

cadeef commented Jan 7, 2021

NicolasMassart commented Jan 7, 2021 • edited Loading

sanmai-NL commented Feb 9, 2022

Kurt-von-Laven commented Apr 14, 2022

Kurt-von-Laven commented Aug 11, 2022

bobrik commented Oct 1, 2024

NicolasMassart commented Sep 24, 2020 •

edited

Loading

NicolasMassart commented Nov 2, 2020 •

edited

Loading

cadeef commented Nov 20, 2020 •

edited

Loading

cadeef commented Nov 23, 2020 •

edited

Loading

NicolasMassart commented Jan 7, 2021 •

edited

Loading