Implement robots exclusion protocol (robots.txt) #182

Comments
Good point. I'd be in favour of including this check in the validation script. And we can decide what to do when there is a failure 😉
Through an npm module, of course.
Relates to #166.
Sure, I just wrote it in a quick-and-dirty way in Python; I can definitely rewrite it as an npm module. 👍 for having some form of validation. We had a discussion here about whether it should be run in the CI (typically in the validation script, and hence run only when there is a change or an explicit validation) or whether the check should be done at each fetch of the CGUs. We had an alternative idea of doing the check only when a new version is saved, and registering along with the version a "litigious" status if it should not have been crawled according to robots.txt.
My suggestion at the time is to add it to the validation script, which means it would run whenever validation is run, that is, either manually or whenever a change is made to the service. I see this more as a way to assess how much of a potential problem this could be. Actually honouring robots.txt would mean checking it at each fetch.
Truly honoring it would indeed mean checking on each fetch. A slightly less conservative approach was to check, at each recorded version, that we are indeed allowed to record it. Then, if accessing it automatically is later forbidden, CGUs would indeed not honor that prohibition; the last recorded version would however be safe, since it was recorded in the past, at a point in time when it was admissible.
If we check for each version, we're still storing the content in snapshots continuously, which would be a breach of the robots exclusion protocol anyway.
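To make the "check only when a new version is saved" idea concrete, here is a minimal sketch in Python. The `record_version` function and the `store` object are hypothetical stand-ins for the actual recording code; only the robots.txt check itself relies on the standard library.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "CGUs Bot"

def is_crawl_allowed(document_url, user_agent=USER_AGENT):
    """Check the host's robots.txt to see whether this URL may be fetched by our bot."""
    parts = urlsplit(document_url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()  # fetch and parse robots.txt
    except OSError:
        return True  # robots.txt unreachable: assume crawling is allowed
    return parser.can_fetch(user_agent, document_url)

def record_version(document_url, content, store):
    """Hypothetical recording step: store the version anyway, but flag it as
    "litigious" when robots.txt forbids crawling it."""
    store.save(url=document_url, content=content,
               litigious=not is_crawl_allowed(document_url))
```

This keeps every version while making it possible to filter or re-examine the flagged ones later.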
Considering Open Terms Archive's mission of rebalancing power, and our legal analysis of the validity of storing publicly available contracts on behalf of those who are affected by them, I doubt we should abide by service providers' exclusion requirements.

On the technical side, abiding by the REP would considerably complicate the fetching process and negatively impact performance, as we would need to fetch and parse one additional page per document (or, after optimisation, per domain) before storing documents.

The result of @LVerneyPEReN's interesting experiment is that no documents from the currently declared services are disallowed. Beyond this observation, my current conclusion is that Open Terms Archive should not respect the REP, lest it risk failing its mission. There is thus no plan to include this feature. In order to improve backlog tracking, I will close this issue. Further comments, ideas, and observations are welcome in case of diverging opinions or updated data on the potential impact of REP compliance, in particular on new instances.
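For completeness, the "per domain" optimisation mentioned above boils down to caching one parsed robots.txt per origin. A purely illustrative sketch, using Python's standard urllib.robotparser rather than the project's actual fetcher:

```python
from functools import lru_cache
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

@lru_cache(maxsize=None)
def robots_for(origin):
    """Fetch and parse robots.txt once per origin (scheme + host)."""
    parser = RobotFileParser(f"{origin}/robots.txt")
    parser.read()
    return parser

def can_fetch(document_url, user_agent="CGUs Bot"):
    """Check a document URL against the cached rules of its origin."""
    parts = urlsplit(document_url)
    return robots_for(f"{parts.scheme}://{parts.netloc}").can_fetch(user_agent, document_url)
```

With this caching, honouring the REP costs at most one extra request per domain per crawl rather than one per document.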
Original issue
Hi,
A question has just surfaced about the robots.txt files and whether the CGUs project should honor them / keep track of the robots.txt status.

So far, the following script does the job of testing whether the URLs can be crawled by a robot (called "CGUs Bot"):
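A minimal sketch of such a check, assuming service declarations are JSON files under a services/ directory with each document's URL in a fetch field (these layout details are assumptions, not taken from the original script):

```python
import json
from pathlib import Path
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "CGUs Bot"

def declared_urls(services_dir="services"):
    """Yield every document URL declared in the service JSON files."""
    for declaration in Path(services_dir).glob("*.json"):
        data = json.loads(declaration.read_text(encoding="utf-8"))
        for document in data.get("documents", {}).values():
            if isinstance(document, dict) and "fetch" in document:
                yield document["fetch"]

def check_robots(urls, user_agent=USER_AGENT):
    """Print the URLs that the host's robots.txt disallows for our user agent."""
    parsers = {}  # one parsed robots.txt per origin
    for url in urls:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in parsers:
            parser = RobotFileParser(f"{origin}/robots.txt")
            try:
                parser.read()
            except OSError:
                parser = None  # unreachable robots.txt: skip the check
            parsers[origin] = parser
        parser = parsers[origin]
        if parser is not None and not parser.can_fetch(user_agent, url):
            print(f"Disallowed by robots.txt: {url}")

if __name__ == "__main__":
    check_robots(declared_urls())
```

Checking one robots.txt per host rather than per document keeps the run fast even as the number of tracked documents grows.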
No issues have been detected at this time with the already existing service files.
Best,