
Implement robots exclusion protocol (robots.txt) #182

Closed
LVerneyPEReN opened this issue Oct 21, 2020 · 8 comments

@LVerneyPEReN
Contributor

Hi,

A question has just surfaced about robots.txt files and whether the CGUs project should honor them or keep track of their status.

So far, the following script tests whether the terms URLs can be crawled by a robot (called "CGUs Bot"):

import json
import os
import urllib.robotparser
from urllib.parse import urljoin

import requests

count = 0
for file in sorted(os.listdir('services')):
    if not file.endswith('.json'):
        continue

    with open(os.path.join('services', file), 'r') as fh:
        service = json.load(fh)
        for document in service['documents']:
            count += 1

            document_url = service['documents'][document]['fetch']

            # HEAD requests do not follow redirects by default; if the terms URL
            # has moved, look up robots.txt on the target host instead.
            req = requests.head(document_url)
            if req.is_redirect:
                print('WARNING: Terms URL is a redirection: %s to %s.' % (document_url, req.headers['Location']))
                robots_url = urljoin(req.headers['Location'], '/robots.txt')
            else:
                robots_url = urljoin(document_url, '/robots.txt')

            req = requests.get(robots_url, headers={'User-Agent': 'CGUs Bot'})
            if req.status_code != 200:
                print('WARNING: Status Code %s for %s.' % (req.status_code, robots_url))
                continue

            # parse() expects a list of lines; split() would break directives apart.
            rp = urllib.robotparser.RobotFileParser()
            rp.parse(req.text.splitlines())
            # Files containing only a wildcard "User-agent: *" group end up in
            # default_entry rather than entries.
            if not rp.entries and not rp.default_entry:
                print('FAILED TO PARSE robots.txt for %s.' % robots_url)
            if not rp.can_fetch('CGUs Bot', document_url):
                print(document_url)

print('Scanned a total of %d terms URLs.' % count)

No issues were detected at this time with the existing service files.

Best,

@MattiSG
Member

MattiSG commented Oct 21, 2020

Good point.

I'd be in favour of including this check in the validation script. And we can decide what to do when a failure occurs 😉

@MattiSG
Member

MattiSG commented Oct 21, 2020

including this check in the validation script

Through an NPM module, of course.

@MattiSG
Member

MattiSG commented Oct 21, 2020

Relates to #166.

@LVerneyPEReN
Contributor Author

Sure, I just wrote it in a quick-and-dirty way in Python; I can definitely rewrite it as an NPM module.

👍 for having some form of validation. We had a discussion here about whether it should be run in the CI (typically in the validation script, and hence run only when there is a change or an explicit validation) or whether the check should be done at each fetch of the CGUs.

An alternative idea was to do the check only when a new version is saved, and to record alongside the version a "litigious" status if it should not have been crawled according to robots.txt.
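
For illustration, a minimal sketch of this idea could look like the following, kept in Python for consistency with the script above (the is_allowed_by_robots helper, the store object and the "litigious" field name are hypothetical, not part of the CGUs codebase):

import urllib.robotparser
from urllib.parse import urljoin

import requests


def is_allowed_by_robots(document_url, user_agent='CGUs Bot'):
    # Fetch and parse the robots.txt of the document's site.
    robots_url = urljoin(document_url, '/robots.txt')
    response = requests.get(robots_url, headers={'User-Agent': user_agent})
    if response.status_code != 200:
        # No usable robots.txt: consider the document crawlable.
        return True
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(response.text.splitlines())
    return parser.can_fetch(user_agent, document_url)


def save_version(document_url, content, store):
    # Record the version, flagging it as "litigious" when robots.txt
    # forbids crawling it at the time of recording.
    store.append({
        'url': document_url,
        'content': content,
        'litigious': not is_allowed_by_robots(document_url),
    })

The version itself would still be recorded; the flag would only mark that, according to robots.txt, it should not have been crawled at the time of recording.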

@MattiSG
Member

MattiSG commented Oct 21, 2020

My suggestion at this point is to add it to the validation script, which means it would run whenever validation is run, that is, either manually or whenever a change is made to the service.

I see this more as a way to assess how much of a potential problem this could be. Actually honouring robots.txt would indeed mean checking on each fetch.

@LVerneyPEReN
Contributor Author

Actually honouring robots.txt would indeed mean checking on each fetch.

Truly honoring it would indeed mean checking on each fetch. A slightly less conservative approach would be to check, each time a version is recorded, that we are indeed allowed to record it. If automated access were later forbidden, CGUs would then not honor this prohibition; the last recorded version would, however, be safe, since it was recorded in the past, at a point in time when access was permissible.

@MattiSG
Member

MattiSG commented Oct 21, 2020

If we check for each version, we're still storing the content in snapshots continuously, which would be a breach of the robots policy…

@MattiSG MattiSG changed the title Should CGUs honor robots.txt directives? Should the Open Terms Archive crawler respect the robots exclusion protocol? Apr 25, 2022
@MattiSG MattiSG changed the title Should the Open Terms Archive crawler respect the robots exclusion protocol? Should fetcher respect the robots exclusion protocol? Apr 25, 2022
@MattiSG
Member

MattiSG commented Apr 25, 2022

The robots.txt policy is undergoing formalisation at IETF under the name “Robots Exclusion Protocol (REP)”. That protocol clearly says that “crawlers MUST use the parseable rules”.

Considering Open Terms Archive’s mission of rebalancing power and our legal analysis of the validity of storing publicly available contracts on behalf of those who are affected by them, I doubt we should abide by service providers’ exclusion requirements.

On the technical side, abiding by the REP would considerably complicate the fetching process and negatively impact performance, as we would need to fetch and parse one additional page per document (or, after optimisation, per domain) before storing documents.
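
For illustration, that per-domain optimisation could look roughly like the following sketch, again in Python rather than the project's actual Node.js code (the _robots_cache, robots_for and can_fetch names are hypothetical): each domain's robots.txt would be fetched and parsed at most once per run.

import urllib.robotparser
from urllib.parse import urljoin, urlsplit

import requests

# Hypothetical cache: one parsed robots.txt per domain.
_robots_cache = {}


def robots_for(document_url, user_agent='CGUs Bot'):
    # Fetch and parse robots.txt once per domain, then reuse the parser.
    domain = urlsplit(document_url).netloc
    if domain not in _robots_cache:
        response = requests.get(urljoin(document_url, '/robots.txt'),
                                headers={'User-Agent': user_agent})
        if response.status_code == 200:
            parser = urllib.robotparser.RobotFileParser()
            parser.parse(response.text.splitlines())
            _robots_cache[domain] = parser
        else:
            _robots_cache[domain] = None  # no usable robots.txt for this domain
    return _robots_cache[domain]


def can_fetch(document_url, user_agent='CGUs Bot'):
    # Allowed when there is no robots.txt, or when its rules permit the URL.
    parser = robots_for(document_url, user_agent)
    return True if parser is None else parser.can_fetch(user_agent, document_url)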

Considering the result of @LVerneyPEReN’s interesting experiment, which is that no documents from contrib would be excluded from storage if we were to follow the REP, this technical investment seems even less worth it.

Beyond this observation, my current conclusion is that Open Terms Archive should not respect the REP, lest it risk failing its mission. There is thus no plan to include this feature. In order to improve backlog tracking, I will close this issue. Further comments, ideas, and observations are welcome in case of diverging opinions or updated data on the potential impact of REP compliance, in particular on new instances such as france 🙂

@MattiSG MattiSG closed this as completed Apr 25, 2022
@MattiSG MattiSG changed the title Should fetcher respect the robots exclusion protocol? Implement robots exclusion protocol (robots.txt) Apr 25, 2022