
Implement robots exclusion protocol (robots.txt) #182

Closed
LVerneyPEReN opened this issue Oct 21, 2020 · 8 comments

@LVerneyPEReN
Contributor

Hi,

A question has just surfaced about robots.txt files and whether the CGUs project should honor them or keep track of their status.

So far, the following script tests whether the terms URLs can be crawled by a robot (called "CGUs Bot"):

import json
import os
import urllib.robotparser
from urllib.parse import urljoin

import requests

count = 0
for file in sorted(os.listdir('services')):
    if not file.endswith('.json'):
        continue

    with open(os.path.join('services', file), 'r') as fh:
        service = json.load(fh)
        for document in service['documents']:
            count += 1

            document_url = service['documents'][document]['fetch']

            # HEAD requests do not follow redirects by default; if the terms URL
            # has moved, look up robots.txt on the target host instead.
            req = requests.head(document_url)
            if req.is_redirect:
                print('WARNING: Terms URL is a redirection: %s to %s.' % (document_url, req.headers['Location']))
                robots_url = urljoin(req.headers['Location'], '/robots.txt')
            else:
                robots_url = urljoin(document_url, '/robots.txt')

            req = requests.get(robots_url, headers={'User-Agent': 'CGUs Bot'})
            if req.status_code != 200:
                print('WARNING: Status Code %s for %s.' % (req.status_code, robots_url))
                continue

            # parse() expects a list of lines; split() would break directives apart.
            rp = urllib.robotparser.RobotFileParser()
            rp.parse(req.text.splitlines())
            # Files containing only a wildcard "User-agent: *" group end up in
            # default_entry rather than entries.
            if not rp.entries and not rp.default_entry:
                print('FAILED TO PARSE robots.txt for %s.' % robots_url)
            if not rp.can_fetch('CGUs Bot', document_url):
                print(document_url)

print('Scanned a total of %d terms URLs.' % count)

No issues were detected at this time with the existing service files.

Best,

@MattiSG
Member

MattiSG commented Oct 21, 2020

Good point.

I'd be in favour of including this check in the validation script. And we can decide what to do when a failure occurs 😉

@MattiSG
Member

MattiSG commented Oct 21, 2020

including this check in the validation script

Through an NPM module, of course.

@MattiSG
Member

MattiSG commented Oct 21, 2020

Relates to #166.

@LVerneyPEReN
Contributor Author

Sure, I just wrote it in a quick-and-dirty way in Python; I can definitely rewrite it as an NPM module.

👍 for having some form of validation. We had a discussion here about whether it should be run in the CI (typically in the validation script, and hence run only when there is a change or an explicit validation) or whether the check should be done at each fetch of the CGUs.

An alternative idea was to do the check only when a new version is saved, and to record alongside the version a "litigious" status if it should not have been crawled according to robots.txt.
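
For illustration, a minimal sketch of this idea could look like the following, kept in Python for consistency with the script above (the is_allowed_by_robots helper, the store object and the "litigious" field name are hypothetical, not part of the CGUs codebase):

import urllib.robotparser
from urllib.parse import urljoin

import requests


def is_allowed_by_robots(document_url, user_agent='CGUs Bot'):
    # Fetch and parse the robots.txt of the document's site.
    robots_url = urljoin(document_url, '/robots.txt')
    response = requests.get(robots_url, headers={'User-Agent': user_agent})
    if response.status_code != 200:
        # No usable robots.txt: consider the document crawlable.
        return True
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(response.text.splitlines())
    return parser.can_fetch(user_agent, document_url)


def save_version(document_url, content, store):
    # Record the version, flagging it as "litigious" when robots.txt
    # forbids crawling it at the time of recording.
    store.append({
        'url': document_url,
        'content': content,
        'litigious': not is_allowed_by_robots(document_url),
    })

The version itself would still be recorded; the flag would only mark that, according to robots.txt, it should not have been crawled at the time of recording.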

@MattiSG
Member

MattiSG commented Oct 21, 2020

My suggestion at this point is to add it to the validation script, which means it would run whenever validation is run, that is, either manually or whenever a change is made to the service.

I see this more as a way to assess how much of a potential problem this could be. Actually honouring robots.txt would indeed mean checking on each fetch.

@LVerneyPEReN
Contributor Author

Actually honouring robots.txt would indeed mean checking on each fetch.

Truly honoring it would indeed mean checking on each fetch. A slightly less conservative approach would be to check, each time a version is recorded, that we are indeed allowed to record it. If automated access were later forbidden, CGUs would then not honor this prohibition; the last recorded version would, however, be safe, since it was recorded in the past, at a point in time when access was permissible.

@MattiSG
Member

MattiSG commented Oct 21, 2020

If we check for each version, we're still storing the content in snapshots continuously, which would be a breach of the robots policy…

@MattiSG MattiSG changed the title Should CGUs honor robots.txt directives? Should the Open Terms Archive crawler respect the robots exclusion protocol? Apr 25, 2022
@MattiSG MattiSG changed the title Should the Open Terms Archive crawler respect the robots exclusion protocol? Should fetcher respect the robots exclusion protocol? Apr 25, 2022
@MattiSG
Member

MattiSG commented Apr 25, 2022

The robots.txt policy is undergoing formalisation at IETF under the name “Robots Exclusion Protocol (REP)”. That protocol clearly says that “crawlers MUST use the parseable rules”.

Considering Open Terms Archive’s mission of rebalancing power and our legal analysis of the validity of storing publicly available contracts on behalf of those who are affected by them, I doubt we should abide by service providers’ exclusion requirements.

On the technical side, abiding by the REP would considerably complicate the fetching process and negatively impact performance, as we would need to fetch and parse one additional page per document (or, after optimisation, per domain) before storing documents.
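
For illustration, that per-domain optimisation could look roughly like the following sketch, again in Python rather than the project's actual Node.js code (the _robots_cache, robots_for and can_fetch names are hypothetical): each domain's robots.txt would be fetched and parsed at most once per run.

import urllib.robotparser
from urllib.parse import urljoin, urlsplit

import requests

# Hypothetical cache: one parsed robots.txt per domain.
_robots_cache = {}


def robots_for(document_url, user_agent='CGUs Bot'):
    # Fetch and parse robots.txt once per domain, then reuse the parser.
    domain = urlsplit(document_url).netloc
    if domain not in _robots_cache:
        response = requests.get(urljoin(document_url, '/robots.txt'),
                                headers={'User-Agent': user_agent})
        if response.status_code == 200:
            parser = urllib.robotparser.RobotFileParser()
            parser.parse(response.text.splitlines())
            _robots_cache[domain] = parser
        else:
            _robots_cache[domain] = None  # no usable robots.txt for this domain
    return _robots_cache[domain]


def can_fetch(document_url, user_agent='CGUs Bot'):
    # Allowed when there is no robots.txt, or when its rules permit the URL.
    parser = robots_for(document_url, user_agent)
    return True if parser is None else parser.can_fetch(user_agent, document_url)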

Considering the result of @LVerneyPEReN’s interesting experiment, which is that no documents from contrib would be excluded from storage if we were to follow the REP, this technical investment seems even less worth it.

Beyond this observation, my current conclusion is that Open Terms Archive should not respect the REP, lest it risk failing its mission. There is thus no plan to include this feature. In order to improve backlog tracking, I will close this issue. Further comments, ideas, and observations are welcome in case of diverging opinions or updated data on the potential impact of REP compliance, in particular on new instances such as france 🙂

@MattiSG MattiSG closed this as completed Apr 25, 2022
@MattiSG MattiSG changed the title Should fetcher respect the robots exclusion protocol? Implement robots exclusion protocol (robots.txt) Apr 25, 2022