This repository has been archived by the owner on Jul 5, 2019. It is now read-only.

Follow robots.txt yes/no #9

Open
eklem opened this issue Nov 4, 2014 · 2 comments

Comments

@eklem
Collaborator

eklem commented Nov 4, 2014

-f --followrobotstxt <yes/no>: whether the fetcher should play nice and obey robots.txt, or ignore it

@eklem
Collaborator Author

eklem commented Nov 4, 2014

I guess there are two things to check for.
1: The user agent: whether it matches a specific User-agent entry in robots.txt, or whether the wildcard * applies.
2: Build an array of the parts of the site that must not be followed, and check each link the crawler wants to follow against this array.
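
The two checks above could be sketched roughly like this. This is only an illustration in Python, not the fetcher's actual code: the function names `parse_disallowed` and `is_allowed` are hypothetical, the user-agent match is a simple case-insensitive substring test, and only `Disallow` prefixes are handled (no `Allow` rules or wildcards in paths).

```python
from urllib.parse import urlsplit


def parse_disallowed(robots_txt, user_agent):
    """Return the Disallow prefixes that apply to user_agent.

    Hypothetical helper: picks the longest specific User-agent token
    contained in user_agent, otherwise falls back to the "*" group.
    """
    groups = {}          # lowercased agent token -> list of disallowed prefixes
    current_agents = []  # agents named by the group being read
    in_rules = False     # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            if value:  # an empty Disallow means "nothing is disallowed"
                for agent in current_agents:
                    groups[agent].append(value)
        elif field == "allow":
            in_rules = True  # Allow rules are ignored in this sketch

    ua = user_agent.lower()
    specific = [a for a in groups if a and a != "*" and a in ua]
    if specific:
        return groups[max(specific, key=len)]
    return groups.get("*", [])


def is_allowed(url, disallowed):
    """Check one link against the array of disallowed path prefixes."""
    path = urlsplit(url).path or "/"
    return not any(path.startswith(prefix) for prefix in disallowed)


robots = """
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: examplebot
Disallow: /
"""
rules = parse_disallowed(robots, "my-fetcher/0.1")  # made-up user agent
print(is_allowed("http://example.com/private/a.html", rules))
print(is_allowed("http://example.com/public/b.html", rules))
```

A real implementation would probably lean on an existing robots.txt parser instead; Python, for instance, ships one as `urllib.robotparser`.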

@eklem
Collaborator Author

eklem commented Nov 4, 2014

And default to "yes". The user-agent string connects to this, but it isn't needed in order to develop this one.
#10
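
For illustration, the proposed flag with its "yes" default might look like the sketch below. It uses Python's `argparse` purely as an example; the program name `fetcher` and the helper `follow_robots` are assumptions, not part of the project.

```python
import argparse


def build_parser():
    # "fetcher" is a placeholder program name for this sketch.
    parser = argparse.ArgumentParser(prog="fetcher")
    parser.add_argument(
        "-f", "--followrobotstxt",
        choices=["yes", "no"],
        default="yes",  # play nice unless the user explicitly opts out
        help="obey robots.txt (yes) or ignore it (no)",
    )
    return parser


def follow_robots(argv):
    """Return True when the fetcher should obey robots.txt."""
    args = build_parser().parse_args(argv)
    return args.followrobotstxt == "yes"


print(follow_robots([]))            # flag omitted, default applies
print(follow_robots(["-f", "no"]))  # explicit opt-out
```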
