disallow bug #83

Open · Sobes76rus opened this issue Feb 11, 2018 · 7 comments

@Sobes76rus commented Feb 11, 2018

Hello!

user-agent: *
disallow: */test

http://mysite.com/test is reported as allowed, but it should be disallowed. Am I right?
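
For reference, a minimal reproduction sketch (assuming reppy's Robots.parse classmethod accepts the robots.txt URL plus its raw content; 'my-user-agent' is an arbitrary agent string):

>>> from reppy.robots import Robots
>>> robots = Robots.parse('http://mysite.com/robots.txt', '\n'.join([
...     'user-agent: *',
...     'disallow: */test',
... ]))
>>> # Expected: False, since */test should match the /test path,
>>> # but the reported result is:
>>> robots.allowed('http://mysite.com/test', 'my-user-agent')
True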

@b4hand (Contributor) commented Feb 12, 2018

I can confirm this fails. However, this is probably an issue in rep-cpp rather than reppy, since rep-cpp handles the actual robots.txt parsing. I'm also not really sure how well rep-cpp supports wildcards. Technically, the original robots.txt specification did not allow wildcards, but we do support several extensions.

@b4hand (Contributor) commented Feb 12, 2018

I've figured out what the issue is. Internally, */test is being normalized to /*/test, because rules are normally absolute paths and can't be relative. But /*/test obviously doesn't match http://mysite.com/test, since that path doesn't contain two slashes. I'm not sure what the best way to handle this is, but I'll file a bug against rep-cpp.
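
As a rough illustration (a regex analogy only, not rep-cpp's actual matching code), treating * as "any sequence of characters" shows why the normalized rule can't match a single-segment path:

>>> import re
>>> bool(re.match('/.*/test', '/test'))   # normalized '/*/test': needs two slashes
False
>>> bool(re.match('.*/test', '/test'))    # original '*/test': matches /test
True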

@b4hand (Contributor) commented Feb 12, 2018

See seomoz/rep-cpp#34.

@Sobes76rus (Author)

thx =)

@jspalink (Contributor) commented Feb 19, 2018

Related to this: I'm noticing a parsing issue even when the wildcard is not leading in a disallow rule. Here's an example:

https://www.theverge.com/robots.txt

The Disallow: /users/*/replies rule essentially causes all allowed() calls to return False.

>>> robots = Robots.fetch('https://www.theverge.com/robots.txt')
>>> robots.allowed('https://www.theverge.com/anything', 'my-user-agent')
False
>>> list(robots.sitemaps)
[]

Coincidentally, it also fails to parse the sitemaps. A bit of exploring reveals this:

>>> robots.__str__()
b'{"*": [Directive(Disallow: /)]}'

Editing a local copy of their robots.txt to remove the offending line resolves the problem.

@dlecocq (Contributor) commented Feb 20, 2018

That particular site returns a 403 for Robots.fetch with the default user agent provided by requests:

>>> robots = Robots.fetch('https://www.theverge.com/robots.txt')
>>> robots
<reppy.robots.AllowNone object at 0x108c55450>
>>> requests.get('https://www.theverge.com/robots.txt')
<Response [403]>

Providing a different user agent resolves the issue:

>>> headers = {'User-Agent': 'Chrome'}
>>> robots = Robots.fetch('https://www.theverge.com/robots.txt', headers=headers)
>>> robots
<reppy.robots.Robots object at 0x109461250>
>>> robots.allowed('https://www.theverge.com/anything', 'my-user-agent')
True
>>> list(robots.sitemaps)
['https://www.theverge.com/sitemaps', 'https://www.theverge.com/sitemaps/videos', 'https://www.theverge.com/sitemaps/google_news']
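
For completeness, a small sketch of the same workaround with a guard against fetch failures (assuming AllowNone can be imported from reppy.robots, as the repr above suggests; 'Chrome' here is just a stand-in user agent string):

>>> from reppy.robots import Robots, AllowNone
>>> robots = Robots.fetch('https://www.theverge.com/robots.txt',
...                       headers={'User-Agent': 'Chrome'})
>>> isinstance(robots, AllowNone)   # True would mean the fetch failed (e.g. a 403)
False
>>> robots.allowed('https://www.theverge.com/anything', 'my-user-agent')
True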

@jspalink (Contributor)

@dlecocq - that works for me. Thank you for the tip!
