RobotsCache: Exceptions cannot be handled (not being re-thrown) #116

Closed
dosisod opened this issue Sep 11, 2019 · 2 comments

Comments

@dosisod

dosisod commented Sep 11, 2019

When checking whether a page is allowed and an exception is thrown, a traceback is printed to the screen but the exception is not re-thrown. The allowed() function simply returns False, which makes it hard to differentiate between disallowed sites and offline sites.

from reppy.cache import RobotsCache

rob = RobotsCache(capacity=100)

# localhost:1234 is not bound, so this is meant to error.
# A ConnectionException is raised; it is printed but not re-thrown.
out = rob.allowed("http://localhost:1234", "bot")

print(out)  # prints False

This is similar to #110 in that there is only a True or False output, but here an exception is also being swallowed instead of thrown.

I would put in a pull request, but I am having trouble figuring out where and how to test the code I would need to change. My suggestion is to add an optional rethrow=False param here:

def allowed(self, url, agent):

This keeps backwards compatibility while adding the ability to re-throw instead of returning False (if desired).
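A rough sketch of what that could look like (the rethrow parameter is hypothetical and the body is simplified; this is not reppy's actual implementation):

def allowed(self, url, agent, rethrow=False):
    # Hypothetical sketch of the proposal above, not reppy's real code.
    try:
        return self.get(url).allowed(url, agent)
    except Exception:
        if rethrow:
            raise  # let the caller distinguish failures from disallowed URLs
        return False  # current behavior: treat any failure as disallowed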

@dlecocq
Contributor

dlecocq commented Sep 11, 2019

Great question! It's one I had to think back on a bit to remember how we manage this. I'll take you on a quick tour, but there's an easy change to accomplish what it sounds like you want.

So, the Robots class, which does all the fetching, parsing, etc., allows users to pass through to fetch any arguments they'd pass to requests, so a lot of behavior can be controlled there. The Cache classes have even more bells and whistles, primarily driven by policies in reppy.cache.policy. At the moment there are two implemented: 1) one that returns a default object (for RobotsCache, the default cache policy returns a Robots object that returns False for every URL if an exception is raised: https://github.com/seomoz/reppy/blob/master/reppy/cache/__init__.py#L79), and 2) one that re-raises the exception. The first might be useful in situations where we just want to answer the question "are we for sure allowed to fetch this URL?", and anything other than successfully sussing that out would mean "no."
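For example, fetching directly with a requests option passed through (a small sketch; Robots.fetch is reppy's documented entry point, and timeout here is a requests argument being forwarded, not a reppy option):

from reppy.robots import Robots

# Fetch and parse a robots.txt directly; extra kwargs go through to requests.
robots = Robots.fetch('http://example.com/robots.txt', timeout=2)
print(robots.allowed('http://example.com/some/page', 'bot'))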

But if you have code that wants to deal with the nuance of different types of errors, you absolutely can! You could also extend CachePolicyBase to implement whatever behavior you'd like if the built-in ones don't suit your needs. But if you are just looking to have exceptions raised:

from reppy.cache import RobotsCache
from reppy.cache.policy import ReraiseExceptionPolicy

rob = RobotsCache(capacity=100, cache_policy=ReraiseExceptionPolicy(60))

# localhost:1234 is not bound, so this is meant to error.
# The ConnectionException is actually thrown now.
out = rob.allowed("http://localhost:1234", "bot")

print(out)  # never reached: the exception propagates, so out is never assigned
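With that policy in place, the caller can tell unreachable hosts apart from disallowed URLs. A minimal sketch, assuming the error surfaces as the ConnectionException mentioned above (check reppy.exceptions, or your traceback, for the exact class):

from reppy.cache import RobotsCache
from reppy.cache.policy import ReraiseExceptionPolicy
from reppy.exceptions import ConnectionException

rob = RobotsCache(capacity=100, cache_policy=ReraiseExceptionPolicy(60))

try:
    allowed = rob.allowed("http://localhost:1234", "bot")
except ConnectionException:
    # the host is unreachable, as opposed to disallowed
    allowed = False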

@dosisod
Author

dosisod commented Sep 12, 2019

Thank you, this solved my problem. The solution was a lot easier than I anticipated. Thanks for the quick response!

dosisod closed this as completed Sep 12, 2019