Exception handling in online datapipes #968
Conversation
While working on this, I noticed a few things:

data/torchdata/datapipes/iter/load/online.py, lines 21 to 33 in 2ca1fa6

As the issue python/cpython#86793 seems to have been resolved, this can probably be removed.

data/torchdata/datapipes/iter/load/online.py, lines 40 to 53 in 2ca1fa6

Here we open a requests Session, which is probably unnecessary, as we only make a single GET request and then close the session again. In addition, we convert the exceptions raised by requests.
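For reference, a minimal sketch of what dropping the Session indirection could look like. This is an illustration, not the merged code; the StreamWrapper import and the error handling are assumptions modeled on the surrounding file.

```python
import requests

from torchdata.datapipes.utils import StreamWrapper


def _get_response_from_http(url, *, timeout=None):
    # For a one-off request, a plain requests.get is enough; a Session
    # only pays off when several requests share connection state.
    response = requests.get(url, timeout=timeout, stream=True)
    response.raise_for_status()  # surface HTTP errors as exceptions
    return url, StreamWrapper(response.raw)
```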
@SvenDS9 Thanks for spotting these!
Also add some tests
Remove unnecessary code
Thanks again for your input!
I have also added a few tests and added query_parameters.
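As a rough usage illustration (not necessarily the merged API), extra keyword arguments could be forwarded as query parameters to the underlying requests call; the keyword names below are assumptions.

```python
from torchdata.datapipes.iter import HttpReader, IterableWrapper

# Hypothetical usage: extra keyword arguments are forwarded to the
# underlying requests call (e.g. headers, params, proxies).
dp = IterableWrapper(["https://example.com/data.csv"])
reader = HttpReader(dp, timeout=10, headers={"User-Agent": "torchdata"})
for url, stream in reader:
    print(url, stream.read(64))
```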
Overall looks good to me. One question on what type of exception we should skip over.
I think retry will be very helpful if we can have that.
cc: @ejguan to have a look at the API
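Retry did not land in this PR, but a minimal sketch of retrying transient request errors while re-raising everything else might look like the following; all names and the backoff policy are hypothetical.

```python
import time

import requests


def get_with_retry(url, *, timeout=None, max_retries=3, backoff=1.0):
    """Hypothetical helper: retry transient network errors, re-raise the rest."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=timeout, stream=True)
            response.raise_for_status()
            return response
        except (requests.ConnectionError, requests.Timeout):
            # Transient network errors are worth retrying with backoff.
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)
        # Other exceptions (e.g. HTTPError from raise_for_status) propagate.
```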
Overall LGTM with one comment
```python
try:
    parts = urllib.parse.urlparse(url)

    if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
        yield _get_response_from_google_drive(url, timeout=self.timeout, **self.query_params)
    else:
        yield _get_response_from_http(url, timeout=self.timeout, **self.query_params)
```
Can you please wrap the try-except around `_get_response_from_google_drive` or `_get_response_from_http` instead? In your current implementation, there is a chance the skipped error comes from `parts = urllib.parse.urlparse(url)` or `re.match(r"(drive|docs)[.]google[.]com", parts.netloc)`.
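Concretely, the suggestion shrinks the guarded region to just the network call. A sketch of the resulting method body, assuming a skip_on_error flag as discussed in this thread and warning-based reporting:

```python
import re
import urllib.parse
import warnings


def __iter__(self):
    for url in self.source_datapipe:
        # Parse errors propagate: a malformed URL is a user-visible bug,
        # not something to skip silently.
        parts = urllib.parse.urlparse(url)
        try:
            if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
                yield _get_response_from_google_drive(url, timeout=self.timeout, **self.query_params)
            else:
                yield _get_response_from_http(url, timeout=self.timeout, **self.query_params)
        except Exception as e:
            # Only failures from fetching the response are skippable.
            if self.skip_on_error:
                warnings.warn(f"{e}, skipping...")
            else:
                raise
```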
I will import and merge this after this change. Thanks!
I have implemented this change, but I don't really understand why it is necessary. In my opinion, the source of the exception doesn't really matter if we want to skip over it anyway. With this change, exceptions caused by trying to parse the URL will not be caught.
> With this change, exceptions caused by trying to parse the URL will not be caught.
I think that is the point. If the URL cannot be parsed, perhaps users want to know and fix it. If you cannot get a response, then they may want to skip it.
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Fixes #963
Changes
- `HTTPReader`: skip over URLs causing problems
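A hedged usage sketch of the resulting behavior; skip_on_error is the opt-in flag discussed in this thread, and the URLs are placeholders.

```python
from torchdata.datapipes.iter import HttpReader, IterableWrapper

urls = [
    "https://example.com/exists.txt",
    "https://example.com/missing.txt",  # would raise without skip_on_error
]
# With skip_on_error=True, a failing URL emits a warning and is skipped
# instead of stopping iteration with an exception.
dp = HttpReader(IterableWrapper(urls), timeout=10, skip_on_error=True)
for url, stream in dp:
    print(url)
```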