
Exception handling in online datapipes #968

Closed

Conversation


@SvenDS9 commented Jan 26, 2023

Fixes #963

Changes

  • Add an option to HTTPReader to skip over URLs that cause problems (see the usage sketch below)
  • Add a test for this behavior
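For context, a rough usage sketch of the new behavior. The skip_on_error flag name and the failing URL are illustrative assumptions, not necessarily the final API:

from torchdata.datapipes.iter import HttpReader, IterableWrapper

# One URL that should work and one pointing at a made-up host that will fail.
urls = IterableWrapper([
    "https://raw.githubusercontent.com/pytorch/data/main/LICENSE",
    "https://this-host-does-not-exist.invalid/file.txt",
])

# With the assumed skip_on_error=True flag, the bad URL is skipped with a
# warning instead of aborting iteration with an exception.
dp = HttpReader(urls, timeout=10, skip_on_error=True)
for url, stream in dp:
    print(url, stream.read(40))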

@facebook-github-bot added the CLA Signed label Jan 26, 2023

SvenDS9 commented Jan 26, 2023

While working on this, I noticed a few things:

# TODO(642): Remove this helper function when https://bugs.python.org/issue42627 is resolved
def _get_proxies() -> Optional[Dict[str, str]]:
    import os

    if os.name == "nt":
        proxies = urllib.request.getproxies()
        address = proxies.get("https")
        # The default proxy type of Windows is HTTP
        if address and address.startswith("https"):
            address = "http" + address[5:]
            proxies["https"] = address
        return proxies
    return None

As python/cpython#86793 seems to have been resolved, this helper can probably be removed.

try:
    with requests.Session() as session:
        proxies = _get_proxies()
        if timeout is None:
            r = session.get(url, stream=True, proxies=proxies, **query_params)  # type: ignore[arg-type]
        else:
            r = session.get(url, timeout=timeout, stream=True, proxies=proxies, **query_params)  # type: ignore[arg-type]
    r.raise_for_status()
    return url, StreamWrapper(r.raw)
except HTTPError as e:
    raise Exception(f"Could not get the file. [HTTP Error] {e.response}.")
except RequestException as e:
    raise Exception(f"Could not get the file at {url}. [RequestException] {e.response}.")
except Exception:
    raise

Here we open a requests Session, which is probably unnecessary as we only do one GET request and then close the session again.

In addition, we convert the exceptions from HTTPError/RequestException into a plain Exception. Why is that?
@NivekT


NivekT commented Jan 27, 2023

@SvenDS9 Thanks for spotting these!

  1. Feel free to remove the proxies workaround, but please test to make sure it is fine.
  2. The Session object is probably not needed unless we want to use it multiple times (e.g. for retry). Feel free to change that if you confirm it is not needed.
  3. I don't really remember why they are turned into general Exceptions. I think raising the original ones should be fine. We do want to make sure that whatever error message is produced, it is easy for users to identify that it comes from HttpReader (one possible shape is sketched below).
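On point 3, a minimal sketch of one way to do that (not code from this PR): keep the original exception types so callers can still catch HTTPError/RequestException, but prefix the message so the failure clearly points at HttpReader, and chain with raise ... from e to preserve the traceback. The _fetch helper below is hypothetical.

import requests
from requests.exceptions import HTTPError, RequestException

def _fetch(url, timeout=None, **query_params):
    try:
        r = requests.get(url, timeout=timeout, stream=True, **query_params)
        r.raise_for_status()
        return url, r.raw
    except HTTPError as e:
        # Same exception type, but the message now names HttpReader.
        raise HTTPError(f"HttpReader: could not get {url} ({e})", response=e.response) from e
    except RequestException as e:
        raise RequestException(f"HttpReader: request for {url} failed ({e})") from e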


SvenDS9 commented Jan 30, 2023

Thanks again for your input!

  1. After carefully reading through the issue, I am not 100% sure that it can be removed safely. The issue only carries 3.9 - 3.11 labels, so I am not sure whether it is a non-issue in Python 3.8. I did find it mentioned in the changelogs of 3.9.13 and 3.10.5, but not in 3.8. If you have additional input on how I can make sure that it works for all versions, please let me know.
  2. After looking at the implementation of requests.get() I noticed that it also opens a session internally, so we can leave this as is. (Maybe we can make use of the session at a later point, e.g. for performance improvements; see https://requests.readthedocs.io/en/latest/user/advanced/#session-objects and the sketch below.)
  3. As the HTTPError/RequestException should contain all necessary information, I have decided to remove this conversion.
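Purely to illustrate point 2 (not part of this PR): reusing a single Session across many requests lets requests keep connections alive, which is where a performance benefit would come from.

import requests

urls = [
    "https://example.com/a.txt",  # placeholder URLs
    "https://example.com/b.txt",
]

# One Session shared across all GETs reuses the underlying connection pool
# (keep-alive) instead of opening a fresh connection per request.
with requests.Session() as session:
    for url in urls:
        r = session.get(url, stream=True, timeout=10)
        r.raise_for_status()
        # hand r.raw off to whatever consumes the stream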

I have also added a few tests and added query_parameters to OnlineReader/GDriveReader for consistency.

@NivekT left a comment


Overall looks good to me. One question on what type of exception we should skip over.

I think retry will be very helpful if we can have that.

cc: @ejguan to have a look at the API
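For the retry idea, a back-of-the-envelope sketch of what a retry wrapper around the fetch helpers could look like (a hypothetical helper, not something this PR adds):

import time

from requests.exceptions import RequestException

def _fetch_with_retry(fetch_fn, url, max_retries=3, backoff=1.0, **kwargs):
    # Retry transient request failures with exponential backoff,
    # re-raising the last error once the retries are exhausted.
    for attempt in range(max_retries):
        try:
            return fetch_fn(url, **kwargs)
        except RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))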

@SvenDS9 marked this pull request as ready for review January 31, 2023 13:25

@ejguan left a comment


Overall LGTM with one comment

Comment on lines 238 to 244
try:
    parts = urllib.parse.urlparse(url)

    if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
        yield _get_response_from_google_drive(url, timeout=self.timeout, **self.query_params)
    else:
        yield _get_response_from_http(url, timeout=self.timeout, **self.query_params)
Contributor

Can you please wrap the try-except around only _get_response_from_google_drive or _get_response_from_http?
In your current implementation, there is a chance that the skipped error comes from parts = urllib.parse.urlparse(url) or from re.match(r"(drive|docs)[.]google[.]com", parts.netloc).
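Something along these lines, where only the actual fetch is guarded (a sketch of the suggested narrowing; skip_on_error is the assumed flag name, and warnings would need to be imported):

parts = urllib.parse.urlparse(url)

if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
    pipe_fn = _get_response_from_google_drive
else:
    pipe_fn = _get_response_from_http

try:
    # Only errors raised by the fetch itself can be skipped; URL parsing and
    # the regex check above still raise to the user.
    yield pipe_fn(url, timeout=self.timeout, **self.query_params)
except Exception as e:
    if self.skip_on_error:
        warnings.warn(f"{e}, skipping url {url}")
    else:
        raise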

Contributor

I will import and merge this after this change. Thanks!

Contributor Author

I have implemented this change, but I don't really understand why it is necessary. In my opinion the source of the exception doesn't really matter if we want to skip over them anyway. With this change exceptions caused by trying to parse the url will not be caught.

Contributor

With this change exceptions caused by trying to parse the url will not be caught.

I think that is the point. If the URL cannot be parsed, perhaps users want to know and fix it. If you cannot get a response, then they may want to skip it.

@facebook-github-bot

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@NivekT merged this pull request in 98222ad.

@SvenDS9 deleted the ExceptionHandlingInOnlineDatapipes branch February 15, 2023 15:01