
Exception handling in online datapipes #968

Closed

Conversation


@SvenDS9 commented Jan 26, 2023

Fixes #963

Changes

  • Add an option to HTTPReader to skip over URLs that cause problems (see the usage sketch below)
  • Add a test for this behavior
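For context, a rough usage sketch of the new behavior. The skip_on_error flag name and the failing URL are illustrative assumptions, not necessarily the final API:

from torchdata.datapipes.iter import HttpReader, IterableWrapper

# One URL that should work and one pointing at a made-up host that will fail.
urls = IterableWrapper([
    "https://raw.githubusercontent.com/pytorch/data/main/LICENSE",
    "https://this-host-does-not-exist.invalid/file.txt",
])

# With the assumed skip_on_error=True flag, the bad URL is skipped with a
# warning instead of aborting iteration with an exception.
dp = HttpReader(urls, timeout=10, skip_on_error=True)
for url, stream in dp:
    print(url, stream.read(40))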

@facebook-github-bot added the CLA Signed label Jan 26, 2023

SvenDS9 commented Jan 26, 2023

While working on this, I noticed a few things:

# TODO(642): Remove this helper function when https://bugs.python.org/issue42627 is resolved
def _get_proxies() -> Optional[Dict[str, str]]:
    import os

    if os.name == "nt":
        proxies = urllib.request.getproxies()
        address = proxies.get("https")
        # The default proxy type of Windows is HTTP
        if address and address.startswith("https"):
            address = "http" + address[5:]
            proxies["https"] = address
        return proxies
    return None

As python/cpython#86793 seems to have been resolved, this helper can probably be removed.

try:
    with requests.Session() as session:
        proxies = _get_proxies()
        if timeout is None:
            r = session.get(url, stream=True, proxies=proxies, **query_params)  # type: ignore[arg-type]
        else:
            r = session.get(url, timeout=timeout, stream=True, proxies=proxies, **query_params)  # type: ignore[arg-type]
    r.raise_for_status()
    return url, StreamWrapper(r.raw)
except HTTPError as e:
    raise Exception(f"Could not get the file. [HTTP Error] {e.response}.")
except RequestException as e:
    raise Exception(f"Could not get the file at {url}. [RequestException] {e.response}.")
except Exception:
    raise

Here we open a requests Session, which is probably unnecessary as we only do one GET request and then close the session again.

In addition, we convert the exceptions from HTTPError/RequestException into a plain Exception. Why is that?
@NivekT


NivekT commented Jan 27, 2023

@SvenDS9 Thanks for spotting these!

  1. Feel free to remove the proxies workaround, but please test to make sure it is fine.
  2. The Session object is probably not needed unless we want to use it multiple times (e.g. for retry). Feel free to change that if you confirm it is not needed.
  3. I don't really remember why they are turned into general Exceptions. I think raising the original ones should be fine. We do want to make sure that whatever error message is produced, it is easy for users to identify that it comes from HttpReader (one possible shape is sketched below).
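On point 3, a minimal sketch of one way to do that (not code from this PR): keep the original exception types so callers can still catch HTTPError/RequestException, but prefix the message so the failure clearly points at HttpReader, and chain with raise ... from e to preserve the traceback. The _fetch helper below is hypothetical.

import requests
from requests.exceptions import HTTPError, RequestException

def _fetch(url, timeout=None, **query_params):
    try:
        r = requests.get(url, timeout=timeout, stream=True, **query_params)
        r.raise_for_status()
        return url, r.raw
    except HTTPError as e:
        # Same exception type, but the message now names HttpReader.
        raise HTTPError(f"HttpReader: could not get {url} ({e})", response=e.response) from e
    except RequestException as e:
        raise RequestException(f"HttpReader: request for {url} failed ({e})") from e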


SvenDS9 commented Jan 30, 2023

Thanks again for your input!

  1. After carefully reading through the issue, I am not 100% sure that it can be removed safely. The issue only carries 3.9 - 3.11 labels, so I am not sure whether it is a non-issue in Python 3.8. I did find it mentioned in the changelogs of 3.9.13 and 3.10.5, but not in 3.8. If you have additional input on how I can make sure that it works for all versions, please let me know.
  2. After looking at the implementation of requests.get() I noticed that it also opens a session internally, so we can leave this as is. (Maybe we can make use of the session at a later point, e.g. for performance improvements; see https://requests.readthedocs.io/en/latest/user/advanced/#session-objects and the sketch below.)
  3. As the HTTPError/RequestException should contain all necessary information, I have decided to remove this conversion.
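Purely to illustrate point 2 (not part of this PR): reusing a single Session across many requests lets requests keep connections alive, which is where a performance benefit would come from.

import requests

urls = [
    "https://example.com/a.txt",  # placeholder URLs
    "https://example.com/b.txt",
]

# One Session shared across all GETs reuses the underlying connection pool
# (keep-alive) instead of opening a fresh connection per request.
with requests.Session() as session:
    for url in urls:
        r = session.get(url, stream=True, timeout=10)
        r.raise_for_status()
        # hand r.raw off to whatever consumes the stream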

I have also added a few tests and added query_parameters to OnlineReader/GDriveReader for consistency.

@NivekT left a comment


Overall looks good to me. One question on what type of exception we should skip over.

I think retry will be very helpful if we can have that.

cc: @ejguan to have a look at the API
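For the retry idea, a back-of-the-envelope sketch of what a retry wrapper around the fetch helpers could look like (a hypothetical helper, not something this PR adds):

import time

from requests.exceptions import RequestException

def _fetch_with_retry(fetch_fn, url, max_retries=3, backoff=1.0, **kwargs):
    # Retry transient request failures with exponential backoff,
    # re-raising the last error once the retries are exhausted.
    for attempt in range(max_retries):
        try:
            return fetch_fn(url, **kwargs)
        except RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))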

@SvenDS9 marked this pull request as ready for review January 31, 2023 13:25

@ejguan left a comment


Overall LGTM with one comment

Comment on lines 238 to 244
try:
    parts = urllib.parse.urlparse(url)

    if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
        yield _get_response_from_google_drive(url, timeout=self.timeout, **self.query_params)
    else:
        yield _get_response_from_http(url, timeout=self.timeout, **self.query_params)
Contributor

Can you please wrap the try-except around only _get_response_from_google_drive or _get_response_from_http?
In your current implementation, there is a chance that the skipped error comes from parts = urllib.parse.urlparse(url) or from re.match(r"(drive|docs)[.]google[.]com", parts.netloc).
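Something along these lines, where only the actual fetch is guarded (a sketch of the suggested narrowing; skip_on_error is the assumed flag name, and warnings would need to be imported):

parts = urllib.parse.urlparse(url)

if re.match(r"(drive|docs)[.]google[.]com", parts.netloc):
    pipe_fn = _get_response_from_google_drive
else:
    pipe_fn = _get_response_from_http

try:
    # Only errors raised by the fetch itself can be skipped; URL parsing and
    # the regex check above still raise to the user.
    yield pipe_fn(url, timeout=self.timeout, **self.query_params)
except Exception as e:
    if self.skip_on_error:
        warnings.warn(f"{e}, skipping url {url}")
    else:
        raise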

Contributor

I will import and merge this after this change. Thanks!

Contributor Author

I have implemented this change, but I don't really understand why it is necessary. In my opinion the source of the exception doesn't really matter if we want to skip over them anyway. With this change exceptions caused by trying to parse the url will not be caught.

Contributor

With this change exceptions caused by trying to parse the url will not be caught.

I think that is the point. If the URL cannot be parsed, perhaps users want to know and fix it. If you cannot get a response, then they may want to skip it.

@facebook-github-bot

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@NivekT merged this pull request in 98222ad.

@SvenDS9 deleted the ExceptionHandlingInOnlineDatapipes branch February 15, 2023 15:01