generate unicode strings to test utf-8 handling for all non-IWSLT dataset tests. #1599

erip · 2022-02-10T20:35:09Z

Partially address lingering TODO in #1493.

TODO: add IWSLT which fails due to XML parsing errors -- need to investigate.

erip · 2022-02-10T21:00:21Z

These test failures are actually good. 🔥

erip · 2022-02-10T21:30:47Z

I have a Windows desktop, so instead of using CI as a debugging mechanism I can just wrap this PR up once I'm home. I'm glad we caught these. 😄

Nayef211 · 2022-02-10T22:26:45Z

Thanks for taking this on @erip! Just a couple of questions from my side.

1. Do we know for certain that all of our datasets are UTF-8 encoded?
2. Do you have any initial guesses as to why these test failures are occurring? At first glance it seems like it might be caused by the LineReaderIterDataPipe and the CSVParserIterDataPipe not handling UTF-8 encoded strings correctly.
3. Does this PR update all our dataset tests to use the UTF-8 encoded strings? If so, it might be worthwhile mentioning that in the PR description as well.

erip · 2022-02-10T23:06:54Z

1. I don't know for certain, but I strongly suspect it. I can confirm later by checking the legacy code to load these datasets -- if the encoding is set, that's evidence.
2. I suspect that the text readers either aren't passing the appropriate encoding in 'r' mode or aren't decoding to utf-8 in 'b' mode.
3. All but IWSLTs. Can update for clarity.

Edit: a cursory glance at dataset_utils suggests they are all utf8.
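
For illustration only (this sketch is not part of the PR): when a file is opened in text mode without an explicit encoding, Python falls back to the locale's preferred encoding, which is commonly cp1252 on Windows. The snippet below simulates that by passing the encoding explicitly; the `roundtrip` helper and the sample string are assumptions made up for this example.

```python
import os
import tempfile


def roundtrip(text, read_encoding):
    """Write `text` to a temp file as UTF-8 bytes, then read it back in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(text.encode("utf-8"))
        # Passing the encoding explicitly stands in for whatever the platform default is.
        with open(path, "r", encoding=read_encoding) as f:
            return f.read()
    finally:
        os.remove(path)


sample = "café"
as_utf8 = roundtrip(sample, "utf-8")      # matches the original text
as_cp1252 = roundtrip(sample, "cp1252")   # UTF-8 bytes misread as cp1252
print(repr(as_utf8), repr(as_cp1252))
```

The second read does not raise; it silently produces different characters, which is why a test with non-ASCII fixtures is needed to surface the problem.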

erip · 2022-02-11T15:15:06Z

Tracked down the underlying issue with encoding here on Windows. See pytorch/pytorch#72713 for context.

parmeet · 2022-02-11T15:18:04Z

> I suspect that the text readers either aren't passing the appropriate encoding in 'r' mode or aren't decoding to utf-8 in 'b' mode

It seems the culprit is that our readers don't set decode to True; it is False by default: https://github.com/pytorch/data/blob/9c6e5ddfcdf1061e3968ed5cd9d55754cc713965/torchdata/datapipes/iter/util/plain_text_reader.py#L90

erip · 2022-02-11T15:23:42Z

> Seem like the culprit is not setting decode as True in our readers

Indeed, one option is to read the files in binary and decode appropriately. Ideally the FileOpener could handle opening the file with appropriate encoding/mode for us since extra downstream decoding is a bit of a pain. See the issue in upstream pytorch for the "better" option and alternatives.

parmeet · 2022-02-11T15:38:40Z

> > Seem like the culprit is not setting decode as True in our readers
>
> Indeed, one option is to read the files in binary and decode appropriately. Ideally the FileOpener could handle opening the file with appropriate encoding/mode for us since extra downstream decoding is a bit of a pain. See the issue in upstream pytorch for the "better" option and alternatives.

Agreed. This could (and should) be handled when the file is opened, so that downstream readers do not have to worry about it. Thanks for picking this up @erip. Looking forward to the resolution at the torchdata level. Meanwhile (since I am not sure how long it will take to settle the FileOpener API), can we try closing this PR by reading in binary mode and setting decode to True for the downstream readers? We can then cherry-pick the changes once FileOpener handles the decoding scheme.
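
A minimal sketch of the interim approach described above (read bytes, decode downstream), written against the standard library rather than torchdata. `iter_lines` is a hypothetical stand-in for a line reader with a `decode` flag that defaults to False; it is not the library's implementation.

```python
import io


def iter_lines(stream, decode=False, encoding="utf-8"):
    """Yield lines from a binary stream, optionally decoding each one.

    Hypothetical helper: mirrors the idea of a reader whose `decode`
    flag is off by default, so callers get raw bytes unless they opt in.
    """
    for raw in stream:
        line = raw.rstrip(b"\r\n")  # drop the trailing newline bytes
        yield line.decode(encoding) if decode else line


payload = "naïve\ncafé\n".encode("utf-8")

raw_lines = list(iter_lines(io.BytesIO(payload)))                # bytes objects
text_lines = list(iter_lines(io.BytesIO(payload), decode=True))  # str objects
print(raw_lines, text_lines)
```

Because the bytes are decoded with an explicit encoding, the result no longer depends on the platform's default text encoding.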

erip · 2022-02-11T15:41:19Z

Yes, that seems reasonable.

parmeet · 2022-02-11T15:44:19Z

> Yes, that seems reasonable.

Great! Then let's try to land this before we make the branch-cut. cc: @Nayef211

erip · 2022-02-11T15:58:53Z

OK, I think this should be gtg now @parmeet

parmeet

Great work @erip! This looks good to me. I will merge it once the CI is green for unit-testing.

parmeet · 2022-02-11T16:24:23Z

Oh, one more change @erip: I know the IWSLT 16/17 test suite update is pending, but can we at least update the reading/decoding part for the actual datasets?

text/torchtext/datasets/iwslt2016.py

Lines 334 to 338 in eb61b3f

    
    tgt_data_dp = FileOpener(cache_inner_tgt_decompressed_dp, mode="r")
    src_data_dp = FileOpener(cache_inner_src_decompressed_dp, mode="r")
    src_lines = src_data_dp.readlines(return_path=False, strip_newline=False)
    tgt_lines = tgt_data_dp.readlines(return_path=False, strip_newline=False)

erip · 2022-02-11T16:25:11Z

Good catch, I'll fix that.

erip · 2022-02-11T16:57:18Z

It looks like there's a lingering issue with IMDB that I'm trying to debug. The cache is currently written as text. When I instead write it as bytes, encoding each entry just before the cache is written:

     cache_decompressed_dp = (
         cache_decompressed_dp.lines_to_paragraphs()
     )  # group by label in cache file
+    cache_decompressed_dp = cache_decompressed_dp.map(lambda x: (x[0], x[1].encode()))
     cache_decompressed_dp = cache_decompressed_dp.end_caching(
-        mode="wt",
-        filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x),
+        mode="wb",
+        filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x)
     )

I'm met with the following errors:

test/datasets/test_imdb.py:84: in test_imdb_split_argument
    for d1, d2 in zip_equal(dataset1, dataset2):
test/common/case_utils.py:53: in zip_equal
    for combo in zip_longest(*iterables, fillvalue=sentinel):
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:95: in __iter__
    for data in self.datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../data/torchdata/datapipes/iter/util/plain_text_reader.py:106: in __iter__
    for path, file in self.source_datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/fileopener.py:45: in __iter__
    yield from get_file_binaries_from_pathnames(self.datapipe, self.mode)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/utils/common.py:85: in get_file_binaries_from_pathnames
    for pathname in pathnames:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/combining.py:38: in __iter__
    for data in dp:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../data/torchdata/datapipes/iter/util/saver.py:36: in __iter__
    for filepath, data in self.source_datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:95: in __iter__
    for data in self.datapipe:
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/_typing.py:366: in wrap_generator
    response = gen.send(None)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:96: in __iter__
    yield self._apply_fn(data)
../../opt/miniconda3/envs/torchtext-dev/lib/python3.8/site-packages/torch/utils/data/datapipes/iter/callable.py:69: in _apply_fn
    res = self.fn(data[self.input_col])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

fd = b'\xc3\xafp\xc4\xa9\xc8\xba\xc2\xbd\xc5\xb7\n\xc5\x9b\xe1\x9b\xac\xe2\xb1\xb4\xc7\xbf\xe2\xb1\xa3\xe2\xb1\xa4\xc2\xb6'

    def _read_bytes(fd):
>       return b"".join(fd)
E       TypeError: sequence item 0: expected a bytes-like object, int found

../data/torchdata/datapipes/iter/util/cacheholder.py:215: TypeError

erip · 2022-02-11T17:20:25Z

The test seems OK if I pass skip_read=True to end_caching, but I'm not super comfortable with this decision because I don't quite understand what I'm losing from this...

parmeet · 2022-02-11T22:38:03Z

> The test seems OK if I pass skip_read=True to end_caching, but I'm not super comfortable with this decision because I don't quite understand what I'm losing from this...

It seems that _read_bytes expects a file-like object. Here, since we have already mapped the string to bytes (by encoding it), the function doesn't work with the bytes object (interestingly, _read_str works with both a file-like object and a str, in which case it simply returns the same string). I think it is safe to use skip_read=True since we don't really need to read from a stream. cc: @ejguan to make sure my understanding is right.
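
The asymmetry can be reproduced with plain Python, as a rough sketch. `read_bytes` below copies the one-liner shown in the traceback; `read_str` is an assumed string counterpart included only for comparison, and the payload is made up for this example.

```python
import io


def read_bytes(fd):
    # Same one-liner as the `_read_bytes` frame in the traceback.
    return b"".join(fd)


def read_str(fd):
    # Assumed string counterpart, for comparison only.
    return "".join(fd)


text = "ïp\nś"
payload = text.encode("utf-8")

# Iterating a binary stream yields bytes lines, so the join works.
from_stream = read_bytes(io.BytesIO(payload))

# Iterating a bytes object yields ints, so the join raises TypeError.
try:
    read_bytes(payload)
    bytes_error = None
except TypeError as exc:
    bytes_error = str(exc)

# Iterating a str yields one-character strings, so the join returns the same string.
from_str = read_str(text)
print(from_stream == payload, bytes_error, from_str == text)
```

So the failure comes from handing already-encoded bytes to a function that expects something iterable over bytes chunks, which is consistent with skipping the read step when the data is already in memory.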

parmeet · 2022-02-11T22:45:41Z

> > The test seems OK if I pass skip_read=True to end_caching, but I'm not super comfortable with this decision because I don't quite understand what I'm losing from this...
>
> It seems that _read_bytes expects a file-like object. Here, since we have already mapped the string to bytes (by encoding it), the function doesn't work with the bytes object (interestingly, _read_str works with both a file-like object and a str, in which case it simply returns the same string). I think it is safe to use skip_read=True since we don't really need to read from a stream. cc: @ejguan to make sure my understanding is right.

@erip could you please push this change, so that we can close this PR in prep for branch-cut? Thanks

erip · 2022-02-11T23:23:40Z

Done!

erip · 2022-02-12T00:22:07Z

I think this is good to merge @Nayef211 @parmeet

pytorch-bot bot added the ciflow/default label Feb 10, 2022

facebook-github-bot added the cla signed label Feb 10, 2022

Nayef211 mentioned this pull request Feb 10, 2022

Revamp TorchText Dataset Testing Strategy #1493

Closed

27 tasks

erip changed the title ~~generate unicode strings to test utf-8 handling.~~ generate unicode strings to test utf-8 handling for all non-IWSLT dataset tests. Feb 11, 2022

generate unicode strings to test utf-8 handling. TODO: add to IWSLT.

30c0966

erip force-pushed the feature/unicode-tests branch from 7049f1e to 30c0966 Compare February 11, 2022 14:38

fix flake.

0e9d890

parmeet reviewed Feb 11, 2022

test/datasets/test_amazonreviewfull.py (outdated, resolved)

test/datasets/test_yelpreviewfull.py (outdated, resolved)

erip added 2 commits February 11, 2022 10:54

remove tests which were accidentally added by bad merge conflict resolution.

f081d29

fix encoding issue by reading bytes and decoding utf8 as expected. TODO: replace with FileOpener with appropriate encoding when this lands in upstream pytorch.

6a14b1f

fix flake.

0f7b540

parmeet approved these changes Feb 11, 2022

fix IWSLT encoding issue, as well.

5d3782e

fix issue with reading utf-8 encoding files.

2cb7f31

parmeet merged commit 2e93d94 into pytorch:main Feb 12, 2022

erip deleted the feature/unicode-tests branch February 12, 2022 02:26

erip mentioned this pull request Feb 12, 2022

[META] Add missing unicode generation for IWSLTs #1607

Closed
