-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generate unicode strings to test utf-8 handling for all non-IWSLT dataset tests. #1599
Conversation
These test failures are actually good. 🔥 |
I have a Windows desktop, so instead of using CI as a debugging mechanism I can just wrap this PR up once I'm home. I'm glad we caught these. 😄 |
Thanks for taking this on @erip! Just a couple of questions from my side.
|
Edit: a cursory glance at dataset_utils suggests they are all utf8. |
7049f1e
to
30c0966
Compare
Tracked down the underlying issue with encoding here on Windows. See pytorch/pytorch#72713 for context. |
Seem like the culprit is not setting |
Indeed, one option is to read the files in binary and decode appropriately. Ideally the |
Agreed. This could (should) be handled at File Opening so that downstream readers do not have to worry about it. Thanks for picking this up @erip. Looking forward to the resolution at torchdata level. Meanwhile (since I am not sure if it will take longer to resolve APIs etc for FileOpener), can we try closing this PR by reading in binary mode and setting |
Yes, that seems reasonable. |
Great! Then let's try to land this before we make the branch-cut. cc: @Nayef211 |
…DO: replace with FileOpener with appropriate encoding when this lands in upstream pytorch.
OK, I think this should be gtg now @parmeet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @erip! This looks good to me. I will merge it once the CI is green for unit-testing.
Oh one more change @erip: I know IWSLT 16/17 test suit update is pending, but can we at-least update the reading/decoding part for actual datasets text/torchtext/datasets/iwslt2016.py Lines 334 to 338 in eb61b3f
|
Good catch, I'll fix that. |
It looks like there's a lingering issue that I'm trying to debug with IMDB. The cache is written as text, but when I try to change it to being written as bytes and taking appropriate encoding compensation before the cache is written... cache_decompressed_dp = (
cache_decompressed_dp.lines_to_paragraphs()
) # group by label in cache file
+ cache_decompressed_dp = cache_decompressed_dp.map(lambda x: (x[0], x[1].encode()))
cache_decompressed_dp = cache_decompressed_dp.end_caching(
- mode="wt",
- filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x),
+ mode="wb",
+ filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x)
) I'm met with the following errors:
|
The test seems OK if I pass |
It seems that |
@erip could you please push this change, so that we can close this PR in prep for branch-cut? Thanks |
Done! |
Partially address lingering TODO in #1493.
TODO: add IWSLT which fails due to XML parsing errors -- need to investigate.