revert the BC breaking change in enwik9 data #1559

parmeet · 2022-01-31T03:14:06Z

No description provided.

abhinavarora · 2022-01-31T04:12:30Z

torchtext/datasets/enwik9.py

@@ -50,4 +50,4 @@ def EnWik9(root: str, split: Union[Tuple[str], str]):
    )

    data_dp = FileOpener(cache_decompressed_dp, mode="b")
-    return data_dp.readlines(decode=True, return_path=False)
+    return data_dp.readlines(decode=True, return_path=False, strip_newline=False)


Usually, it is good to strip the newlines at the end of each line. Can we change this going forward?

I know it breaks BC, but does it matter as we are revamping the library?

> Usually, it is good to strip the newlines at the end of each line. Can we change this going forward?

It's a good point. I'm not sure why it was done that way when the datasets were first added. In fact, now that I think about it, it's not even consistent across our datasets.

> I know it breaks BC, but does it matter as we are revamping the library?

Well, we wanted the migration not to be BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in the release notes.

Edit: It would also be great not to mix the BC breaking changes with the migration.

> Well, we wanted the migration not to be BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in the release notes.
>
> Edit: It would also be great not to mix the BC breaking changes with the migration.

I also updated the SST2 dataset to return an int for the labels, even though it was returning strings before, because that seemed like the expected behavior. Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?

> Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?

In general, it's good practice to keep PRs logically separate; that makes them easier to track as well as to review :)
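
For readers following the thread, the behavior difference under discussion can be sketched in plain Python. This is only an illustration: `read_lines` is a hypothetical stand-in written for this note, not the datapipe's `readlines` implementation; it simply mimics the `strip_newline` flag shown in the diff.

```python
import io


def read_lines(stream, strip_newline=True):
    """Yield decoded lines from a binary stream.

    Hypothetical helper: mimics the effect of a strip_newline flag,
    it is not the library's own code.
    """
    for raw in stream:
        if strip_newline:
            # Drop only the trailing newline, keep other whitespace intact.
            raw = raw.rstrip(b"\n")
        yield raw.decode("utf-8")


sample = b"first line\nsecond line\n"

# New behavior after the migration: trailing newlines are removed.
stripped = list(read_lines(io.BytesIO(sample), strip_newline=True))

# Behavior this PR would restore: each line keeps its trailing newline.
kept = list(read_lines(io.BytesIO(sample), strip_newline=False))
```

With `strip_newline=False`, every yielded string ends in `"\n"`, which is what the pre-migration dataset returned.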

parmeet · 2022-02-03T00:54:39Z

Before closing this and #1555, I just wanted to check whether we are all OK with removing newlines from the datasets. cc: @Nayef211, @abhinavarora, @erip

Follow-up if yes:

- Fix breaking tests for enwik9 and PennTreeBank by updating the hash code in raw_datasets.jsonl
- Look for other datasets that return newlines and remove accordingly
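
To see why the stored hash would need updating, here is a minimal sketch. It assumes the test fingerprints the first returned line with an MD5 over its JSON encoding; that scheme, and the name `first_line_hash`, are assumptions made for illustration and may not match how raw_datasets.jsonl is actually produced.

```python
import hashlib
import json


def first_line_hash(lines):
    """Hash the first item of an iterable (assumed fingerprint scheme)."""
    first = next(iter(lines))
    encoded = json.dumps(first, sort_keys=True).encode("utf-8")
    return hashlib.md5(encoded).hexdigest()


# Made-up sample data, not real dataset content.
with_newline = ["hello world\n", "second line\n"]
without_newline = [line.rstrip("\n") for line in with_newline]

old_hash = first_line_hash(with_newline)
new_hash = first_line_hash(without_newline)

# A single trailing newline is enough to change the digest,
# so any stored expectation has to be regenerated.
hashes_differ = old_hash != new_hash
```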

Nayef211 · 2022-02-03T05:32:40Z

> Before closing this and #1555, I just wanted to check whether we are all OK with removing newlines from the datasets. cc: @Nayef211, @abhinavarora, @erip
>
> Follow-up if yes:
>
> - Fix breaking tests for enwik9 and PennTreeBank by updating the hash code in raw_datasets.jsonl
> - Look for other datasets that return newlines and remove accordingly

This makes sense to me. In addition, should we identify which datasets should return labels as int instead of string and update those as well?

erip · 2022-02-03T12:49:15Z

These all seem reasonable to me and easy for users to "revert" independently in userland code if they need to.
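
As a rough sketch of what such a userland "revert" could look like, the generator below re-appends the trailing newline. `restore_newlines` is a hypothetical helper, and the sample strings are made up; with a datapipe, the same per-item function could presumably be applied through a map step instead.

```python
def restore_newlines(lines):
    """Re-append a trailing newline to each line that lacks one."""
    for line in lines:
        yield line if line.endswith("\n") else line + "\n"


# Stand-in for what a dataset might yield once newlines are stripped.
stripped_lines = ["first line", "second line"]

restored = list(restore_newlines(stripped_lines))

# Applying the helper twice is harmless: lines already ending in "\n" pass through.
restored_twice = list(restore_newlines(restored))
```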

parmeet · 2022-02-03T14:58:54Z

Closing this PR, since stripping the newlines is a reasonable BC breaking change.

parmeet · 2022-02-03T15:00:26Z

> In addition to this, should we identify which datasets should return labels as int instead of string and update those as well?

I think this is a reasonable change as well.
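
The label change is similarly easy to adapt to in user code. The sketch below assumes samples are `(text, label)` pairs, which is an assumption for illustration rather than a confirmed sample layout for any particular dataset; both helper names are hypothetical.

```python
def labels_to_int(samples):
    """Convert string labels such as "1" to integers."""
    for text, label in samples:
        yield text, int(label)


def labels_to_str(samples):
    """Convert integer labels back to strings for code that expects the old type."""
    for text, label in samples:
        yield text, str(label)


# Made-up samples in the assumed (text, label) layout.
old_style = [("good movie", "1"), ("bad movie", "0")]

as_int = list(labels_to_int(old_style))
round_trip = list(labels_to_str(as_int))
```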

revert the BC breaking change in enwik9 data

229ee9f

pytorch-bot bot added the ciflow/default label Jan 31, 2022

facebook-github-bot added the cla signed label Jan 31, 2022

abhinavarora mentioned this pull request Jan 31, 2022

Revert BC breaking change in PennTreebank dataset #1555

Closed

abhinavarora reviewed Jan 31, 2022

parmeet closed this Feb 3, 2022

This was referenced Feb 3, 2022

[Bug-Fix] fix hash for datasets testing #1572

Merged

fixes platform-dependent expectation for Multi30k mocked test. #1593

Merged
