Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revert the BC breaking change in enwik9 data #1559

Closed
wants to merge 1 commit into from

Conversation

parmeet
Copy link
Contributor

@parmeet parmeet commented Jan 31, 2022

No description provided.

@@ -50,4 +50,4 @@ def EnWik9(root: str, split: Union[Tuple[str], str]):
)

data_dp = FileOpener(cache_decompressed_dp, mode="b")
return data_dp.readlines(decode=True, return_path=False)
return data_dp.readlines(decode=True, return_path=False, strip_newline=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, it is good to strip the new lines at the end of each line. Can we change this going forward?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it breaks BC, but does it matter as we are revamping the library?

Copy link
Contributor Author

@parmeet parmeet Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, it is good to strip the new lines at the end of each line. Can we change this going forward?

It's a good point, I am not so sure why it is like that when they were first added. In-fact now when I think about it, it's not even consistent across our datasets.

I know it breaks BC, but does it matter as we are revamping the library?

Well, we wanted the migration to be not BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in release notes.

Edit: Also would be great not to mix the BC breaking changes with the migration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we wanted the migration to be not BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in release notes.

Edit: Also would be great not to mix the BC breaking changes with the migration.

I also updated the SST2 dataset to return a int for the labels even though it was returning strings before because that seemed like the expected behavior. Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?

In general, it's a good practice to logically separate PRs that makes it easy for tracking purpose as well as to review :)

@parmeet
Copy link
Contributor Author

parmeet commented Feb 3, 2022

Before closing this and #1555, just wanted to check if we are all OK with removal of newlines from datasets? cc: @Nayef211 , @abhinavarora , @erip?

Follow-up if yes:

  • Fix breaking tests for enwik9 and PennTreeBank by updating the hash code in raw_datasets.jsonl
  • Look for other datasets that return newlines and remove accordingly

@Nayef211
Copy link
Contributor

Nayef211 commented Feb 3, 2022

Before closing this and #1555, just wanted to check if we are all OK with removal of newlines from datasets? cc: @Nayef211 , @abhinavarora , @erip?

Follow-up if yes:

  • Fix breaking tests for enwik9 and PennTreeBank by updating the hash code in raw_datasets.jsonl
  • Look for other datasets that return newlines and remove accordingly

I think this makes sense to me. In addition to this, should we identify which datasets should return labels as int instead of string and update those as well?

@erip
Copy link
Contributor

erip commented Feb 3, 2022

These all seem reasonable to me and easy for users to "revert" independently in userland code if they need to.

@parmeet
Copy link
Contributor Author

parmeet commented Feb 3, 2022

Closing this PR as this is a reasonable BC breaking change.

@parmeet parmeet closed this Feb 3, 2022
@parmeet
Copy link
Contributor Author

parmeet commented Feb 3, 2022

In addition to this, should we identify which datasets should return labels as int instead of string and update those as well?

I think this is a reasonable change as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants