-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
revert the BC breaking change in enwik9 data #1559
Conversation
@@ -50,4 +50,4 @@ def EnWik9(root: str, split: Union[Tuple[str], str]): | |||
) | |||
|
|||
data_dp = FileOpener(cache_decompressed_dp, mode="b") | |||
return data_dp.readlines(decode=True, return_path=False) | |||
return data_dp.readlines(decode=True, return_path=False, strip_newline=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, it is good to strip the new lines at the end of each line. Can we change this going forward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it breaks BC, but does it matter as we are revamping the library?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, it is good to strip the new lines at the end of each line. Can we change this going forward?
It's a good point, I am not so sure why it is like that when they were first added. In-fact now when I think about it, it's not even consistent across our datasets.
I know it breaks BC, but does it matter as we are revamping the library?
Well, we wanted the migration to be not BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in release notes.
Edit: Also would be great not to mix the BC breaking changes with the migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we wanted the migration to be not BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in release notes.
Edit: Also would be great not to mix the BC breaking changes with the migration.
I also updated the SST2 dataset to return a int
for the labels even though it was returning strings before because that seemed like the expected behavior. Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?
In general, it's a good practice to logically separate PRs that makes it easy for tracking purpose as well as to review :)
Before closing this and #1555, just wanted to check if we are all OK with removal of newlines from datasets? cc: @Nayef211 , @abhinavarora , @erip? Follow-up if yes:
|
I think this makes sense to me. In addition to this, should we identify which datasets should return labels as |
These all seem reasonable to me and easy for users to "revert" independently in userland code if they need to. |
Closing this PR as this is a reasonable BC breaking change. |
In addition to this, should we identify which datasets should return labels as I think this is a reasonable change as well. |
No description provided.