Add filter_fn to all dataset builders #1768

Closed
RdoubleA opened this issue Oct 8, 2024 · 3 comments
Labels
better engineering: Tasks which help improve eng productivity, e.g. building tools, cleaning up code, writing docs
community help wanted: We would love the community's help completing this issue

Comments

@RdoubleA
Contributor

RdoubleA commented Oct 8, 2024

Filtering is a common operation for any HF-based dataset. All dataset builders should expose filter_fn; right now only text_completion_dataset does, and maybe a couple of others.

RdoubleA added the "community help wanted" and "better engineering" labels Oct 8, 2024
@krammnic
Contributor

krammnic commented Oct 9, 2024

Among the generic dataset classes, I assume PreferenceDataset is the only one that still lacks it (it is not required in ConcatDataset and PackDataset, because they are composite). It should probably also be added to the builders API with a default value of None. Will open a PR soon.

@RdoubleA
Contributor Author

RdoubleA commented Oct 9, 2024

@krammnic you're on an incredible streak; a PR would be much appreciated. Yes, among the generic datasets and builders, PreferenceDataset + preference_dataset, instruct_dataset, and chat_dataset need it. A default value of None makes sense to me.
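
For illustration only, here is a minimal sketch of the shape being discussed: a builder that takes an optional filter_fn defaulting to None. This is not torchtune code. load_rows and preference_rows are hypothetical names, and a plain list of dicts stands in for the Hugging Face dataset; in a real builder the predicate would presumably be handed to the underlying dataset's own filter method instead.

```python
from typing import Any, Callable, Dict, List, Optional

Sample = Dict[str, Any]


def load_rows(source: str) -> List[Sample]:
    """Hypothetical stand-in for loading a dataset; returns in-memory rows."""
    return [
        {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
        {"prompt": "", "chosen": "n/a", "rejected": "n/a"},
        {"prompt": "Capital of France?", "chosen": "Paris", "rejected": "Rome"},
    ]


def preference_rows(
    source: str,
    filter_fn: Optional[Callable[[Sample], bool]] = None,
) -> List[Sample]:
    """Hypothetical builder: keep only rows for which filter_fn returns True.

    With the default of None, no filtering happens and every row is kept.
    """
    rows = load_rows(source)
    if filter_fn is not None:
        rows = [row for row in rows if filter_fn(row)]
    return rows


# Default behaviour is unchanged: every row comes back.
all_rows = preference_rows("demo")

# Opt in to filtering: drop rows whose prompt is empty.
non_empty = preference_rows("demo", filter_fn=lambda row: len(row["prompt"]) > 0)
```

Defaulting to None keeps existing callers working as before, since filtering only happens when a predicate is passed explicitly.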

@krammnic
Contributor

krammnic commented Oct 9, 2024

Created a PR to fix this. I also think adding filter_fn to the other builder APIs is a good idea.
