Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport improvements from the Dolma repo #7

Open
dirkgr opened this issue Sep 20, 2023 · 1 comment
Open

Backport improvements from the Dolma repo #7

dirkgr opened this issue Sep 20, 2023 · 1 comment
Assignees

Comments

@dirkgr
Copy link
Member

dirkgr commented Sep 20, 2023

@chris-ha458 has made some great improvements to BFF in the https://github.com/allenai/dolma repo. We should back-port those changes here, especially the ones that have to do with correctness (like the ones involving the choice of hash functions).

Chris' PRs are here:

They won't apply 1:1, because things have changed in the Dolma repo, but at least the important things should carry over.

@chris-ha458
Copy link
Contributor

From what has been communicated to me in discord, the plan is to encapsulate and refactor out the implementation within dolma to use this crate.

If this is correct, I'll try to bring the work here myself as well (a bit busy so cannot commit on any specifics more than that)

In the meantime I'll share how I approached and prepared the original PRs.

I tended to follow the hints from cargo clippy and when everything was clear there, cargo clippy -- -Wclippy::pedantic
pedantic can potentially lead to more false positives, but it is still useful to understand whether they are indeed false positives and why. If it is indeed spurious, we can document and allow the lints individually or crate wide.

I also tried to maximize refactoring and modularization, which for me is helpful to understand and lint against.

Some more semantic-syntactic changes I was looking for was investigating the relevant mathematic formula and Rust code, taking into detail how they match and differ. Making clear what was or could be done for code readability and performance was helpful.

Supporting all of this is of course more tests, and I think adding tests here in an organic manner could be the first step forward, both enhancing test coverage here and allowing for more dynamic and robust refactors and fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants