Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsound call of set_var #1664

Merged
merged 3 commits into from
Oct 25, 2024
Merged

Conversation

sftse
Copy link
Contributor

@sftse sftse commented Oct 22, 2024

There have been multiple attempts to communicate that the pattern &mut *(&T as *const T as *mut T) is not sound, as in #1485 .

This is the most straightforward modification of the code that passes cargo miri. See #1651 for the fix.
To detect the unsoundness, the parallelism must be turned off under miri, otherwise it falls over because of some unrelated issue in crossbeam-epoch:

MIRIFLAGS='-Zmiri-env-set=TOKENIZERS_PARALLELISM=false' cargo +nightly miri test bpe::trainer::tests::test_train

There are two soundness issuesis one soundness issue fix and one code cleanup in this PR.

The first soundness issue is transforming & to &mut, the second is hidden by the fact that std until recently had a safe env::set_var function. This was a breaking change and is as of now an unsafe function to call.
Lastly, some elided lifetimes were making it quite hard to see whether the use of RefMutContainer as mentioned in #1612 was sound. It does not seem to be unsound.

@sftse sftse force-pushed the unsound-ref-to-mut-ref branch 2 times, most recently from af7d167 to c49b987 Compare October 22, 2024 16:30
@sftse sftse mentioned this pull request Oct 23, 2024
Copy link
Contributor

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really convinced this abstraction pulls its weight compared to the approach in #1651 , it doesn't really internalize any safety invariants

tokenizers/src/models/bpe/trainer.rs Outdated Show resolved Hide resolved
@sftse
Copy link
Contributor Author

sftse commented Oct 23, 2024

I didn't know how sensitive miri was to detecting the UB, so an own module and test seemed appropriate. I agree and have removed it, deferring to #1651 to resolve this issue.

@sftse sftse changed the title Unsound ref to mut ref Unsound call of set_var Oct 23, 2024
@ArthurZucker
Copy link
Collaborator

Thanks a lot for opening the PR and contributing either ways! 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM had no idea set var was unsafe, we might add this mirin check to our CI!

bindings/python/src/utils/pretokenization.rs Outdated Show resolved Hide resolved
…unsafe

It is generally not safe to set env variables. The correct way to set a config
value that needs to be overridden is to hold a copy internal to the library and
only read from the environment.
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 🤗

@ArthurZucker ArthurZucker merged commit 6ea7588 into huggingface:main Oct 25, 2024
12 checks passed
@sftse sftse deleted the unsound-ref-to-mut-ref branch October 25, 2024 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants