Tiktoken not published to cargo #24
Comments
Very cool @spolu! I'd love to package this code as a separate crate for re-use in different Rust projects. |
To test this out in other projects, I created and published a Rust crate here: https://github.com/zurawiki/tiktoken-rs Ideally, I hope we can integrate these changes back into the original project, so I'll leave this issue open until we hear from a maintainer. |
Nice!! |
Thanks, I'm open to this; I just haven't spent the time to figure out Rust packaging yet :-) I will get around to this at some point. Thanks for the link to your repo! |
This is not my area of expertise, but if I may make a suggestion: you can make a cargo workspace, create a ... https://crates.io/crates/cargo-workspaces is a helper that lets you publish individual projects within a workspace. I haven't used it myself, though. |
Can anyone figure out how to replace the Python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions he tried threading with rayon but noticed it wasn't much faster than Python threads. I am still learning Rust, so I am having a hard time with this... |
I may be mistaken, but see the ... In which case, you would do something like

    pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: HashSet<&str>) -> Vec<Vec<usize>> {
        texts
            .into_par_iter()
            .map(|t| self.encode_native(t, &allowed_special).0)
            .collect()
    }

and

    pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>> {
        texts
            .into_par_iter()
            .map(|t| self.encode_ordinary_native(t))
            .collect()
    }
|
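For context, the Python-side threading that the question above refers to can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: `encode_ordinary` here is a hypothetical stand-in that emits one "token" per UTF-8 byte, and `num_threads` is an assumed parameter name.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_ordinary(text: str) -> list[int]:
    # Hypothetical stand-in for a real tokenizer call: one "token" per UTF-8 byte.
    return list(text.encode("utf-8"))


def encode_ordinary_batch(texts: list[str], num_threads: int = 8) -> list[list[int]]:
    # Executor.map preserves input order, so results line up with `texts`.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(encode_ordinary, texts))


batch = encode_ordinary_batch(["hello", "hi"])
```

In standard CPython, threads like these only run in parallel when the per-item work releases the GIL, which a native extension can do; the pure-Python stand-in above does not. That may be part of why a rayon version would not look much faster than a thread pool calling into native code.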
Hi,

    def gpt2():
        mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
            vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
            encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
        )
        return {
            "name": "gpt2",
            "explicit_n_vocab": 50257,
            "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
            "mergeable_ranks": mergeable_ranks,
            "special_tokens": {"<|endoftext|>": 50256},
        }

Isn't this a waste of time at runtime? These files should not change, and if they did, the result would no longer be a valid GPT-2 encoding, or at least not the one the library was tested with. Maybe provide a newer, tested version and keep the old one but mark it deprecated? |
Sorry for the question. I am separating the code so that the Rust part becomes its own crate; I was looking at a Rust version of the encoder, and this doubt came up while translating it. |
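One way to address both concerns raised above (repeated loading at runtime, and the files silently changing) is to cache the loaded vocabulary per process and check it against a pinned hash. The sketch below only illustrates that idea and is not how the library necessarily does it: `fetch_vocab`, `check_pinned`, `gpt2_vocab`, and the sample bytes are all hypothetical, and no network access is performed.

```python
import functools
import hashlib

# Hypothetical stand-in for the bytes of a downloaded vocabulary file.
FAKE_VOCAB_BYTES = b"#version: 0.2\nh e\nl l\n"
# Digest recorded when this vocabulary version was tested (an assumption for this sketch).
PINNED_SHA256 = hashlib.sha256(FAKE_VOCAB_BYTES).hexdigest()

load_count = 0


def fetch_vocab() -> bytes:
    # Stand-in for a network or disk read; counts how often it runs.
    global load_count
    load_count += 1
    return FAKE_VOCAB_BYTES


def check_pinned(data: bytes, expected_sha256: str) -> bytes:
    # Reject data that differs from the version that was tested.
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"vocabulary hash mismatch: {actual}")
    return data


@functools.lru_cache(maxsize=None)
def gpt2_vocab() -> bytes:
    # The load and the check run once per process; later calls hit the cache.
    return check_pinned(fetch_vocab(), PINNED_SHA256)


first = gpt2_vocab()
second = gpt2_vocab()
```

With this shape, a changed upstream file fails loudly instead of producing a different encoding under the same name, and a Rust port could apply the same two ideas.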
This commit restructures the project from a single-crate workspace into a multi-crate workspace, dividing it into 'rs-tiktoken' and 'py-tiktoken'. This is done to improve the clarity of the organization of the codebase and make the Rust and Python modules separate for easier code maintenance. The setup.py is also updated to reflect these changes in the directory structure. Refs: openai#24
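A workspace split like the one this commit describes could be declared with a root manifest along these lines. This is only a sketch: the member names come from the commit message, while the resolver setting is an assumption and the actual manifest in the referenced commit may differ.

```toml
# Root Cargo.toml (hypothetical layout based on the commit message)
[workspace]
# One member per crate directory named in the commit message.
members = ["rs-tiktoken", "py-tiktoken"]
# Assumed; selects the newer feature resolver for all members.
resolver = "2"
```

Each member directory then carries its own Cargo.toml with a `[package]` section, which is what allows the Rust crate to be published separately from the Python bindings.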
It seems that the tiktoken package is not linkable from Rust using Cargo's default registry.
Are there plans to publish the tiktoken crate? Is it published on another registry?
Thanks for your work on this BPE encoder, I've already found it very useful!
Repro:
In a Rust project, run
Expected behavior:
Cargo should find, download and add tiktoken to the available crates
Actual behavior: