Tiktoken not published to cargo #24

Open
zurawiki opened this issue Jan 24, 2023 · 11 comments · May be fixed by #167

Comments

@zurawiki

It seems that the tiktoken package is not linkable from Rust using Cargo's default registry.

Are there plans to publish the tiktoken crate? Is it published on another registry?

Thanks for your work on this BPE encoder, I've already found it very useful!


Repro:

In a Rust project, run:

cargo add tiktoken

Expected behavior:

Cargo should find, download and add tiktoken to the available crates

Actual behavior:

$ cargo add tiktoken
    Updating crates.io index
error: the crate `tiktoken` could not be found in registry index.
@spolu

spolu commented Feb 2, 2023

@zurawiki
Author

zurawiki commented Feb 2, 2023

Very cool @spolu! I'd love to package this code as a separate crate for reuse in different Rust projects.

@zurawiki
Author

zurawiki commented Feb 2, 2023

For testing this out in other projects, I created and published a Rust crate here: https://github.com/zurawiki/tiktoken-rs

Ideally, I hope we can integrate these changes back into the original project, so I'll leave this issue open until we hear from a maintainer.

@spolu

spolu commented Feb 2, 2023

Nice!!

@hauntsaninja
Collaborator

Thanks, I'm open to this, I just haven't spent the time to figure out Rust packaging yet :-)

I will get around to this at some point, thanks for the link to your repo!

@Emasoft

This comment was marked as off-topic.

@DhruvDh

DhruvDh commented Mar 3, 2023

This is not my area of expertise, but if I may make a suggestion:

You can make a Cargo workspace, create a tiktoken-lib or tiktoken-core Rust project, and then import it within the current lib.rs. That way it is housed within this repository itself.

https://crates.io/crates/cargo-workspaces is a helper which can allow you to publish individual projects within a workspace. I haven't used it myself though.

@smahm006

Can anyone figure out how to replace the Python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions having tried threading with rayon and found that it wasn't much faster than Python threads.

I am still learning Rust so I am having a hard time with this...

@jremb

jremb commented Mar 24, 2023

Can anyone figure out how to replace the Python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions having tried threading with rayon and found that it wasn't much faster than Python threads.

I may be mistaken, but see the batch methods here https://github.com/openai/tiktoken/blob/main/tiktoken/core.py

In which case, you would do something like

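// Requires `use rayon::prelude::*;` (for `into_par_iter`) and `use std::collections::HashSet;`.
pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: HashSet<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        // Each text is encoded independently; `.0` takes the first element of the returned tuple.
        .map(|t| self.encode_native(t, &allowed_special).0)
        .collect()
}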

and

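// Requires `use rayon::prelude::*;` (for `into_par_iter`).
pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        // `collect` on a parallel iterator over a Vec keeps the input order.
        .map(|t| self.encode_ordinary_native(t))
        .collect()
}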
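
For anyone who wants to try the batch shape above without pulling in rayon, here is a minimal sketch using only the standard library. It is not the library's implementation: `ToyEncoder` and its byte-per-token `encode_ordinary` are hypothetical stand-ins for the real encoder, and `std::thread::scope` is swapped in for rayon's parallel iterator. It spawns one thread per text, which is simpler than rayon's work-stealing pool and likely slower for large batches.

```rust
use std::thread;

/// Hypothetical stand-in for the real tokenizer: one "token" per byte.
struct ToyEncoder;

impl ToyEncoder {
    /// Toy single-text encoder (assumption: not the real BPE logic).
    fn encode_ordinary(&self, text: &str) -> Vec<usize> {
        text.bytes().map(|b| b as usize).collect()
    }

    /// Encode every text on its own scoped thread, preserving input order.
    fn encode_ordinary_batch(&self, texts: &[&str]) -> Vec<Vec<usize>> {
        thread::scope(|s| {
            // Spawn first, so all threads run concurrently...
            let handles: Vec<_> = texts
                .iter()
                .map(|&t| s.spawn(move || self.encode_ordinary(t)))
                .collect();
            // ...then join in the same order the texts were given.
            handles
                .into_iter()
                .map(|h| h.join().expect("encoder thread panicked"))
                .collect()
        })
    }
}
```

Calling `ToyEncoder.encode_ordinary_batch(&["ab", "c"])` returns one token vector per input, in input order. Scoped threads are used so the closures can borrow `self` and the input strings without any `Arc` or cloning.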

@Miuler

Miuler commented Jul 18, 2023

Hi,
A question: why is mergeable_ranks downloaded at runtime? Why not keep the files checked into the repo?

def gpt2():
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
        vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
        encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
    )
    return {
        "name": "gpt2",
        "explicit_n_vocab": 50257,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 50256},
    }

Isn't this a waste of time at runtime? These files should not change, and if they did, the result would no longer be valid for gpt2, or at least would not be the version the library was tested against. If a newer version is ever needed, maybe add it as a separate, tested encoding and keep the old one but mark it deprecated?

Sorry for the question. I am separating the code so that the Rust part becomes its own crate, and this doubt came up while I was looking at a Rust version of the encoder and translating it.
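
One possible middle ground between downloading on every run and vendoring the files is a local cache that only fetches on a miss. The sketch below illustrates that idea and nothing more: `cache_path`, `read_cached`, the cache directory, and the `fetch` callback are all hypothetical names, the file-name scheme is a simple character substitution chosen for illustration, and no claim is made that this matches what the library does.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Map a URL to a file name inside the cache directory.
/// Assumption: simple character substitution; a real cache would likely use a hash.
fn cache_path(cache_dir: &Path, url: &str) -> PathBuf {
    let name: String = url
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '.' { c } else { '_' })
        .collect();
    cache_dir.join(name)
}

/// Return the bytes for `url`, calling `fetch` only when no cached copy exists.
fn read_cached<F>(cache_dir: &Path, url: &str, fetch: F) -> io::Result<Vec<u8>>
where
    F: FnOnce(&str) -> io::Result<Vec<u8>>,
{
    let path = cache_path(cache_dir, url);
    if path.exists() {
        // Cache hit: no network access needed.
        return fs::read(&path);
    }
    // Cache miss: fetch once, then store the bytes for later runs.
    let bytes = fetch(url)?;
    fs::create_dir_all(cache_dir)?;
    fs::write(&path, &bytes)?;
    Ok(bytes)
}
```

The `fetch` callback keeps the sketch independent of any HTTP client. This does not address the concern about the remote files changing; checking the bytes against a known checksum before use would be one way to cover that.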

Miuler added a commit to Miuler/tiktoken that referenced this issue Jul 18, 2023
This commit restructures the project from a single-crate workspace into a multi-crate workspace, dividing it into 'rs-tiktoken' and 'py-tiktoken'. This is done to improve the clarity of the organization of the codebase and make the Rust and Python modules separate for easier code maintenance. The setup.py is also updated to reflect these changes in the directory structure.

Refs: openai#24