tests : add test-tokenizers-remote #13846

Open: wants to merge 6 commits into base: master

CISC (Collaborator) commented May 28, 2025

Adds test-tokenizers-remote that downloads vocab files from HF ggml-org/vocabs and runs test-tokenizer-0 on the files.

This incidentally sent me down the rabbit hole of trying to find out what was wrong with the RWKV tokenizer; turns out it was the HF tokenizer all along! :P

github-actions bot added the "testing" label (Everything test related) on May 28, 2025
}
}

if (common_download_file_multiple(files, {}, false)) {
Collaborator

I suspect you can get rate-limited if the list of files is long. Maybe just download one file at a time here.

Collaborator Author

Possibly, downloading one at a time will make the log look nicer too... :)
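
For illustration, the one-file-at-a-time approach discussed above could be sketched as follows. This is only a sketch under stated assumptions: `download_fn` and `download_sequentially` are hypothetical names, not llama.cpp API, and the callback stands in for a single-file downloader such as `common_download_file_single`.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback type: downloads one URL to one local path and
// returns true on success. In the PR this role would be played by a
// single-file downloader such as common_download_file_single.
using download_fn = std::function<bool(const std::string & url, const std::string & path)>;

// Hypothetical helper: fetch (url, path) pairs strictly one after another,
// so only one request is in flight at a time and log output stays in order.
static bool download_sequentially(
        const std::vector<std::pair<std::string, std::string>> & files,
        const download_fn & download_one) {
    for (const auto & file : files) {
        if (!download_one(file.first, file.second)) {
            return false; // stop at the first failure
        }
    }
    return true;
}
```

Keeping a single request in flight may lower the chance of hitting a rate limit, at the cost of a longer total download time.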

@@ -87,3 +87,10 @@ struct common_remote_params {
};
// get remote file content, returns <http_code, raw_response_body>
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);

// download one single file from remote URL to local path
bool common_download_file_single(const std::string & url, const std::string & path, const std::string & bearer_token, bool offline);
Collaborator

Maybe we don't need to expose these functions. Instead, use common_remote_get_content, then write the response content to a file using fstream.

Collaborator Author

Maybe, but then I'll lose all the fancy functionality (caching, multi-threaded download, etc).
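
For reference, the fstream alternative suggested above might look roughly like this. It is a minimal sketch: `write_body_to_file` is a hypothetical helper, and its `body` argument is assumed to be the `std::vector<char>` from the pair returned by `common_remote_get_content`.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of llama.cpp): write a raw response body,
// e.g. the second element of the pair returned by common_remote_get_content,
// to a local file. Returns true only if every byte was written.
static bool write_body_to_file(const std::vector<char> & body, const std::string & path) {
    // Binary mode avoids newline translation on Windows.
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false; // could not open the destination
    }
    out.write(body.data(), static_cast<std::streamsize>(body.size()));
    out.flush();
    return out.good();
}
```

As the reply notes, a plain write like this does not provide the caching or multi-threaded download that the existing download helpers offer.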

CISC (Collaborator, Author) commented May 28, 2025

Sigh, not sure where Release in the binary path for windows-latest-cmake comes from, nor how to detect that. Anyone got any clues?

Test command: D:\a\llama.cpp\llama.cpp\build\bin\Release\test-tokenizers-remote.exe
Working Directory: D:/a/llama.cpp/llama.cpp/build/bin

Comment on lines 101 to 103
if (LLAMA_CURL)
llama_build_and_test(test-tokenizers-remote.cpp WORKING_DIRECTORY ${CMAKE_RUNTIME_OUTPUT_DIRECTORY})
endif()
ggerganov (Member), May 29, 2025

Can you run it with a destination directory argument like the rest of the tests (i.e. ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models)? This should work around the Release prefix problem.

Collaborator Author

That's not the issue, I need it to find the test-tokenizer-0 executable.

Member

Oh, I didn't realize this is system calling another executable. Hm.. I think we need to figure out something better. A lot of this code would be trivial for a Python script - maybe we should figure out some way to run Python scripts as part of the CI.

Collaborator Author

It would have taken me less time to do it in Python for sure, but it has been educational not to so far. :)
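
On the test-tokenizer-0 lookup discussed in this thread: multi-config CMake generators such as Visual Studio place binaries in a per-configuration subdirectory, which is the likely source of the Release component in the path. One possible way to sidestep it, sketched here as an assumption rather than what the PR does, is to resolve the sibling executable relative to the running binary's own path (for example `argv[0]`). `sibling_executable` is a hypothetical helper.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: build the path of an executable that sits in the same
// directory as the currently running one. `self` would typically come from
// argv[0]; this assumes argv[0] carries a usable directory component, as it
// does in the CTest log above.
static fs::path sibling_executable(const fs::path & self, const std::string & name) {
    fs::path candidate = self.parent_path() / name;
    // Reuse the running binary's extension (".exe" on Windows, usually empty elsewhere).
    candidate += self.extension();
    return candidate;
}
```

With this, a binary started as `build/bin/Release/test-tokenizers-remote.exe` would look for `build/bin/Release/test-tokenizer-0.exe`, whatever the configuration directory is called. It would not help if `argv[0]` is a bare program name resolved through PATH.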
