tests : add test-tokenizers-remote #13846
base: master
}
}

if (common_download_file_multiple(files, {}, false)) {
I suspect you could get rate-limited if the list of files is long. Maybe just download one file at a time here.
Possibly, downloading one at a time will make the log look nicer too... :)
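The one-at-a-time idea could be sketched roughly as below. This is only an illustration, not the code in the PR: `download_fn` and `download_files_sequentially` are hypothetical names, the callback merely mirrors the shape of `common_download_file_single` as declared in this diff, and the `(url, path)` pair layout of `files` is an assumption.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback type with the same parameter shape as
// common_download_file_single: (url, path, bearer_token, offline) -> success.
using download_fn = std::function<bool(const std::string &, const std::string &, const std::string &, bool)>;

// Download each (url, path) pair in order, one request at a time, and stop
// at the first failure. Returns the number of files fetched successfully.
// Assumption: `files` holds (url, local path) pairs.
static size_t download_files_sequentially(
        const std::vector<std::pair<std::string, std::string>> & files,
        const download_fn & download_one,
        const std::string & bearer_token = "",
        bool offline = false) {
    size_t n_ok = 0;
    for (const auto & file : files) {
        if (!download_one(file.first, file.second, bearer_token, offline)) {
            break; // do not keep hitting the server after a failure
        }
        n_ok++;
    }
    return n_ok;
}
```

Passing the downloader as a callback keeps the sketch independent of the library; in the test itself the loop would presumably call the real single-file function directly.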
@@ -87,3 +87,10 @@ struct common_remote_params {
};
// get remote file content, returns <http_code, raw_response_body>
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);

// download one single file from remote URL to local path
bool common_download_file_single(const std::string & url, const std::string & path, const std::string & bearer_token, bool offline);
Maybe we don't need to expose these functions. Instead, use common_remote_get_content, then write the response content to a file using fstream.
Maybe, but then I'll lose all the fancy functionality (caching, multi-threaded download, etc).
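For reference, the fstream alternative suggested above might look something like the following. It is a minimal sketch under stated assumptions: `write_response_to_file` is a hypothetical helper, its input is the `<http_code, raw_response_body>` pair that `common_remote_get_content` is documented to return, and treating only HTTP 200 as success is an assumption. As noted in the reply, this route would not provide caching or multi-threaded downloads.

```cpp
#include <fstream>
#include <ios>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: write the body of a <http_code, raw_response_body>
// pair to `path` as raw bytes.
// Returns false on a non-200 status (assumed success criterion) or on any I/O error.
static bool write_response_to_file(const std::pair<long, std::vector<char>> & res, const std::string & path) {
    if (res.first != 200) {
        return false;
    }
    // binary mode so the bytes are written unchanged on every platform
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out.write(res.second.data(), static_cast<std::streamsize>(res.second.size()));
    out.flush();
    return out.good();
}
```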
..and change cache file name as per suggestion.
Sigh, not sure where
tests/CMakeLists.txt
Outdated
if (LLAMA_CURL)
    llama_build_and_test(test-tokenizers-remote.cpp WORKING_DIRECTORY ${CMAKE_RUNTIME_OUTPUT_DIRECTORY})
endif()
Can you run it with an argument giving the destination directory, like the rest of the tests (i.e. ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models)? This should work around the Release prefix problem.
That's not the issue, I need it to find the test-tokenizer-0 executable.
Oh, I didn't realize this is making a system call to run another executable. Hm.. I think we need to figure out something better. A lot of this code would be trivial for a Python script - maybe we should figure out some way to run Python scripts as part of the CI.
It would have taken me less time to do it in Python for sure, but it has been educational not to so far. :)
Adds test-tokenizers-remote, which downloads vocab files from HF ggml-org/vocabs and runs test-tokenizer-0 on the files.
This incidentally sent me down the rabbit hole of trying to find out what was wrong with the RWKV tokenizer; turns out it was the HF tokenizer all along! :P