-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vertexai(test): add corpora_test in tokenizer module #10784
Conversation
For the failing checks:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Some initial comments here; I'll do a more deep read-through after these are addressed
{Pattern: regexp.MustCompile(`Abkhaz\-Cyrillic\+Abkh`), Encoding: charmap.Windows1251}, | ||
} | ||
|
||
skip := map[string]bool{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we skip these specific files? You have a comment saying "known to cause issues or are not relevant for the test.", but it's not clear what it means. In general, it's good practice to skip as little as possible. Is the problem with these files that we don't have encodings for them? I see many are UTF8, which is surprising; do you see difference in the results from the remote tokenizer for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The skip is from https://github.com/nltk/nltk/blob/f6567388b4399000b9aa2a6b0db713bff3fe332a/nltk/corpus/reader/udhr.py#L14
I have tried my best to reduce the skip list but there are still some cases cannot be supported in Go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Can you add a comment saying this? That the skip list comes from the NLTK source code which says these are unsupported encodings, or in general encodings Go doesn't support
defer client.Close() | ||
model := client.GenerativeModel(defaultModel) | ||
|
||
t.Run("RemoteAndLocalCountTokensTest", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this TestCount...
function only runs this one test, you don't need t.Run(...
(this is needed when you have sub-tests).
You can just place the test logic directly in the TestCount...
function itself
Like here, for example: https://github.com/googleapis/google-cloud-go/blob/main/vertexai/genai/client_test.go#L357
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is LGTM now, thanks for your patience! Just a couple of small comment tweaks requested
You'll also have to see why the kokoro tests are failing
Content []byte | ||
} | ||
|
||
// corporaGenerator is a generator function that returns a channel to iterate over files in the zip archive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update this comment now that there're no channels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
internal/generated/snippets/go.mod
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this: I don't think your PR should update these files in internal/generated
-- this also triggers review requests by yoshi admins which shouldn't be required for a vertexai-only PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@eliben Thanks for reviewing. The testing workflows are green now. |
No description provided.