Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vertexai(test): add corpora_test in tokenizer module #10784

Merged
merged 10 commits into from
Sep 6, 2024

Conversation

happy-qiao
Copy link
Contributor

No description provided.

@happy-qiao happy-qiao requested review from a team as code owners August 28, 2024 03:24
@codyoss codyoss changed the title test: add corpora_test in tokenizer module vertexai(test): add corpora_test in tokenizer module Aug 28, 2024
@eliben
Copy link
Contributor

eliben commented Aug 28, 2024

For the failing checks:

  1. vet - you can run it from the .github/workflows directory locally on your machine to weed out all the failures before sending an updated PR
  2. kokoro runs tests and reports failures with a link (e.g. https://source.cloud.google.com/results/invocations/a730c99f-eda9-4a84-84f3-d0b4467b1bf5 here)

Copy link
Contributor

@eliben eliben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Some initial comments here; I'll do a more deep read-through after these are addressed

vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
{Pattern: regexp.MustCompile(`Abkhaz\-Cyrillic\+Abkh`), Encoding: charmap.Windows1251},
}

skip := map[string]bool{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we skip these specific files? You have a comment saying "known to cause issues or are not relevant for the test.", but it's not clear what it means. In general, it's good practice to skip as little as possible. Is the problem with these files that we don't have encodings for them? I see many are UTF8, which is surprising; do you see difference in the results from the remote tokenizer for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The skip is from https://github.com/nltk/nltk/blob/f6567388b4399000b9aa2a6b0db713bff3fe332a/nltk/corpus/reader/udhr.py#L14

I have tried my best to reduce the skip list but there are still some cases cannot be supported in Go

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Can you add a comment saying this? That the skip list comes from the NLTK source code which says these are unsupported encodings, or in general encodings Go doesn't support

vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
defer client.Close()
model := client.GenerativeModel(defaultModel)

t.Run("RemoteAndLocalCountTokensTest", func(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this TestCount... function only runs this one test, you don't need t.Run(... (this is needed when you have sub-tests).

You can just place the test logic directly in the TestCount... function itself
Like here, for example: https://github.com/googleapis/google-cloud-go/blob/main/vertexai/genai/client_test.go#L357

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

vertexai/genai/tokenizer/corpora_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@eliben eliben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is LGTM now, thanks for your patience! Just a couple of small comment tweaks requested

You'll also have to see why the kokoro tests are failing

Content []byte
}

// corporaGenerator is a generator function that returns a channel to iterate over files in the zip archive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update this comment now that there're no channels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this: I don't think your PR should update these files in internal/generated -- this also triggers review requests by yoshi admins which shouldn't be required for a vertexai-only PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@happy-qiao
Copy link
Contributor Author

happy-qiao commented Sep 6, 2024

@eliben Thanks for reviewing. The testing workflows are green now.

@eliben eliben added the automerge Merge the pull request once unit tests and other checks pass. label Sep 6, 2024
@gcf-merge-on-green gcf-merge-on-green bot merged commit ce82b22 into googleapis:main Sep 6, 2024
10 checks passed
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants