[spark] Integrate HuggingFace tokenizer #2311
Codecov Report
Base: 72.08% // Head: 74.14% // Increases project coverage by +2.05%
@@ Coverage Diff @@
## master #2311 +/- ##
============================================
+ Coverage 72.08% 74.14% +2.05%
- Complexity 5126 6788 +1662
============================================
Files 473 667 +194
Lines 21970 29566 +7596
Branches 2351 3057 +706
============================================
+ Hits 15838 21921 +6083
- Misses 4925 6160 +1235
- Partials 1207 1485 +278
Please also add encode/decode methods. These two are commonly used for text vectorization and its reverse.
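To make the request concrete, here is a minimal sketch of what an encode/decode pair does, assuming a whitespace split and a fixed vocabulary. `ToyTokenizer`, its vocabulary, and the `[UNK]` fallback are hypothetical names for illustration only; this is not the DJL or HuggingFace API, and the real tokenizer in this PR may expose these methods differently.

```scala
// Hypothetical toy tokenizer for illustration; not the DJL or HuggingFace API.
final class ToyTokenizer(vocab: Seq[String], unknown: String = "[UNK]") {
  // id -> token lookup; the unknown token always takes id 0.
  private val idToToken: Vector[String] = (unknown +: vocab).distinct.toVector
  // token -> id lookup built from the same list.
  private val tokenToId: Map[String, Long] =
    idToToken.zipWithIndex.map { case (token, i) => token -> i.toLong }.toMap

  /** Text vectorization: split on whitespace and map each token to its id. */
  def encode(text: String): Array[Long] =
    text.trim.split("\\s+").filter(_.nonEmpty).map(t => tokenToId.getOrElse(t, 0L))

  /** Reverse mapping: turn ids back into tokens and join them with spaces. */
  def decode(ids: Array[Long]): String =
    ids.map(id => idToToken.lift(id.toInt).getOrElse(unknown)).mkString(" ")
}

object ToyTokenizerExample {
  def main(args: Array[String]): Unit = {
    val tokenizer = new ToyTokenizer(Seq("hello", "spark", "world"))
    val ids = tokenizer.encode("hello spark")
    println(ids.mkString(","))      // prints "1,2"
    println(tokenizer.decode(ids))  // prints "hello spark"
  }
}
```

Returning ids as `Array[Long]` is only one possible choice here; the point is that `decode(encode(text))` round-trips for in-vocabulary text, which is what callers typically rely on.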
 * [[TokenizerFactory]] contains tokenizer creation mechanism on top of different platforms.
 * System will choose appropriate Factory based on the supported audio type.
 */
class TokenizerFactory {
Should we add TokenizerFactory in our api package?
Maybe leave it here for now and refactor later
Changed.
Description
Brief description of what this PR is about
Add HuggingFace tokenizer in Spark extension