[spark] Integrate HuggingFace tokenizer #2311

Merged 1 commit on Feb 14, 2023

@xyang16 commented on Jan 9, 2023

Description

Brief description of what this PR is about

Adds a HuggingFace tokenizer to the Spark extension.

@xyang16 changed the title from "Add HuggingFace tokenizer in Spark extension" to "[WIP] Add HuggingFace tokenizer in Spark extension" on Jan 9, 2023
@xyang16 requested a review from lanking520 on January 9, 2023 at 18:35
codecov-commenter commented Jan 9, 2023

Codecov Report

Base: 72.08% // Head: 74.14% // Increases project coverage by +2.05% 🎉

Coverage data is based on head (339761c) compared to base (bb5073f).
Patch coverage: 73.74% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2311      +/-   ##
============================================
+ Coverage     72.08%   74.14%   +2.05%     
- Complexity     5126     6788    +1662     
============================================
  Files           473      667     +194     
  Lines         21970    29566    +7596     
  Branches       2351     3057     +706     
============================================
+ Hits          15838    21921    +6083     
- Misses         4925     6160    +1235     
- Partials       1207     1485     +278     
| Impacted Files | Coverage Δ |
|---|---|
| api/src/main/java/ai/djl/modality/cv/Image.java | 69.23% <ø> (-4.11%) ⬇️ |
| ...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java | 76.00% <ø> (ø) |
| ...rc/main/java/ai/djl/modality/cv/output/Joints.java | 71.42% <ø> (ø) |
| .../main/java/ai/djl/modality/cv/output/Landmark.java | 100.00% <ø> (ø) |
| ...main/java/ai/djl/modality/cv/output/Rectangle.java | 72.41% <0.00%> (ø) |
| ...i/djl/modality/cv/translator/BigGANTranslator.java | 21.42% <0.00%> (-5.24%) ⬇️ |
| .../modality/cv/translator/ImageFeatureExtractor.java | 0.00% <0.00%> (ø) |
| .../ai/djl/modality/cv/translator/YoloTranslator.java | 27.77% <0.00%> (+18.95%) ⬆️ |
| ...ain/java/ai/djl/modality/cv/util/NDImageUtils.java | 67.10% <0.00%> (+7.89%) ⬆️ |
| api/src/main/java/ai/djl/modality/nlp/Decoder.java | 63.63% <ø> (ø) |

... and 633 more

@lanking520 left a comment

Please also add encode/decode methods. These two are commonly used for text vectorization and its reverse.
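To make the request concrete, here is a minimal, self-contained sketch of what an encode/decode pair does: encode turns text into numeric ids (vectorization) and decode maps ids back to text. `ToyTokenizer`, its whitespace splitting, and its tiny vocabulary are hypothetical and exist only for illustration; this is not the DJL or HuggingFace API, and a real tokenizer would use subword rules and a trained vocabulary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy whitespace tokenizer; hypothetical, not the DJL or HuggingFace API. */
final class ToyTokenizer {
    private static final String UNKNOWN = "[UNK]";

    private final Map<String, Long> tokenToId = new HashMap<>();
    private final List<String> idToToken = new ArrayList<>();

    ToyTokenizer(List<String> vocabulary) {
        addToken(UNKNOWN); // id 0 is reserved for out-of-vocabulary words
        for (String token : vocabulary) {
            addToken(token);
        }
    }

    private void addToken(String token) {
        // Ids are assigned in insertion order; duplicates are ignored.
        if (!tokenToId.containsKey(token)) {
            tokenToId.put(token, (long) idToToken.size());
            idToToken.add(token);
        }
    }

    /** Vectorization: lowercases, splits on whitespace, maps each word to its id. */
    long[] encode(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new long[0];
        }
        String[] words = trimmed.split("\\s+");
        long[] ids = new long[words.length];
        for (int i = 0; i < words.length; i++) {
            ids[i] = tokenToId.getOrDefault(words[i], 0L);
        }
        return ids;
    }

    /** Reverse: maps ids back to tokens and joins them with single spaces. */
    String decode(long[] ids) {
        List<String> words = new ArrayList<>();
        for (long id : ids) {
            boolean known = id >= 0 && id < idToToken.size();
            words.add(known ? idToToken.get((int) id) : UNKNOWN);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        ToyTokenizer tokenizer = new ToyTokenizer(List.of("hello", "spark", "world"));
        long[] ids = tokenizer.encode("Hello Spark world");
        System.out.println(Arrays.toString(ids));
        System.out.println(tokenizer.decode(ids));
    }
}
```

Note that the round trip in this sketch is lossy: casing and the original spacing are not restored, and unknown words come back as `[UNK]`.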

@xyang16 changed the title from "[WIP] Add HuggingFace tokenizer in Spark extension" to "Add HuggingFace tokenizer in Spark extension" on Jan 27, 2023
* [[TokenizerFactory]] contains tokenizer creation mechanism on top of different platforms.
* System will choose appropriate Factory based on the supported audio type.
*/
class TokenizerFactory {
Contributor

Should we add TokenizerFactory to our api package?

Contributor

Maybe leave it here for now and refactor later.

Contributor Author

Changed.

@xyang16 force-pushed the tokenizer branch 9 times, most recently from b0344f9 to da2c473, on February 13, 2023 at 20:04
@xyang16 changed the title from "Add HuggingFace tokenizer in Spark extension" to "[spark] Integrate HuggingFace tokenizer" on Feb 13, 2023
@xyang16 force-pushed the tokenizer branch 4 times, most recently from f400040 to d779b1e, on February 14, 2023 at 19:52
@xyang16 merged commit e53b161 into deepjavalibrary:master on Feb 14, 2023