[Java|Go] Check string is ascii before using meta string encoding #1619
Comments
Hi @chaokunyang, I'd like to contribute to Apache Fury. Could you please assign this issue to me?
Great! Thanks for your willingness to contribute to Fury.
## What does this PR do?

This PR introduces a validation method to ensure that all input strings to the `MetaString` encoder are ASCII.

## Related issues

- #1619

## Does this PR introduce any user-facing change?

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark

---------

Signed-off-by: Jason Mok <jjasonmok1@gmail.com>
    // org/apache/fury/meta/MetaStringEncoder.java
    public MetaString encode(String input) {
      if (input.isEmpty()) {
        return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, new byte[0]);
      }
      Encoding encoding = computeEncoding(input);
      return encode(input, encoding);
    }

Could we judge here whether the WDYT @chaokunyang
In addition, we also need to add unit tests to cover this issue.
@LiangliangSui Thanks for bringing that up. Can I submit another PR to correct that and add unit tests? I also think it would be optimal to do the ASCII check early and directly return a UTF-8 encoded
Sure, that is great!
## What does this PR do?

This PR enhances the current ASCII check (before meta string encoding) I implemented in #1620 to return a UTF-8 encoded `MetaString` early if the input is non-ASCII. This improves efficiency and saves time on `computeEncoding` and `encode`. Unit tests are also added.

## Related issues

#1619

## Does this PR introduce any user-facing change?

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark

---------

Signed-off-by: Jason Mok <jjasonmok1@gmail.com>
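The early return described in this PR could look roughly like the sketch below. It is a self-contained illustration, not Fury's actual code: `EarlyUtf8Sketch`, `Encoded`, the `COMPACT` encoding, and `isAscii` are hypothetical stand-ins, and the compact path is only a placeholder for the real `computeEncoding` and `encode` steps.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the encoder; names here are assumptions, not Fury's API.
final class EarlyUtf8Sketch {
  // Hypothetical encodings: UTF_8 as the fallback, COMPACT for the 5/6-bit schemes.
  enum Encoding { UTF_8, COMPACT }

  // Minimal result holder, standing in for MetaString.
  static final class Encoded {
    final Encoding encoding;
    final byte[] bytes;

    Encoded(Encoding encoding, byte[] bytes) {
      this.encoding = encoding;
      this.bytes = bytes;
    }
  }

  // A string is ASCII when every UTF-16 code unit is below 0x80.
  static boolean isAscii(String input) {
    for (int i = 0; i < input.length(); i++) {
      if (input.charAt(i) > 0x7F) {
        return false;
      }
    }
    return true;
  }

  static Encoded encode(String input) {
    // Early return: empty or non-ASCII input skips encoding selection entirely.
    if (input.isEmpty() || !isAscii(input)) {
      return new Encoded(Encoding.UTF_8, input.getBytes(StandardCharsets.UTF_8));
    }
    // Placeholder for the compact path (computeEncoding + encode in the real encoder).
    return new Encoded(Encoding.COMPACT, input.getBytes(StandardCharsets.US_ASCII));
  }
}
```

With this shape, a non-ASCII input never reaches the encoding-selection logic, which is where the time saving mentioned above would come from.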
Is your feature request related to a problem? Please describe.
In #1514 and #1566, we compress every char using 5 or 6 bits, but we never check that the string is ASCII. In UTF-8, some bytes fall within the ASCII range while others do not. Could we end up treating a UTF-8 string as a meta string by accident?
Describe the solution you'd like
Check string is ascii before using meta string encoding
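A minimal sketch of such a check, assuming a plain scan over the string's chars is acceptable. `AsciiCheck` and `isAscii` are hypothetical names for illustration and are not taken from the Fury codebase.

```java
// Hypothetical helper; the class and method names are assumptions for illustration.
final class AsciiCheck {
  // Returns true only if every char is in the 7-bit ASCII range (0x00 to 0x7F).
  static boolean isAscii(String input) {
    for (int i = 0; i < input.length(); i++) {
      if (input.charAt(i) > 0x7F) {
        return false; // a non-ASCII char must not go through meta string encoding
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isAscii("org.apache.fury")); // ASCII-only name
    System.out.println(isAscii("名前"));             // contains non-ASCII chars
  }
}
```

The scan stops at the first non-ASCII char, so the cost for typical ASCII class and field names is a single pass over the string.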
Additional context