Correct wcwidth computation for pretty outputs. #3257
Great, we always wanted it!
dbms/src/Common/UTF8Helpers.cpp
Outdated

size_t computeWidth(const UInt8 * data, size_t size)
{
    std::wstring wstr = std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t>{}.from_bytes(reinterpret_cast<const char *>(data), reinterpret_cast<const char *>(data + size));
en.cppreference.com states that codecvt_utf8 is deprecated in C++17. Why?
dbms/src/Common/UTF8Helpers.cpp
Outdated

size_t computeWidth(const UInt8 * data, size_t size)
{
    std::wstring wstr = std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t>{}.from_bytes(reinterpret_cast<const char *>(data), reinterpret_cast<const char *>(data + size));
I've read that it will throw on an invalid UTF-8 sequence:
https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
It's better to use the replacement character � (as your terminal does when it receives an invalid sequence).
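To illustrate the suggestion above, here is a minimal sketch of a decoder that never throws and substitutes U+FFFD for bytes it cannot decode. The function name decodeWithReplacement and the one-replacement-per-bad-byte policy are assumptions made for this example; this is not the code from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): decodes UTF-8 into code points and
// emits U+FFFD for every byte that does not belong to a valid sequence.
std::vector<char32_t> decodeWithReplacement(const uint8_t * data, size_t size)
{
    constexpr char32_t replacement = 0xFFFD;
    std::vector<char32_t> out;
    size_t i = 0;
    while (i < size)
    {
        uint8_t b = data[i];
        size_t len = 0;     // expected sequence length, 0 means invalid lead byte
        char32_t cp = 0;    // code point being assembled
        char32_t min = 0;   // smallest code point allowed for this length
        if (b < 0x80) { len = 1; cp = b; min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= size;
        for (size_t k = 1; ok && k < len; ++k)
        {
            uint8_t c = data[i + k];
            if ((c & 0xC0) != 0x80)
                ok = false;                 // not a continuation byte
            else
                cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates and values beyond U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok)
        {
            out.push_back(cp);
            i += len;
        }
        else
        {
            out.push_back(replacement);
            ++i;                            // resynchronize on the next byte
        }
    }
    return out;
}
```

With this policy a stray 0xFF between two ASCII letters yields three code points, the middle one being U+FFFD, so a later width computation can treat it like any other single-column character.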
dbms/src/Common/UTF8Helpers.h
Outdated
@@ -72,6 +72,8 @@ inline size_t countCodePoints(const UInt8 * data, size_t size)
    return res;
}

size_t computeWidth(const UInt8 * data, size_t size);
Missing comment.
dbms/src/Common/UTF8Helpers.cpp
Outdated
    auto w = widechar_wcwidth(c);
    if (w == -2)
        width += 1;
    else if (w > 0)
No cases for the other special values.
dbms/src/Common/UTF8Helpers.cpp
Outdated
for (wchar_t c : wstr)
{
    auto w = widechar_wcwidth(c);
    if (w == -2)
You can use enum values.
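A small sketch of what the two comments above ask for: named constants instead of the literal -2, and an explicit case for every special result. The enum below is a hypothetical stand-in defined only for this example; the real names and values should be taken from the vendored header. The column policy chosen for each case is also an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the library's named special results.
enum SpecialWidth : int
{
    width_nonprint = -1,
    width_combining = -2,
    width_ambiguous = -3,
    width_private_use = -4,
    width_unassigned = -5,
};

// Maps a raw wcwidth-style result to the number of terminal columns.
// The policy is illustrative: invisible characters take no columns,
// characters of unknown width are assumed to take one.
size_t columnsFor(int w)
{
    switch (w)
    {
        case width_nonprint:
        case width_combining:
            return 0;
        case width_ambiguous:
        case width_private_use:
        case width_unassigned:
            return 1;
        default:
            return w > 0 ? static_cast<size_t>(w) : 0;
    }
}
```

Listing every special value in the switch makes it obvious to a reviewer which cases were considered, which a bare comparison against -2 does not.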
};

/* An inclusive range of characters. */
struct widechar_range {
Great! This library is implemented exactly as I thought.
@@ -27,3 +27,4 @@ if (USE_INTERNAL_CONSISTENT_HASHING_LIBRARY)
    add_subdirectory (consistent-hashing)
endif ()
add_subdirectory (consistent-hashing-sumbur)
add_subdirectory (libwidechar_width)
Missing LICENCE and README in the library directory.
It must mention the exact source (URL, commit) of the library.
OK, but minor changes required.
Missing test. Test should have full-width characters (Chinese), zero-width characters,
@alexey-milovidov almost done. I can't find anything related to client testing. What should such a test look like?
Can you post a csv file with that letter? I'm not sure if I can generate it correctly.
@amosbird FYI, I hope that'll be helpful : )
@zhang2014 Thanks. Those
43d04e2 to b9f1d06
namespace UTF8
{

// based on https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Could you please copy the license from this page verbatim here in comment?
Done
int width = widechar_wcwidth(wc);
switch (width)
{
    case widechar_nonprint:
Add [[fallthrough]];
We have warnings in some compilers.
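For reference, a minimal C++17 sketch of the attribute in use. The Kind enum and the function are hypothetical and exist only to show where [[fallthrough]] goes; compilers with implicit-fallthrough warnings enabled typically complain when a case that contains statements runs into the next label without it.

```cpp
// Hypothetical classification used only for this illustration.
enum class Kind { NonPrint, Combining, Narrow, Wide };

// Returns the column count and counts how many non-printable
// characters were seen via special_seen.
int columns(Kind k, int & special_seen)
{
    switch (k)
    {
        case Kind::NonPrint:
            ++special_seen;      // extra work done only for this case
            [[fallthrough]];     // intentional: shares the zero-width result below
        case Kind::Combining:
            return 0;
        case Kind::Narrow:
            return 1;
        case Kind::Wide:
            return 2;
    }
    return 1; // not reached for valid Kind values
}
```

Adjacent labels with no statements between them do not need the attribute; it is only required where code executes before control reaches the next label.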
dbms/src/Common/UTF8Helpers.cpp
Outdated
// special treatment for '\t'
if (decoder.codepoint == '\t')
{
    do
Rewrite with a simple formula?
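One possible closed form for the tab handling, sketched under two assumptions that the thread does not confirm: tab stops sit every 8 columns, and `prefix` is the number of columns already occupied on the line before this string starts. The helper name widthAfterTab is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: given the width accumulated so far for this string
// and the columns already taken before it (prefix), returns the string's
// width after a tab advances to the next tab stop.
size_t widthAfterTab(size_t width, size_t prefix)
{
    constexpr size_t tab_size = 8;                          // assumed tab stop interval
    size_t column = prefix + width;                         // absolute column on the line
    size_t next_stop = (column / tab_size + 1) * tab_size;  // first stop strictly after column
    return next_stop - prefix;                              // back to a width relative to the string
}
```

This replaces a loop that increments the width until it hits a multiple of 8 with a single integer division, and it always advances by at least one column, matching how a terminal treats a tab that already sits on a tab stop.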
dbms/src/Common/UTF8Helpers.cpp
Outdated
        break;
    // continue if we meet other values here
    default:
        rollback++;
Should use prefix increment according to the style guide.
dbms/src/Common/UTF8Helpers.h
Outdated
@@ -72,6 +72,9 @@ inline size_t countCodePoints(const UInt8 * data, size_t size)
    return res;
}

/// returns UTF-8 wcswidth. Invalid sequence is treated as zero width character
The comment should describe the "prefix" parameter because its usage is non-obvious.
a63a87f to 9642496
And finally... look for the test
OK, updated 298's result
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
https://github.com/ridiculousfish/widecharwidth license is pretty clean.
As PrettyBlockOutputStream and VerticalRowOutputStream are consumed by humans, readability matters more than performance.
Before: [screenshot]
After: [screenshot]