Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct wcwidth computation for pretty outputs. #3257

Merged
merged 1 commit into from
Oct 2, 2018

Conversation

amosbird
Copy link
Collaborator

@amosbird amosbird commented Sep 29, 2018

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

https://github.com/ridiculousfish/widecharwidth license is pretty clean.

As PrettyBlockOutputStream and VerticalRowOutputStream are consumed by human, readability excels performance.

Before,

image

After,

image

@alexey-milovidov
Copy link
Member

Great, we always wanted it!


size_t computeWidth(const UInt8 * data, size_t size)
{
std::wstring wstr = std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t>{}.from_bytes(reinterpret_cast<const char *>(data), reinterpret_cast<const char *>(data + size));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

en.cppreference.com states that codecvt_utf8 is deprecated in C++17. Why?


size_t computeWidth(const UInt8 * data, size_t size)
{
std::wstring wstr = std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t>{}.from_bytes(reinterpret_cast<const char *>(data), reinterpret_cast<const char *>(data + size));
Copy link
Member

@alexey-milovidov alexey-milovidov Sep 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read that it will throw on invalid UTF-8 sequence:
https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
It's better to use replacement character �
(as your terminal does when invalid sequence is received)

@@ -72,6 +72,8 @@ inline size_t countCodePoints(const UInt8 * data, size_t size)
return res;
}

size_t computeWidth(const UInt8 * data, size_t size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing comment.

auto w = widechar_wcwidth(c);
if (w == -2)
width += 1;
else if (w > 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No cases for another special values.

for (wchar_t c : wstr)
{
auto w = widechar_wcwidth(c);
if (w == -2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use enum values.

};

/* An inclusive range of characters. */
struct widechar_range {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! This library is implemented exactly as I thought.

@@ -27,3 +27,4 @@ if (USE_INTERNAL_CONSISTENT_HASHING_LIBRARY)
add_subdirectory (consistent-hashing)
endif ()
add_subdirectory (consistent-hashing-sumbur)
add_subdirectory (libwidechar_width)
Copy link
Member

@alexey-milovidov alexey-milovidov Sep 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing LICENCE and README in the library directory.
It must mention the exact source (URL, commit) of the library.

Copy link
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but minor changed required.

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Sep 29, 2018

Missing test.

Test should have full-width characters (Chinese), zero-width characters,
two variants of Russian letter ё:
ё - d191 in UTF-8
ё - d0b5cc88 in UTF-8
as long as some invalid UTF-8 sequence,
and a string with zero bytes in the middle.

@amosbird
Copy link
Collaborator Author

@alexey-milovidov almost done. I don't find anything related to client testing. What should one look like?

two variants of Russian letter ё:

Can you post a csv file with that letter? I'm not sure if I can generate it correctly.

@zhang2014
Copy link
Contributor

@amosbird FYI, I hope that'll be helpful : )
image
image

@amosbird
Copy link
Collaborator Author

@zhang2014 Thanks. Those ёs are all of width 1 from what I can tell. I assume there is a variant that need special treatment.

@amosbird amosbird force-pushed the master branch 2 times, most recently from 43d04e2 to b9f1d06 Compare September 30, 2018 08:33
@amosbird
Copy link
Collaborator Author

hmm, simple copy paste of that ё just works. Looks a bit weird in terminal though
image

namespace UTF8
{

// based on https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please copy the license from this page verbatim here in comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

int width = widechar_wcwidth(wc);
switch (width)
{
case widechar_nonprint:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add [[fallthrough]];
We have warnings in some compilers.

// special treatment for '\t'
if (decoder.codepoint == '\t')
{
do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite with simple formula?

break;
// continue if we meet other values here
default:
rollback++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use prefix increment according to style guide.

@@ -72,6 +72,9 @@ inline size_t countCodePoints(const UInt8 * data, size_t size)
return res;
}

/// returns UTF-8 wcswidth. Invalid sequence is treated as zero width character
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment should decribe "prefix" parameter because its usage is non-obvious.

@amosbird amosbird force-pushed the master branch 2 times, most recently from a63a87f to 9642496 Compare October 1, 2018 03:35
@alexey-milovidov
Copy link
Member

And finally... look for the test
00298_enum_width_and_cast
https://travis-ci.org/yandex/ClickHouse/builds/435426678#L8971

@amosbird
Copy link
Collaborator Author

amosbird commented Oct 2, 2018

OK, updated 298's result

@alexey-milovidov alexey-milovidov merged commit 147a2a1 into ClickHouse:master Oct 2, 2018
alexey-milovidov added a commit that referenced this pull request Oct 9, 2018
alexey-milovidov added a commit that referenced this pull request Oct 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants