-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed trunc.char to better handle combining chars and full-width multibyte chars #5995
Conversation
The Appveyor failure is correct -- those characters may not parse correctly on Windows. I think we need to use Unicode escape equivalents instead. |
By this do you mean in "á" -> "\u0061\u0301" |
Yes, I see other similar examples in tests.Rraw you can search for |
Hmm still failing on windows. I'm not very familiar with how to get this working on windows, but take a look at the test failure output on Appveyor to see if it makes sense to you. |
It seems that the Appveyor test likes this format of test better: test(2248.01, capture.output(print(DT))[2], paste("1:", strrep(accented_a, 4L))) so instead of giving us "Test 2248.xx did not produce the correct result" the error becomes mismatched strings:
Maybe because with the Latin-1 charset some Windows machines have can't resolve the acute accent part of accented a: |
@MichaelChirico I'm kind of out of ideas with how to fix this Appveyor test... The problem to me looks like to be with the default encoding of the testing environment passing the strings to
it passes instead:
and causes issues with the truncation itself. I can't for sure tell if this is a case that options(datatable.prettyprint.char=8L)
accented_a = "\u0061\u0301"; Encoding(accented_a) = 'latin1'
DT = data.table(strrep(accented_a, 4L))
print(DT)
# output:
# V1
# <char>
# 1: aÌ<81>aÌ<81>aÌ... I'm not sure what to do with this PR moving forward, because I don't have a good idea of (or just haven't come across) a good way to test these changes in all the different default encodings. PS: Am I able to open different PRs with this one still being open? I want to explore some different issues as I'm preparing to apply for GSoC with |
thanks for your efforts & sorry the testing is turning out to be painful 😅 don't worry, I'll try and get this across the finish line when I get a chance. for now, yes, absolutely feel free to open other PRs, there's no need to go one PR at a time. |
Thanks so much! As I mentioned before, this was one of my first contributions to open source ever and you've made the experience super enjoyable and I really learned a lot! |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5995 +/- ##
=======================================
Coverage 97.53% 97.53%
=======================================
Files 80 80
Lines 14916 14919 +3
=======================================
+ Hits 14548 14551 +3
Misses 368 368 ☔ View full report in Codecov by Sentry. |
HTH! It's quite an interesting first PR & I hope you're not deterred! Encoding remains one of the messier problems to try and solve. You might like to read 1 2 3 for some more advanced reading, I still refer to those whenever I need to handle something particularly complicated like here. |
Oops, looks like I accidentally closed while messing around with git. Not sure how to recover this but I don't mind remaking the changes with another PR, I think it should be a cleaner PR as well, I wanted to take another crack at passing the Windows build test. Do you think that is okay @MichaelChirico? |
You can reopen the PR or open a new one (up to you). From a developing/educating point of view I would find it more useful to clean up this one since it also contains the history of things that did not work (lessons learned). |
Thanks! I'm not an expert with git so I'm going to avoid messing around with it anymore - so I think I'll open a new PR. Gives me a chance to work through the issue again to refresh my mind. |
Closes #5096
The old version of
trunc.char
indata.table.print
usednchar(x)
internally which works fine in most cases except when dealing with combining characters such as"á"
(The Latin "a" and accent) asnchar(x)
recognizes two distinct readable characters. In this case usingnchar(x, 'width')
works as intended.This PR changes
trunc.char
to recognize when the char is combining or full-width and indexes accordingly. Additionally,strtrim()
is used now in place ofsubstr()
because as mentioned in?substr
: