-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
truncate.char misbehaves for multibyte characters #5096
Comments
Hi! I'm interested in working on this issue - may I be assigned? Thanks! |
Hi @MichaelChirico, I was wondering if there is an MRE you had for this issue. I noticed that copying multibyte characters such as Chinese or Kanji works fine with print, eg: DT <- data.table('一', '二', '三', '四', '五') output: Which seems correct to me. Another case I tried was trying to print data tables with unicode: dt <- data.table('\x65') output: and #(x100 == U+0100, which is the latin capital A with Macron: Ā) output: which obviously doesn't look right, but I'm not 100% sure if this is the case that the issue is intended to fix. |
this constraint needs to be binding for the multi byte issue to bind: data.table/R/print.data.table.R Line 235 in 815e054
|
@MichaelChirico sorry for the pings, it's my first time contributing so I want to make sure my understanding is correct. I experimented with this but I can't seem to find many examples where type='width' will produce a better output than the default 'char'.
|
idx = which(nchar(x) > trunc.char) |
we incorrectly return the character vector with ". . ."
concatenated to the end unless we have datatable.prettyprint.char >= 2 * nchar(x)
.
However I found one niche case where type='width'
is better, when we use combining characters such as "á"
with the Latin "a"
and the combining acute accent.
Printing combining characters with character truncating.
options(datatable.prettyprint.char=1L)
DT = data.table("á")
print(DT)
## output with type='char'
# V1
# <char>
# 1: a...
## output with type='width'
# V1
# <char>
# 1: á
I believe this is because nchar()
with type='char'
counts "á"
as 2 distinct characters and 'width'
correctly approximates it to only have a width of 1.
Please advise if I am understanding the problem wrong, thanks!
Thanks for the investigation! I confirm your findings and agree that it may not be so simple as just changing to
PS, it seems |
I'm currently looking into some solutions for the failing combining characters case, and I came up with a relatively simple solution: char.trunc = function(x, trunc.char = getOption("datatable.prettyprint.char")) {
trunc.char = max(0L, suppressWarnings(as.integer(trunc.char[1L])), na.rm=TRUE)
if (!is.character(x) || trunc.char <= 0L) return(x)
widthSize = nchar(x, 'width')
charSize = nchar(x)
isFull = widthSize > charSize
idx = which(pmin(widthSize, charSize) > trunc.char)
x[idx] = paste0(strtrim(x[idx], as.integer(ifelse(isFull, trunc.char * 2, trunc.char))), "...")
x
} I ran this against some basic tests and I'm quite confident that this covers the combined, full-width, and half-width character cases correctly. However, there is an issue with printing strings that consist of a mixture of combined characters and other characters, because both If the proposed solution is viable, I was wondering if I can create a PR for this issue. As for the missing documentation, I noticed that as well when I was working on my repro, and I would love to help with adding that as well! I'm just not 100% sure if I should work on both issues in the same PR or not. |
Thanks for your thorough investigation! Please proceed to a PR.
Let's try and keep one issue per PR. The documentation issue is now #5994. Performance shouldn't be much of a concern, since |
This is a follow-up to the comments in #3414
Basically print.data.table internals currently use nchar() for printing in a way that won't work well with multibyte characters. it should be as simple as using nchar(., type='width') in the internal truncate.char and then writing tests.
The text was updated successfully, but these errors were encountered: