[SPARK-8951][SparkR] support Unicode characters in collect() #7494
Conversation
|
Jenkins, ok to test |
|
Test build #37723 has finished for PR 7494 at commit
|
|
I am not sure about |
|
I think readString() in deserialize.R should be updated accordingly. Could you try: |
|
@sun-rui It needs to call rawToChar(); otherwise R gives the error message below. |
|
Yeah, rawToChar() is needed. Does it work now, then? |
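For readers following along, the intermediate state being discussed is roughly the following (my reconstruction using the readInt()/readBin() helpers from deserialize.R, not a quoted diff):
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  # rawToChar() turns the raw bytes back into an R character string
  rawToChar(raw)
}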
|
Unfortunately not. |
|
Could you try adding a zero, as done previously in writeString()? For the Unicode strings in the test case, I am not sure whether we need to force their encoding to "UTF-8" before writing them to a JSON file. It seems not necessary. |
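For reference, a plausible approximation of writeString() in serialize.R at that time (not quoted from the repository); the point is that the declared length counts the trailing zero byte that writeBin() appends to a character value:
writeString <- function(con, value) {
  # nchar(value) + 1 accounts for the trailing zero byte added by writeBin()
  writeInt(con, as.integer(nchar(value) + 1))
  writeBin(value, con, endian = "big")
}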
|
Adding a zero doesn't solve the problem either; it seems to have the same effect as the version without it. I tested the code under "LC_ALL=C". |
|
Test build #38304 has finished for PR 7494 at commit
|
|
good job, jenkins. |
R/pkg/R/deserialize.R
Outdated
I suspect I made a mistake here. It seems that enc2native just changes the encoding mark, but does not convert the character from the current encoding to the native encoding (iconv() can do this). As described in https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Encodings-for-CHARSXPs, R can handle encodings other than that of the current locale, and most of the character manipulation functions now preserve UTF-8 encodings, so we can keep the string encoding as "UTF-8". So could you try removing this?
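To make the distinction concrete, here is a small illustration of the iconv() calls being referred to (illustrative values only, not part of the patch):
x <- "\uc548\ub155"                      # a Korean greeting; \u literals carry a UTF-8 mark
Encoding(x)                              # "UTF-8"
iconv(x, from = "UTF-8", to = "")        # convert to the native encoding ("" = current locale);
                                         # gives NA if the locale cannot represent the characters
iconv(x, from = "UTF-8", to = "UTF-8")   # keeps the string in UTF-8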
Yes, preserving UTF-8 encodings sounds much better.
Do you mean that enc2native should be removed, as below?
readString <- function(con) {
stringLen <- readInt(con)
raw <- readBin(con, raw(), stringLen, endian = "big")
string <- rawToChar(raw)
Encoding(string) <- "UTF-8"
string
}
But this causes an error when calling collect() after createDataFrame(), which converts a local R data frame to a Spark DataFrame.
5. Error: collect() support Unicode characters ---------------------------------
unused argument (connection = NULL)
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"),
warning = function(c) invokeRestart("muffleWarning"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: createDataFrame(sqlContext, rdf) at test_sparkSQL.R:65
5: parallelize(sc, data)
6: lapply(slices, writeObject, connection = NULL)
7: lapply(slices, writeObject, connection = NULL)
8: FUN(X[[i]], ...)
9: .handleSimpleError(function (e)
{
e$calls <- head(sys.calls()[-seq_len(frame + 7)], -2)
signalCondition(e)
}, "unused argument (connection = NULL)", quote(FUN(X[[i]], ...)))
Error: Test failures
I tried to find out the reason, and I guess the MSB of a string is set when I use Encoding(string) <- "UTF-8", but not otherwise.
I printed out serializedSlices in parallelize() (line 132 below). The result looks like this.
context.R
102 parallelize <- function(sc, coll, numSlices = 1) {
103 # TODO: bound/safeguard numSlices
104 # TODO: unit tests for if the split works for all primitives
105 # TODO: support matrix, data frame, etc
106 if ((!is.list(coll) && !is.vector(coll)) || is.data.frame(coll)) {
107 if (is.data.frame(coll)) {
108 message(paste("context.R: A data frame is parallelized by columns."))
109 } else {
110 if (is.matrix(coll)) {
111 message(paste("context.R: A matrix is parallelized by elements."))
112 } else {
113 message(paste("context.R: parallelize() currently only supports lists and vectors.",
114 "Calling as.list() to coerce coll into a list."))
115 }
116 }
117 coll <- as.list(coll)
118 }
119
120 print(coll)
121
122 if (numSlices > length(coll))
123 numSlices <- length(coll)
124
125 sliceLen <- ceiling(length(coll) / numSlices)
126 slices <- split(coll, rep(1:(numSlices + 1), each = sliceLen)[1:length(coll)])
127
128 # Serialize each slice: obtain a list of raws, or a list of lists (slices) of
129 # 2-tuples of raws
130 serializedSlices <- lapply(slices, serialize, connection = NULL)
131
132 print(serializedSlices)
133 jrdd <- callJStatic("org.apache.spark.api.r.RRDD",
134 "createRDDFromArray", sc, serializedSlices)
135
136 RDD(jrdd, "byte")
137 }
Case 1. with Encoding(string) <- "UTF-8"
[1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
[26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
[51] 00 00 00 01 00 00 80 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
[76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
[151] 10 00 00 00 01 00 00 80 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 80 09 00 00 00 09 58 69 6e 20 63 68 c3
[226] a0 6f
Case 2. without Encoding(string) <- "UTF-8"
[1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 13 00 00 00 04 00 00 00
[26] 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00 07 a2 00 00 00 10
[51] 00 00 00 01 00 00 00 09 00 00 00 0f ec 95 88 eb 85 95 ed 95 98 ec 84 b8 ec
[76] 9a 94 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 3e 00 00 00 00 00
[101] 00 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 06 e6 82 a8 e5 a5 bd 00 00
[126] 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 40 33 00 00 00 00 00 00 00 00 00
[151] 10 00 00 00 01 00 00 00 09 00 00 00 0f e3 81 93 e3 82 93 e3 81 ab e3 81 a1
[176] e3 81 af 00 00 00 13 00 00 00 02 00 00 00 0e 00 00 00 01 7f f0 00 00 00 00
[201] 07 a2 00 00 00 10 00 00 00 01 00 00 00 09 00 00 00 09 58 69 6e 20 63 68 c3
[226] a0 6f
You can see that positions [51], [101], [151], and [201] differ. With Encoding() set, there is a leading 80 before the 09, which, I guess, causes the error.
I think this is the encoding indication bit in R, according to the link you gave.
|
@sun-rui @CHOIJAEHONG1 I took a look at this and it looks like the right way to do this would be to use |
|
@shivaram |
|
I'm not sure why we would expect the tests to pass with |
|
Firstly, I am not able to enter multibyte character sequences in the R shell under |
|
I see - so the problem here is how to write a unit test that uses UTF-8 and works with |
|
@shivaram The testcase passed with context.R deserializer.R testcase |
|
Can we still use the |
|
Okay, using Also, I think introducing deserialize.R context.R |
|
@CHOIJAEHONG1 This diff is looking good. Can you update the PR with this? BTW, why do we need to clear the encoding bit in |
|
The testcase fails in For instance, But, After removing the encoding bit by The following log message is from I test how R strings work with different With with with with |
|
Yeah I think the right thing to do is to make sure the string handling at line 80 in 738f353 uses iconv there as well.
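A sketch of the readString() variant this suggests, converting the received UTF-8 bytes to the native encoding with iconv() (my reading of the comment, not the code that was eventually merged):
readString <- function(con) {
  stringLen <- readInt(con)
  raw <- readBin(con, raw(), stringLen, endian = "big")
  string <- rawToChar(raw)
  # "" means the current native encoding; sub = "byte" keeps unmappable bytes visible
  iconv(string, from = "UTF-8", to = "", sub = "byte")
}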
|
|
@CHOIJAEHONG1, @shivaram:
For example, x <- "\uac00": I suspect this is a bug in writeBin, because it is expected that the internal representation of the string is written, instead of its display representation. We can fix this issue by changing: readString() should be changed accordingly, as there is no trailing zero.
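A small experiment that shows the behavior described above ("\uac00" is the Korean syllable "ga"; its internal UTF-8 bytes are ea b0 80):
x <- "\uac00"
Encoding(x)                          # "UTF-8"
charToRaw(x)                         # ea b0 80 -- the internal representation
# With the default useBytes = FALSE the string may be translated to the current
# locale's encoding before being written; useBytes = TRUE writes the bytes as-is.
writeBin(x, raw(), useBytes = TRUE)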
|
|
@CHOIJAEHONG1, any update on this? |
|
Sorry for being late. I got this error message in The output. |
|
I tried calling writeLines() (https://stat.ethz.ch/R-manual/R-devel/library/base/html/writeLines.html) and it seems to work. @shivaram |
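A hedged sketch of the kind of change this describes -- writing the UTF-8 bytes with writeLines(..., useBytes = TRUE) so the current locale cannot re-encode them on the way out (the helper name and details are my assumption, not the exact diff):
writeUtf8String <- function(con, value) {  # hypothetical helper name
  utfVal <- enc2utf8(value)                # make sure the bytes are UTF-8
  # useBytes = TRUE prevents translation into the current locale on output
  writeLines(utfVal, con, sep = "", useBytes = TRUE)
}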
|
Timeout occurred while fetching from the origin. |
|
Jenkins retest this please |
|
Test build #41629 has finished for PR 7494 at commit
|
Still a little bit confused about the behavior when handling Unicode strings in a non-UTF-8 locale. Why is there no need to markUtf8 these Unicode strings?
I guess this is related to the R interpreter. I made an example to see how the R interpreter handles Unicode strings, with respect to the UTF-8 indicator, under different locales.
You can see that R automatically adds 80 (presumably the UTF-8 indicator) before 09 with the UTF-8 locale. I think this example explains why the testcase passes under the UTF-8 locale without using markUtf8.
test.R
a<-"가"
Encoding(a)
print(serialize(a, connection=NULL))
With the UTF-8 locale, the output is:
$ r -f test.R
> a<-"가"
> Encoding(a)
[1] "UTF-8"
> print(serialize(a, connection=NULL))
[1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 80
[26] 09 00 00 00 03 ea b0 80
With the C (ASCII) locale, the output is:
$ r -f test.R
> a<-"가"
> Encoding(a)
[1] "unknown"
> print(serialize(a, connection=NULL))
[1] 58 0a 00 00 00 02 00 03 02 00 00 02 03 00 00 00 00 10 00 00 00 01 00 00 00
[26] 09 00 00 00 03 ea b0 80
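For context, a markUtf8 helper of the kind discussed here can be as small as the following (a sketch of the idea, not necessarily the exact code added by this PR); it only declares the encoding of a string that already holds UTF-8 bytes, without converting anything:
markUtf8 <- function(s) {
  Encoding(s) <- "UTF-8"  # declare, not convert, the encoding
  s
}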
|
@CHOIJAEHONG1, basically LGTM. Some minor comments. |
|
Thanks @CHOIJAEHONG1 and @sun-rui -- I just want to test this / go through this a bit carefully once more, as it's a pretty fundamental change in how we handle strings. Will take another look at this soon. |
|
@CHOIJAEHONG1 Sorry for the delay. I finally got a chance to try this out on my machine, and it seems to work fine with UTF-8 strings under both locale settings. I'm going to merge this to the master branch, and as this is early in the 1.6 window we also get a chance for this to get tested thoroughly. |
Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
I changed SerDe.scala so that Spark supports Unicode characters when writing a string to R.
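As a minimal illustration of the scenario (assuming a SparkR 1.x session with sc and sqlContext available; the column name is made up), the fix is about making this round-trip return the original UTF-8 strings:
rdf <- data.frame(greeting = c("안녕하세요", "您好", "こんにちは"),
                  stringsAsFactors = FALSE)
df <- createDataFrame(sqlContext, rdf)
collect(df)   # previously garbled or errored on the CJK text; should now match rdf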