Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Change string comparing from UTF16 to UTF8 #3611

Merged
merged 12 commits into from
Aug 28, 2024
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import io.delta.kernel.internal.util.Utils;
import io.delta.kernel.types.*;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
Expand All @@ -34,7 +35,19 @@
class DefaultExpressionUtils {

static final Comparator<BigDecimal> BIGDECIMAL_COMPARATOR = Comparator.naturalOrder();
static final Comparator<String> STRING_COMPARATOR = Comparator.naturalOrder();
static final Comparator<String> STRING_COMPARATOR =
(leftOp, rightOp) -> {
byte[] leftBytes = leftOp.getBytes(StandardCharsets.UTF_8);
byte[] rightBytes = rightOp.getBytes(StandardCharsets.UTF_8);
int i = 0;
while (i < leftBytes.length && i < rightBytes.length) {
if (leftBytes[i] != rightBytes[i]) {
return Byte.toUnsignedInt(leftBytes[i]) - Byte.toUnsignedInt(rightBytes[i]);
}
i++;
}
return Integer.compare(leftBytes.length, rightBytes.length);
};
static final Comparator<byte[]> BINARY_COMPARTOR =
(leftOp, rightOp) -> {
int i = 0;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,19 +17,24 @@ package io.delta.kernel.defaults

import io.delta.golden.GoldenTableUtils.goldenTablePath
import io.delta.kernel.exceptions.{InvalidTableException, KernelException, TableNotFoundException}
import io.delta.kernel.expressions.{Literal, ScalarExpression}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated changes in this file?

import io.delta.kernel.defaults.utils.{TestRow, TestUtils}
import io.delta.kernel.internal.fs.Path
import io.delta.kernel.internal.util.InternalUtils.daysSinceEpoch
import io.delta.kernel.internal.util.{DateTimeConstants, FileNames}
import io.delta.kernel.types.{LongType, StructType}
import io.delta.kernel.types.{BooleanType, LongType, StringType, StructType}
import io.delta.kernel.Table
import io.delta.kernel.data.ColumnVector
import io.delta.kernel.defaults.internal.data.DefaultColumnarBatch
import io.delta.kernel.defaults.internal.expressions.DefaultExpressionEvaluator
import org.apache.hadoop.shaded.org.apache.commons.io.FileUtils
import org.apache.spark.sql.functions.col
import org.scalatest.funsuite.AnyFunSuite

import java.io.File
import java.math.BigDecimal
import java.sql.Date
import java.util.Optional
import scala.collection.JavaConverters._

class DeltaTableReadsSuite extends AnyFunSuite with TestUtils {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -607,6 +607,44 @@ class DefaultExpressionEvaluatorSuite extends AnyFunSuite with ExpressionSuiteBa
),
(ofDate(-12123), ofDate(123123), ofDate(-12123), ofNull(DateType.DATE)),
(ofString("apples"), ofString("oranges"), ofString("apples"), ofNull(StringType.STRING)),
(
ofString("apples"),
ofString("oranges"),
ofString("apples"),
ofNull(StringType.STRING)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same test case as the one above?

(
ofString("abc"),
ofString("abcd"),
ofString("abc"),
ofNull(StringType.STRING)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(
ofString("abc"),
ofString("abcd"),
ofString("abc"),
ofNull(StringType.STRING)
),
(ofString("abc"), ofString("abcd"), ofString("abc"), ofNull(STRING)),

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for other test cases.

(
ofString("abc"),
ofString("abd"),
ofString("abc"),
ofNull(StringType.STRING)
),
(
// scalastyle:off nonascii
ofString("A"),
ofString("\u0100"),
ofString("A"),
ofNull(StringType.STRING)
),
(
ofString("\00BB"),
ofString("\u00EE"),
ofString("\00BB"),
ofNull(StringType.STRING)
),
(
ofString("\uFFFD"),
ofString("\uD83C\uDF3C"),
ofString("\uFFFD"),
ofNull(StringType.STRING)
// scalastyle:on nonascii
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there more tests we could copy for Spark or Delta-Spark around other surrogate pairs and unsigned char comparison?

),
(
ofBinary("apples".getBytes()),
ofBinary("oranges".getBytes()),
Expand Down
Loading