Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Change string comparing from UTF16 to UTF8 #3611

Merged
merged 12 commits into from
Aug 28, 2024

Conversation

ilicmarkodb
Copy link
Contributor

@ilicmarkodb ilicmarkodb commented Aug 27, 2024

Description

Changed string comparing from UTF16 to UTF8. This fixes comparison issues around the characters with surrogate pairs.

How was this patch tested?

Tests added to DefaultExpressionEvaluatorSuite.scala

@ilicmarkodb ilicmarkodb changed the title Fix string comparator Changed string comparing from UTF16 to UTF8. Aug 27, 2024
@ilicmarkodb ilicmarkodb changed the title Changed string comparing from UTF16 to UTF8. Changed string comparing from UTF16 to UTF8 Aug 27, 2024
@ilicmarkodb ilicmarkodb changed the title Changed string comparing from UTF16 to UTF8 [KERNEL] Changed string comparing from UTF16 to UTF8 Aug 27, 2024
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this and fixing it.

@@ -17,19 +17,24 @@ package io.delta.kernel.defaults

import io.delta.golden.GoldenTableUtils.goldenTablePath
import io.delta.kernel.exceptions.{InvalidTableException, KernelException, TableNotFoundException}
import io.delta.kernel.expressions.{Literal, ScalarExpression}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated changes in this file?

Comment on lines 610 to 615
(
ofString("apples"),
ofString("oranges"),
ofString("apples"),
ofNull(StringType.STRING)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same test case as the one above?

Comment on lines 616 to 621
(
ofString("abc"),
ofString("abcd"),
ofString("abc"),
ofNull(StringType.STRING)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(
ofString("abc"),
ofString("abcd"),
ofString("abc"),
ofNull(StringType.STRING)
),
(ofString("abc"), ofString("abcd"), ofString("abc"), ofNull(STRING)),

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for other test cases.

ofString("\uD83C\uDF3C"),
ofString("\uFFFD"),
ofNull(StringType.STRING)
// scalastyle:on nonascii
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there more tests we could copy for Spark or Delta-Spark around other surrogate pairs and unsigned char comparison?

@vkorukanti vkorukanti changed the title [KERNEL] Changed string comparing from UTF16 to UTF8 [Kernel] Change string comparing from UTF16 to UTF8 Aug 28, 2024
@vkorukanti vkorukanti merged commit f468733 into delta-io:master Aug 28, 2024
16 checks passed
vkorukanti pushed a commit to vkorukanti/delta that referenced this pull request Aug 30, 2024
Changed string comparing from UTF16 to UTF8. This fixes comparison
issues around the characters with surrogate pairs.

Tests added to `DefaultExpressionEvaluatorSuite.scala`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants