
Conversation

@msmygit
Member

@msmygit msmygit commented Oct 21, 2025

Fixes #508

@msmygit msmygit self-assigned this Oct 21, 2025
@msmygit msmygit marked this pull request as ready for review October 28, 2025 01:22
@msmygit msmygit requested a review from absurdfarce December 23, 2025 22:36
/**
* Test for issue #508: JSON connector should preserve unicode escape sequences.
*/
class JsonNodeToStringCodecUnicodeTest {

So rather than creating a new test class for this, we should be able to add this to JsonNodeToStringCodecTest. We already have a should_convert_from_valid_external method which verifies that the codec handles legitimate incoming JsonNodes... and IIUC this is just another case of that. Something like the following should do it:

      String unicodeEscapeString = "\\u001a\\u001aL\\\\";
      String mixedContentString = "Text with \\\\u001a unicode \\\\u001aL\\\\\\\\ escapes";
      assertThat(codec)
              .convertsFromExternal(nodeFactory.textNode(unicodeEscapeString))
              .toInternal(unicodeEscapeString)
              .convertsFromExternal(nodeFactory.textNode(mixedContentString))
              .toInternal(mixedContentString);

Thing is... if I add that test to 1.x and run it (with no additional modifications, specifically without the changes to JsonNodeToStringCodec that are also included in this PR), the test already passes. So are we sure there isn't something else going on here?
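A side note on why the suggested test might pass unmodified: in Java source, "\\u001a" denotes a literal backslash followed by the characters u001a, so the test strings never contain the actual U+001A control character, and a codec that wraps and unwraps the string verbatim round-trips them trivially. A minimal stdlib-only sketch (an illustration of Java string-literal semantics, not dsbulk code):

```java
public class EscapeLiteralDemo {
    public static void main(String[] args) {
        // Same literal as in the suggested test: backslash + "u001a",
        // twice, then "L" and two backslashes -- 15 characters total.
        // No JSON-style unescaping has happened; these are plain chars.
        String unicodeEscapeString = "\\u001a\\u001aL\\\\";
        System.out.println(unicodeEscapeString);          // \u001a\u001aL\\
        System.out.println(unicodeEscapeString.length()); // 15
        // It contains no actual U+001A control character at all.
        System.out.println(unicodeEscapeString.indexOf('\u001a')); // -1
    }
}
```

Under that reading, any pass-through string codec would satisfy the assertions with or without unicode-escape handling, which would explain the test passing on unmodified 1.x.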

@absurdfarce
Collaborator

Hey @msmygit, I built a distribution from the tag of 1.11.0 + your change and I'm still seeing bogus results:

CREATE KEYSPACE test508 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE test508.codec_test (
    i text PRIMARY KEY,
    j text
);
$ more ../508.json 
[{"i":"json","j":"NO\u001a\u001aL\\"}]
$ more ../508.csv 
i,j
csv,"NO\u001a\u001aL\\"
$ bin/dsbulk load -k test508 -t codec_test -c json -url ../508.json --dsbulk.connector.json.mode SINGLE_DOCUMENT
$ bin/dsbulk load -k test508 -t codec_test -c csv -url ../508.csv
[cqlsh 6.2.0 | Cassandra 5.0.4 | CQL spec 3.4.7 | Native protocol v5]
Use HELP for help.
cqlsh> select * from test508.codec_test ;

 i    | j
------+------------------
  csv | NO\u001a\u001aL\
 json | 	NO\x1a\x1aL\

(2 rows)

Did I muck something up in the above?
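One plausible reading of the divergence above (a sketch of generic JSON vs. CSV escape semantics, not of dsbulk internals): a JSON parser decodes \u001a into the actual U+001A control character (which cqlsh displays as \x1a) before any connector or codec sees the value, while in the CSV field the six-character \u001a sequences are not a recognized escape and pass through verbatim, so the two rows genuinely store different strings:

```java
public class JsonVsCsvValueDemo {
    public static void main(String[] args) {
        // What a JSON parser produces for the field "NO\u001a\u001aL\\":
        // escapes decoded at parse time -- two U+001A control characters
        // and one trailing backslash, 6 characters in all.
        String jsonDecoded = "NO\u001a\u001aL\\";
        // What the CSV side would store: literal backslash-u-0-0-1-a
        // sequences survive, 16 characters in all.
        String csvVerbatim = "NO\\u001a\\u001aL\\";
        System.out.println(jsonDecoded.length());            // 6
        System.out.println(csvVerbatim.length());            // 16
        System.out.println(jsonDecoded.equals(csvVerbatim)); // false
    }
}
```

If that reading holds, the two loads differing in cqlsh is the expected behavior of the input formats rather than a codec bug, though it doesn't by itself account for the stray leading character on the json row.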

@absurdfarce

It looks like I did miss the --dsbulk.codec.binary HEX param, but even if I include it the results don't seem to change.

@absurdfarce

Closing per discussion on the corresponding issue

@absurdfarce absurdfarce closed this Feb 3, 2026
Successfully merging this pull request may close these issues.

JSON connector doesn't retain unicode values