Encoding issue when using Spark with Mesos #134
We need to ensure the Spark cluster uses UTF-8 encoding. Sorry, this post is in Chinese, but we can see from its bottom comment where we may need to make the change.
As for Mesos, here is a reference on ensuring UTF-8 encoding that I got from my phone (GPT-4) 😋
Remember, consistency across your entire deployment (drivers, executors, and the environment they run in) is key to ensuring that UTF-8 encoding is used throughout.
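For concreteness, here is a minimal sketch of what that advice amounts to in a Spark job; the app name is hypothetical, and only the executor side can be configured from inside the job:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: pin the executor JVMs to UTF-8. The executor option can be set
// here because executors start after the session is created; the driver
// JVM is already running at this point, so its -Dfile.encoding has to be
// passed on the spark-submit command line instead, e.g.:
//   spark-submit --driver-java-options "-Dfile.encoding=UTF-8" ...
val spark = SparkSession.builder()
  .appName("utf8-consistent-job") // hypothetical app name
  .config("spark.executor.extraJavaOptions", "-Dfile.encoding=UTF-8")
  .getOrCreate()
```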
Ok, it seems indeed that adding …
Great to know this is the line; we should add this to the docs of the Spark connector 🫡
When using nebula-spark-connector, data does not reach the nebula storage with the right encoding.
The issue only reproduces when Spark is running in a Mesos cluster.
When trying to write the value 测试, it arrives in nebula as ??.
More env data:
- Spark version: 3.0.2
- Java version: 1.8.0_392
- Mesos version: 1.11

I have tried multiple combinations of environments (e.g. Spark 3.2.1, or Mesos 1.10); the results have not changed.
Data should arrive in nebula as 测试, not as ??.
(Please note that running with Spark in local mode, or even on Kubernetes, works fine.)
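Not part of the original report, but a quick way to confirm where the mangling happens is to compare the default charset seen by the driver and by the executors (assuming a running SparkSession named `spark`):

```scala
import java.nio.charset.Charset

// Default charset as seen by the driver JVM.
println(s"driver: file.encoding=${System.getProperty("file.encoding")}, " +
  s"defaultCharset=${Charset.defaultCharset()}")

// Default charset as seen by the executor JVMs. On a misconfigured agent
// (e.g. LANG/LC_ALL unset), Java 8 commonly falls back to an ASCII locale
// charset such as ANSI_X3.4-1968, which turns 测试 into ??.
spark.sparkContext
  .parallelize(0 until 4, 4)
  .map(_ => s"file.encoding=${System.getProperty("file.encoding")}, " +
    s"defaultCharset=${Charset.defaultCharset()}")
  .distinct()
  .collect()
  .foreach(line => println(s"executor: $line"))
```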