
Encoding issue when using spark with mesos #134

Closed
creyer opened this issue Dec 13, 2023 · 5 comments

creyer commented Dec 13, 2023

When using nebula-spark-connector, data does not reach Nebula storage with the right encoding.
The issue only reproduces when Spark is running in a Mesos cluster.

When trying the following:

val df = Seq((1, "测试")).toDF("id", "value")
df.write.nebula(config, nebulaWriteVertexConfig).writeVertices()

the value arrives in Nebula as ??.

More env data:
spark version: 3.0.2
java version: 1.8.0_392
mesos 1.11

I have tried multiple combinations of the environment (Spark 3.2.1, Mesos 1.10); the results have not changed.

Data should arrive in Nebula as 测试, not as ??.

(Please note that running Spark locally or even on Kubernetes works fine.)
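
For reference, ?? is the classic symptom of encoding text with a charset that cannot represent CJK characters: Java's String.getBytes substitutes the replacement byte '?' for each unmappable character. A minimal sketch demonstrating this in plain Scala, independent of the connector:

    import java.nio.charset.StandardCharsets

    val s = "测试"
    // US-ASCII cannot represent CJK characters, so each one is replaced
    // by the charset's default replacement byte '?'
    val mangled = new String(s.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII)
    println(mangled) // prints ??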


wey-gu commented Dec 13, 2023

We need to ensure the Spark cluster uses UTF-8 encoding.

Sorry, this post is in Chinese, but its bottom comment shows where we may need to make changes.

https://discuss.nebula-graph.com.cn/t/topic/5367/34?u=wey
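
To confirm where the wrong charset comes from, here is a minimal diagnostic sketch (assuming a running SparkSession named spark) that prints the JVM default charset on the driver and on each executor; anything other than UTF-8 on the executors explains the mangled data:

    import java.nio.charset.Charset

    // Default charset of the driver JVM
    println(s"driver charset: ${Charset.defaultCharset()}")

    // Default charset seen by each executor JVM; distinct() collapses
    // the per-task results down to the unique charset names observed
    val executorCharsets = spark.sparkContext
      .parallelize(1 to 100)
      .map(_ => Charset.defaultCharset().toString)
      .distinct()
      .collect()
    println(s"executor charsets: ${executorCharsets.mkString(", ")}")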


wey-gu commented Dec 13, 2023

As for Mesos, here are some references for ensuring UTF-8 encoding, which I got from my phone (GPT-4) 😋

  1. Set Java Options: Ensure that the Java Virtual Machine (JVM) options for Spark are set to use UTF-8 encoding. You can set this in the Spark configuration (spark-defaults.conf) or through command line arguments:

    spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8
    spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8
  2. Configure Mesos: Check the configuration of your Mesos cluster to ensure it's not overriding the environment variables or JVM options set for Spark. Mesos should pass the UTF-8 encoding settings to the Spark jobs.

  3. Use Environment Variables: Set the environment variable JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF-8 on each Mesos agent. This ensures that all Java applications, including Spark, use UTF-8 encoding by default.

  4. Containerized Deployments: If you are using containers (like Docker) for deploying Spark on Mesos, make sure the containers are configured to use UTF-8 encoding. This can be set in the Dockerfile or the container’s environment settings.

  5. Spark Job Submission: When submitting Spark jobs, explicitly specify the encoding:

    spark-submit --conf "spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8" \
                 --conf "spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8" \
                 ...
  6. Test with a Simple Job: Create a simple Spark job that reads and writes UTF-8 encoded data to validate the configuration (see the round-trip sketch at the end of this comment).

  7. Logging and Debugging: Enable detailed logging for both Spark and Mesos to diagnose any issues related to encoding.

  8. Cluster-wide Settings: If you have access to modify cluster-wide settings, ensure that the default character set and locale settings are set to UTF-8 on all nodes in the Mesos cluster.

Remember, consistency across your entire deployment (drivers, executors, and the environment they run in) is key to ensuring that UTF-8 encoding is used throughout.
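
Expanding on step 6, a minimal round-trip test sketch (again assuming a SparkSession named spark): it encodes a non-ASCII sample with each executor's default charset and checks whether it survives. On a misconfigured executor the round trip turns 测试 into ??:

    import java.nio.charset.Charset

    val sample = "测试"
    val results = spark.sparkContext
      .parallelize(1 to 10)
      .map { _ =>
        val cs = Charset.defaultCharset()
        // Encode and decode with the executor's default charset
        val roundTripped = new String(sample.getBytes(cs), cs)
        (cs.toString, roundTripped == sample)
      }
      .distinct()
      .collect()

    // Every executor should report UTF-8 and roundTripOk = true
    results.foreach { case (cs, ok) => println(s"charset=$cs roundTripOk=$ok") }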


creyer commented Dec 13, 2023

OK, it seems that adding extraJavaOptions=-Dfile.encoding=UTF-8 is indeed the key.
Issue solved, thank you.


wey-gu commented Dec 13, 2023

> OK, it seems that adding extraJavaOptions=-Dfile.encoding=UTF-8 is indeed the key.
> Issue solved, thank you.

Great to know this is the fix; we should add this to the docs of the Spark connector 🫡


Nicole00 commented Dec 14, 2023

It's the same as this FAQ entry:
https://docs.nebula-graph.io/3.6.0/import-export/nebula-exchange/ex-ug-FAQ/#q_how_to_correct_the_messy_code_when_importing_hive_data_into_nebulagraph
