
[raydp-317] reconcile slf4j and log4j versions between spark and ray #318

Merged (18 commits) Mar 27, 2023

Conversation

@jiafuzha (Contributor) commented Mar 17, 2023

Basic idea:

  1. Use a Java agent to capture the noisy SLF4J warning messages so that they don't show up in the shell.
  2. Use our own StaticLoggerBinder to choose which underlying log4j framework SLF4J binds to.
  3. Divide and conquer: choose different bindings based on the actual Spark and Ray versions, separately for the Spark driver and for the Spark executors inside Ray workers.
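Idea 1 can be pictured with a small sketch. This is not the actual raydp-agent code (the class name `Slf4jFilterAgent` and the drop-instead-of-redirect behavior are illustrative assumptions); it only shows the mechanism: SLF4J 1.x prints its binding warnings to `System.err`, so a `-javaagent` premain can swap in a filtering stream before Spark's classes load:

```java
import java.io.PrintStream;
import java.lang.instrument.Instrumentation;

/**
 * Hypothetical sketch, not the actual raydp-agent code: a -javaagent premain
 * that keeps SLF4J's "multiple bindings" warnings (which SLF4J 1.x prints to
 * System.err) out of the interactive shell.
 */
public final class Slf4jFilterAgent {

    // Entry point the JVM invokes when started with -javaagent:agent.jar
    public static void premain(String agentArgs, Instrumentation inst) {
        installFilter();
    }

    // Wrap System.err so lines starting with "SLF4J:" are dropped.
    // The real agent would redirect them to a per-process log file
    // (e.g. slf4j-<pid>.log under ray.logging.dir) instead of dropping them.
    static void installFilter() {
        PrintStream original = System.err;
        System.setErr(new PrintStream(original, true) {
            @Override
            public void println(String line) {
                if (line != null && line.startsWith("SLF4J:")) {
                    return; // captured: keep it out of the shell
                }
                super.println(line);
            }
        });
    }
}
```

Because the agent runs at JVM startup, the filter is in place before SLF4J's `StaticLoggerBinder` resolution ever emits a warning.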

Signed-off-by: jiafu zhang <jiafu.zhang@intel.com>
@jiafuzha:

@carsonwang please help review.

@jiafuzha:

@kira-lin I just updated the description.

@kira-lin (Collaborator) left a comment:

I feel like the way we set options (configurations) for different processes is very messy now. The logic is not clear, and it is scattered around the code.

We should write down how configurations propagate to our processes. For example, we first take input from users and set configurations for our JVM (which runs RayAppMaster) and for the Spark driver. The Spark executors' configuration is set by RayAppMaster; how does it do so, etc.

Does the Spark driver need these configs? If not, can we separate them from the native Spark ones in init_spark?

```
.split("@")[0];
String logDir = System.getProperty("ray.logging.dir");
if (logDir == null) {
  logDir = "/tmp/ray/process-" + pid;
```
Collaborator:
By default, Ray logging should go to /tmp/ray/session_latest/logs. Can we create a directory under this dir?

Contributor (author):

> By default, Ray logging should go to /tmp/ray/session_latest/logs. Can we create a directory under this dir?

For the master and Ray executors, "ray.logging.dir" is set properly. But for the Spark driver (SparkSubmit), it's not set yet.

I agree with you; I'll set it to "/tmp/ray/session_latest/logs" to make it consistent.

```
parentDir.mkdirs();
}

File logFile = new File(parentDir, "/slf4j-" + pid + ".log");
```
Collaborator:

Maybe we do not need a separate dir for every process, as we are naming the file with pid?

Collaborator:

What will get printed to this file?

Contributor (author):

It contains the logs below for each executor and the master. If I don't capture them, they'll be printed in the pyspark shell:

```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/raydp/jars/raydp-agent-1.6.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/ray/jars/ray_dist.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
mapped factory class: org.apache.logging.slf4j.Log4jLoggerFactory. load org.apache.logging.slf4j.Log4jLoggerFactory from file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
```
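The "mapped factory class" line above hints at how the custom StaticLoggerBinder picks a backend. Below is a minimal sketch of that selection idea, assuming a probe-the-classpath approach; the `BindingChooser` class and `firstAvailable` helper are hypothetical, only the two bridge factory class names are real:

```java
/**
 * Hypothetical sketch, not the actual raydp code: probe the classpath for
 * known SLF4J bridge factories and bind to the first one available.
 */
public final class BindingChooser {

    // Preference order: log4j2 bridge (shipped with pyspark as
    // log4j-slf4j-impl-*.jar), then the log4j 1.x bridge.
    static final String[] CANDIDATES = {
        "org.apache.logging.slf4j.Log4jLoggerFactory",
        "org.slf4j.impl.Log4jLoggerFactory",
    };

    // Return the first candidate class that can be loaded, or null if none.
    public static String firstAvailable(String[] candidates) {
        for (String name : candidates) {
            try {
                Class.forName(name, false, BindingChooser.class.getClassLoader());
                return name;
            } catch (ClassNotFoundException ignored) {
                // not on the classpath; try the next candidate
            }
        }
        return null;
    }
}
```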

Contributor (author):

> Maybe we do not need a separate dir for every process, as we are naming the file with pid?

I think we need a separate file for each process. Otherwise, logs from different processes get mixed together, which is bad for later tracing.

Collaborator:

> I think we need a separate file for each process. Otherwise, we get messed up logs from different processes. It's not good for later tracing.

Yes, that's true. But they are in different dirs, right? You created the dir above with the process id too. We can share the same dir; different filenames are enough.

Contributor (author):

I've changed parentDir to the same directory for all processes, as you suggested yesterday. Please check the latest code.


```
public class RayDPConstants {

  public static final String SPARK_JAVAAGENT = "spark.javaagent";
```
Collaborator:

Should we merge this file with SparkOnRayConfigs.java?

Contributor (author):

Yes, I should merge them. RayDPConstants was originally introduced in the agent module.

You can put them in the preferred classpath.
```
raydp.init_spark(..., configs={'spark.log4j.config.file.name': 'log4j-cust.properties', 'spark.ray.log4j.config.file.name': 'log4j2-cust.xml'})
```
Collaborator:

Could you make this more detailed? Elaborate on how the javaagent works, which configuration is effective on which process, and which process is spawned by Ray or Spark, etc. This will make maintenance easier, thanks.

Contributor (author):

OK. Actually there are two types of process: the Spark driver and the Ray worker. I'll map them explicitly.

@jiafuzha:

> I feel like our way to set options(configurations) for different processes is very messy now. The logic is not clear, and is split around in our code.
>
> We can write down how configurations are spread to our processes. For example, we first take input from users, and set configurations for our jvm(runs RayAppMaster), and spark driver. Spark executor's configuration is set by RayAppMaster, how does it do so, etc.
>
> Does spark driver need these config? If not, can we separate these from native spark ones in init_spark?

I'll add more docs for these configs. The Spark driver needs some of them. For users, "init_spark" is the only entry point for setting configs.

@jiafuzha:

I've addressed all the comments. Please help review again.

@jiafuzha:

@kira-lin one of the tests failed with "RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1679480349858.local".

Did you see similar issue before?

@kira-lin:

> For user, "init_spark" is the only entry for them to set config.

Yes, I wonder if we can have two parameters for this function: one for native Spark configs, one for ours.

@kira-lin:

> Did you see similar issue before?

No. The other Mac test passed. Maybe it's a problem with the GitHub CI.

@jiafuzha:

> For user, "init_spark" is the only entry for them to set config.
>
> Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

There's a subtlety here: some configs need to be prefixed with "spark.", otherwise Spark filters them out and they cannot be propagated to the Spark JVMs.
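To illustrate the subtlety: Spark forwards only properties whose keys start with "spark." to its JVMs, which is why a raydp-specific key surfaces under the "spark." prefix (e.g. spark.ray.log4j.config.file.name). The helper below is purely illustrative (`ConfigPrefixer` is not raydp code) and just sketches the prefixing:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only, not raydp code: Spark forwards only "spark."-prefixed
 * properties to its JVMs, so non-Spark keys must be exposed under that prefix.
 */
public final class ConfigPrefixer {

    public static Map<String, String> sparkVisible(Map<String, String> configs) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : configs.entrySet()) {
            String key = e.getKey();
            // Leave native Spark keys alone; prefix everything else.
            out.put(key.startsWith("spark.") ? key : "spark." + key, e.getValue());
        }
        return out;
    }
}
```

This is also why splitting init_spark's configs into "native Spark" and "ours" would still require the raydp half to be re-prefixed before handing it to Spark.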

@jiafuzha:

> Did you see similar issue before?
>
> No. Another mac test has passed. Maybe it's some problem with github CI.

OK, I'll assume there is no issue in our code then.

@jiafuzha:

@kira-lin besides the comment below, do you have any other concerns about this PR?

> Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

Since that is an API change, @carsonwang, what are your thoughts on changing the init_spark() API?

@kira-lin:

LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, spark driver will also be connected to Ray.

@jiafuzha:

> LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, spark driver will also be connected to Ray.

let me check.

@jiafuzha commented Mar 27, 2023:

> LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, spark driver will also be connected to Ray.
>
> let me check.

I added the line below in the 'connectToRay' method after Ray.init(), since Ray.init() initializes logging in Ray's own way instead of Spark's.

SparkContext.getOrCreate().setLogLevel("WARN")

It restores the driver's log level to 'WARN'. Without the line above, I got the additional output below with 'fault_tolerant_mode=True'.

```
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing modify acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: Changing modify acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jiafu); groups with view permissions: Set(); users with modify permissions: Set(jiafu); groups with modify permissions: Set()
2023-03-27 06:34:35,987 INFO Utils [Thread-4]: Successfully started service 'RAY_RPC_ENV' on port 43805.

2023-03-27 06:34:39,026 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50312) with ID 0, ResourceProfileId 0
2023-03-27 06:34:39,028 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50302) with ID 1, ResourceProfileId 0
2023-03-27 06:34:39,088 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:44525 with 2.1 GiB RAM, BlockManagerId(0, 10.239.34.11, 44525, None)
2023-03-27 06:34:39,091 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:35787 with 2.1 GiB RAM, BlockManagerId(1, 10.239.34.11, 35787, None)
```

thanks.

@kira-lin:

LGTM. Thanks

kira-lin merged commit 67b0782 into oap-project:master on Mar 27, 2023.