
[raydp-317] reconcile slf4j and log4j versions between spark and ray #318

Merged (18 commits) Mar 27, 2023

Conversation

@jiafuzha (Contributor) commented Mar 17, 2023

Basic idea:

  1. Use a Java agent to capture the noisy SLF4J warning messages so that they don't show up in the shell.
  2. Use our own StaticLoggerBinder to choose which underlying log4j framework SLF4J binds to.
  3. Divide and conquer: choose different bindings based on the actual Spark and Ray versions, separately for the Spark driver and for the Spark executors inside Ray workers.
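Idea 1 can be pictured with a small sketch. This is not the actual raydp-agent code (the class name `Slf4jFilterAgent` and the drop-instead-of-redirect behavior are illustrative assumptions); it only shows the mechanism: SLF4J 1.x prints its binding warnings to `System.err`, so a `-javaagent` premain can swap in a filtering stream before Spark's classes load:

```java
import java.io.PrintStream;
import java.lang.instrument.Instrumentation;

/**
 * Hypothetical sketch, not the actual raydp-agent code: a -javaagent premain
 * that keeps SLF4J's "multiple bindings" warnings (which SLF4J 1.x prints to
 * System.err) out of the interactive shell.
 */
public final class Slf4jFilterAgent {

    // Entry point the JVM invokes when started with -javaagent:agent.jar
    public static void premain(String agentArgs, Instrumentation inst) {
        installFilter();
    }

    // Wrap System.err so lines starting with "SLF4J:" are dropped.
    // The real agent would redirect them to a per-process log file
    // (e.g. slf4j-<pid>.log under ray.logging.dir) instead of dropping them.
    static void installFilter() {
        PrintStream original = System.err;
        System.setErr(new PrintStream(original, true) {
            @Override
            public void println(String line) {
                if (line != null && line.startsWith("SLF4J:")) {
                    return; // captured: keep it out of the shell
                }
                super.println(line);
            }
        });
    }
}
```

Because the agent runs at JVM startup, the filter is in place before SLF4J's `StaticLoggerBinder` resolution ever emits a warning.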

Signed-off-by: jiafu zhang <jiafu.zhang@intel.com>
@jiafuzha:

@carsonwang please help review.

@jiafuzha:

@kira-lin I just updated the description.

@kira-lin (Collaborator) left a comment:

I feel like the way we set options (configurations) for different processes is very messy now. The logic is not clear, and it is scattered around the code.

We should write down how configurations propagate to our processes. For example, we first take input from users and set configurations for our JVM (which runs RayAppMaster) and for the Spark driver. The Spark executors' configuration is set by RayAppMaster; how does it do so, etc.

Does the Spark driver need these configs? If not, can we separate them from the native Spark ones in init_spark?

```
.split("@")[0];
String logDir = System.getProperty("ray.logging.dir");
if (logDir == null) {
  logDir = "/tmp/ray/process-" + pid;
```
Collaborator:
By default, Ray logging should go to /tmp/ray/session_latest/logs. Can we create a directory under this dir?

Contributor (author):

> By default, Ray logging should go to /tmp/ray/session_latest/logs. Can we create a directory under this dir?

For the master and Ray executors, "ray.logging.dir" is set properly. But for the Spark driver (SparkSubmit), it's not set yet.

I agree with you; I'll set it to "/tmp/ray/session_latest/logs" to make it consistent.

```
parentDir.mkdirs();
}

File logFile = new File(parentDir, "/slf4j-" + pid + ".log");
```
Collaborator:

Maybe we do not need a separate dir for every process, as we are naming the file with pid?

Collaborator:

What will get printed to this file?

Contributor (author):

It contains the logs below for each executor and the master. If I don't capture them, they'll be printed in the pyspark shell:

```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/raydp/jars/raydp-agent-1.6.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/ray/jars/ray_dist.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
mapped factory class: org.apache.logging.slf4j.Log4jLoggerFactory. load org.apache.logging.slf4j.Log4jLoggerFactory from file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
```
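The "mapped factory class" line above hints at how the custom StaticLoggerBinder picks a backend. Below is a minimal sketch of that selection idea, assuming a probe-the-classpath approach; the `BindingChooser` class and `firstAvailable` helper are hypothetical, only the two bridge factory class names are real:

```java
/**
 * Hypothetical sketch, not the actual raydp code: probe the classpath for
 * known SLF4J bridge factories and bind to the first one available.
 */
public final class BindingChooser {

    // Preference order: log4j2 bridge (shipped with pyspark as
    // log4j-slf4j-impl-*.jar), then the log4j 1.x bridge.
    static final String[] CANDIDATES = {
        "org.apache.logging.slf4j.Log4jLoggerFactory",
        "org.slf4j.impl.Log4jLoggerFactory",
    };

    // Return the first candidate class that can be loaded, or null if none.
    public static String firstAvailable(String[] candidates) {
        for (String name : candidates) {
            try {
                Class.forName(name, false, BindingChooser.class.getClassLoader());
                return name;
            } catch (ClassNotFoundException ignored) {
                // not on the classpath; try the next candidate
            }
        }
        return null;
    }
}
```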

Contributor (author):

> Maybe we do not need a separate dir for every process, as we are naming the file with pid?

I think we need a separate file for each process. Otherwise, logs from different processes get mixed together, which is bad for later tracing.

Collaborator:

> I think we need a separate file for each process. Otherwise, we get messed up logs from different processes. It's not good for later tracing.

Yes, that's true. But they are in different dirs, right? You created the dir above with the process id too. We can share the same dir; different filenames are enough.

Contributor (author):

I've changed parentDir to the same directory for all processes, as you suggested yesterday. Please check the latest code.


```
public class RayDPConstants {

  public static final String SPARK_JAVAAGENT = "spark.javaagent";
```
Collaborator:

Should we merge this file with SparkOnRayConfigs.java?

Contributor (author):

Yes, I should merge them. RayDPConstants was originally introduced in the agent module.

You can put them in the preferred classpath.
```
raydp.init_spark(..., configs={'spark.log4j.config.file.name': 'log4j-cust.properties', 'spark.ray.log4j.config.file.name': 'log4j2-cust.xml'})
```
Collaborator:

Could you make this more detailed? Elaborate on how the javaagent works, which configuration is effective on which process, and which process is spawned by Ray or Spark, etc. This will make maintenance easier, thanks.

Contributor (author):

OK. Actually there are two types of process: the Spark driver and the Ray worker. I'll map them explicitly.

@jiafuzha:

> I feel like our way to set options(configurations) for different processes is very messy now. The logic is not clear, and is split around in our code.
>
> We can write down how configurations are spread to our processes. For example, we first take input from users, and set configurations for our jvm(runs RayAppMaster), and spark driver. Spark executor's configuration is set by RayAppMaster, how does it do so, etc.
>
> Does spark driver need these config? If not, can we separate these from native spark ones in init_spark?

I'll add more docs for these configs. The Spark driver needs some of them. For users, "init_spark" is the only entry point for setting configs.

@jiafuzha:

I've addressed all the comments. Please help review again.

@jiafuzha:

@kira-lin one of the tests failed with "RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1679480349858.local".

Did you see similar issue before?

@kira-lin:

> For user, "init_spark" is the only entry for them to set config.

Yes, I wonder if we can have two parameters for this function: one for native Spark configs, one for ours.

@kira-lin:

> Did you see similar issue before?

No. The other Mac test passed. Maybe it's a problem with the GitHub CI.

@jiafuzha:

> For user, "init_spark" is the only entry for them to set config.
>
> Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

There's a subtlety here: some configs need to be prefixed with "spark.", otherwise Spark filters them out and they cannot be propagated to the Spark JVMs.
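To illustrate the subtlety: Spark forwards only properties whose keys start with "spark." to its JVMs, which is why a raydp-specific key surfaces under the "spark." prefix (e.g. spark.ray.log4j.config.file.name). The helper below is purely illustrative (`ConfigPrefixer` is not raydp code) and just sketches the prefixing:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only, not raydp code: Spark forwards only "spark."-prefixed
 * properties to its JVMs, so non-Spark keys must be exposed under that prefix.
 */
public final class ConfigPrefixer {

    public static Map<String, String> sparkVisible(Map<String, String> configs) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : configs.entrySet()) {
            String key = e.getKey();
            // Leave native Spark keys alone; prefix everything else.
            out.put(key.startsWith("spark.") ? key : "spark." + key, e.getValue());
        }
        return out;
    }
}
```

This is also why splitting init_spark's configs into "native Spark" and "ours" would still require the raydp half to be re-prefixed before handing it to Spark.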

@jiafuzha:

> Did you see similar issue before?
>
> No. Another mac test has passed. Maybe it's some problem with github CI.

OK, I'll assume there is no issue in our code then.

@jiafuzha:

@kira-lin besides the comment below, do you have any other concerns about this PR?

> Yes, I wonder if we can have two parameters for this function, one for native spark config, one for ours.

Since that is an API change, @carsonwang, what are your thoughts on changing the init_spark() API?

@kira-lin:

LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, spark driver will also be connected to Ray.

@jiafuzha:

> LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, spark driver will also be connected to Ray.

let me check.

@jiafuzha commented Mar 27, 2023:

> LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, spark driver will also be connected to Ray.
>
> let me check.

I added the line below in the 'connectToRay' method after Ray.init(), since Ray.init() initializes logging in Ray's own way instead of Spark's.

SparkContext.getOrCreate().setLogLevel("WARN")

It restores the driver's log level to 'WARN'. Without the line above, I got the additional output below with 'fault_tolerant_mode=True'.

```
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing modify acls to: jiafu
2023-03-27 06:34:35,970 INFO SecurityManager [Thread-4]: Changing view acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: Changing modify acls groups to:
2023-03-27 06:34:35,971 INFO SecurityManager [Thread-4]: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jiafu); groups with view permissions: Set(); users with modify permissions: Set(jiafu); groups with modify permissions: Set()
2023-03-27 06:34:35,987 INFO Utils [Thread-4]: Successfully started service 'RAY_RPC_ENV' on port 43805.

2023-03-27 06:34:39,026 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50312) with ID 0, ResourceProfileId 0
2023-03-27 06:34:39,028 INFO CoarseGrainedSchedulerBackend$DriverEndpoint [dispatcher-CoarseGrainedScheduler]: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.239.34.11:50302) with ID 1, ResourceProfileId 0
2023-03-27 06:34:39,088 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:44525 with 2.1 GiB RAM, BlockManagerId(0, 10.239.34.11, 44525, None)
2023-03-27 06:34:39,091 INFO BlockManagerMasterEndpoint [dispatcher-BlockManagerMaster]: Registering block manager 10.239.34.11:35787 with 2.1 GiB RAM, BlockManagerId(1, 10.239.34.11, 35787, None)
```

thanks.

@kira-lin:

LGTM. Thanks

kira-lin merged commit 67b0782 into oap-project:master on Mar 27, 2023.