[raydp-317] reconcile slf4j and log4j versions between spark and ray #318
Conversation
@carsonwang please help review.
@kira-lin I just updated the description.
I feel like the way we set options (configurations) for different processes is very messy now. The logic is not clear and is scattered around our code.
We should write down how configurations propagate to our processes. For example, we first take input from users and set configurations for our JVM (which runs RayAppMaster) and the Spark driver. The Spark executor's configuration is set by RayAppMaster; how does it do so, etc.
Does the Spark driver need these configs? If not, can we separate them from the native Spark ones in init_spark?
.split("@")[0]; | ||
String logDir = System.getProperty("ray.logging.dir"); | ||
if (logDir == null) { | ||
logDir = "/tmp/ray/process-" + pid; |
By default, Ray logging should go to /tmp/ray/session_latest/logs. Can we create a directory under this dir?
For the master and Ray executors, "ray.logging.dir" is already set. But for the Spark driver (SparkSubmit), it's not set yet.
I agree with setting it to "/tmp/ray/session_latest/logs" to make it consistent.
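A minimal sketch of the agreed fallback, with a hypothetical helper name: the master and Ray executors already have the property set, so only the Spark driver hits the default.
```
// Hypothetical helper illustrating the fallback discussed above: prefer the
// "ray.logging.dir" system property (set for the master and Ray executors),
// and default to Ray's session log directory for the Spark driver.
private static String resolveLogDir() {
    String logDir = System.getProperty("ray.logging.dir");
    if (logDir == null) {
        logDir = "/tmp/ray/session_latest/logs";
    }
    return logDir;
}
```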
parentDir.mkdirs();
}

File logFile = new File(parentDir, "/slf4j-" + pid + ".log");
Maybe we do not need a separate dir for every process, as we are naming the file with pid?
What will get printed to this file?
It contains the logs below for each executor and the master. If I don't capture them, they'll be printed in the pyspark-shell.
"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/raydp/jars/raydp-agent-1.6.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/ray/jars/ray_dist.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
mapped factory class: org.apache.logging.slf4j.Log4jLoggerFactory. load org.apache.logging.slf4j.Log4jLoggerFactory from file:/home/jiafu/anaconda3/envs/ray/lib/python3.9/site-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
"
I think we need a separate file for each process. Otherwise, logs from different processes get mixed together, which is bad for later tracing.
Yes, that's true. But they are in different dirs, right? The dir you created above also includes the process id. We can share the same dir; different file names are enough.
I've changed parentDir to the same dir for all processes, as you suggested yesterday. Please check the latest code.
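A minimal sketch of the resulting layout (class and method names are hypothetical): all processes share one log directory, and the pid in the file name keeps their logs separate.
```
import java.io.File;
import java.lang.management.ManagementFactory;

public class Slf4jLogFileSketch {
    public static File logFileForCurrentProcess(String sharedLogDir) {
        File parentDir = new File(sharedLogDir); // same dir for every process
        parentDir.mkdirs();
        // RuntimeMXBean name is "pid@hostname" on common JVMs
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        return new File(parentDir, "slf4j-" + pid + ".log");
    }
}
```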
public class RayDPConstants {

  public static final String SPARK_JAVAAGENT = "spark.javaagent";
Should we merge this file with SparkOnRayConfigs.java?
Yes, I should merge them. RayDPConstants was originally introduced from the agent module.
You can put them on the preferred classpath.
```
raydp.init_spark(..., configs={'spark.log4j.config.file.name': 'log4j-cust.properties', 'spark.ray.log4j.config.file.name': 'log4j2-cust.xml'})
```
Could you make it more detailed? Elaborate on how the javaagent works, which configuration takes effect in which process, and which processes are spawned by Ray or by Spark, etc. This will make maintenance easier, thanks.
OK. Actually, there are two types of processes: the Spark driver and Ray workers. I'll map them explicitly.
I'll add more docs for these configs. The Spark driver needs some of them. For users, init_spark is the only entry point to set configs.
I've just addressed all the comments. Please help review again.
@kira-lin One of the tests failed with "RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1679480349858.local". Have you seen a similar issue before?
There is a subtlety here: some configs need to be prefixed with "spark.", otherwise they'll be filtered out by Spark and thus cannot be propagated to the Spark JVMs.
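A minimal sketch of that filtering behavior (the keys are only examples): SparkConf loads defaults only from JVM system properties whose keys start with "spark.", so an unprefixed key never reaches the Spark configuration.
```
import org.apache.spark.SparkConf;

public class PrefixDemo {
    public static void main(String[] args) {
        System.setProperty("spark.ray.log4j.config.file.name", "log4j2-cust.xml");
        System.setProperty("ray.logging.dir", "/tmp/ray/session_latest/logs");

        // SparkConf(loadDefaults = true) only picks up "spark.*" system properties
        SparkConf conf = new SparkConf(true);
        System.out.println(conf.contains("spark.ray.log4j.config.file.name")); // true
        System.out.println(conf.contains("ray.logging.dir"));                  // false
    }
}
```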
Ok, I assume there is no issue in our code then.
@kira-lin Besides the comments below, do you have any other concerns about this PR?
LGTM. One last question: what will happen if fault_tolerant_mode is set to True? In that case, the Spark driver will also be connected to Ray.
Let me check.
I added the line below in the connectToRay method after Ray.init(), since Ray.init() reinitializes logging in Ray's own way instead of Spark's:
SparkContext.getOrCreate().setLogLevel("WARN")
It restores the driver's log level to WARN. Without the line above, I got the additional output below with fault_tolerant_mode=True.
'
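A hypothetical placement sketch; only Ray.init() and the setLogLevel call come from the discussion above, and the class and method names are made up.
```
import io.ray.api.Ray;
import org.apache.spark.SparkContext;

public class ConnectToRaySketch {
    static void connectToRay() {
        Ray.init(); // reinitializes logging in Ray's own way, not Spark's
        // restore the Spark driver's log level right after Ray.init()
        SparkContext.getOrCreate().setLogLevel("WARN");
    }
}
```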
Thanks.
LGTM. Thanks
Basic Idea: