Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexibility in the deps #165

Merged
merged 4 commits into from
Jul 20, 2021
Merged

Flexibility in the deps #165

merged 4 commits into from
Jul 20, 2021

Conversation

mjschock
Copy link
Contributor

@mjschock mjschock commented Jul 7, 2021

This assumes the external library authors are using semver.

There's still #121 to address but this unblocks those of us that want to use a newer version of pyspark.

@@ -19,9 +19,9 @@ name: RayDP CI

on:
push:
branches: [ master ]
branches: [ main, master ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we don't have the main branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, i added it b/c main is now the default branch rather than master. that way the workflow runs: https://github.com/mjschock/raydp/actions/runs/1007895832

python/setup.py Outdated
"psutil < 6.0.0",
"pyarrow >= 0.10, < 5.0.0",
"ray >= 1.4.0, < 2.0.0",
"pyspark >= 3.0.0, < 4.0.0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kira-lin, does the raydp support spark 3.1.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it works fine when building with the older version in the pom.xml: #121 (comment). need to address that issue separately but this unblocks usage of 3.1.0 which is the minimum required for, e.g., delta-spark (https://pypi.org/project/delta-spark/), which i'd like to integrate

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if there are demand for spark 3.1, we should give it a try. @mjschock can you please also modify the dependencies in the raydp.yaml to see if it pass CI?
By the way, I actually have the patch to support spark 3.1, but we are in a discussion about whether we should add a shim layer for different spark versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kira-lin i updated raydp.yml to use both pyspark 3.0.0 and 3.1.2 and they both pass

python/setup.py Outdated
"ray == 1.4.0",
"pyspark >= 3.0.0, < 3.1.0",
"netifaces"
"numpy < 2.0.0",
Copy link
Collaborator

@ConeyLiu ConeyLiu Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mjschock, I think the changes for pandas make sense. Why you change the others? They should be flexible already.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, ray has some internal changes for different versions. So we have to specify the given version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a preventative measure so i don't run into things like the pyspark versioning issue in the future. assuming other library authors play nice and abide by semver then patch and minor updates shouldn't break anything, only major. i can reduce the changes just to pyspark if desired but i wanted to be consistant

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjschock , based on our previous experience, we usually need to make changes to support new Ray releases like 1.5. So we should probably keep ray == 1.4.0 otherwise things may break. For most other dependencies, my understanding is setting a minimal version or without setting anything should be fine. Can we just change pyspark and pandas for now to unblock your needs?

"psutil",
"pyarrow >= 0.10",
"ray == 1.4.0",
"pyspark >= 3.0.0, < 3.1.0",
"ray >= 1.4.0, < 1.5.0",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carsonwang - would this be alright, allowing patch changes?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks good. Thanks!

@kira-lin
Copy link
Collaborator

It seems like this PR has gone beyond its title. Can you please explain why you merged PR #161 ? Using code search path is not ready yet, because it requires all python function/actors to be in the code search path as well. This might break our current users application.

@mjschock
Copy link
Contributor Author

mjschock commented Jul 12, 2021

It seems like this PR has gone beyond its title. Can you please explain why you merged PR #161 ? Using code search path is not ready yet, because it requires all python function/actors to be in the code search path as well. This might break our current users application.

@kira-lin sorry about that, i should have created a separate branch. i've done that now and reverted the changes.

i'm now running into stackoverflow issues related to ZipArchive.scala and, after a bit of research, it seemed this could be related to an old version of scala, and possibly the usage of py4j, so i pulled in the changes to see if they helped get around that issue.

this is what i'm seeing:

  File "/home/base/.cache/pypoetry/virtualenvs/app-V0mE0AR0-py3.8/lib/python3.8/site-packages/raydp/spark/ray_cluster_master.py", line 169, in _create_app_master
    self._app_master_java_bridge.startUpAppMaster(extra_classpath)
  File "/home/base/.cache/pypoetry/virtualenvs/app-V0mE0AR0-py3.8/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/base/.cache/pypoetry/virtualenvs/app-V0mE0AR0-py3.8/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o0.startUpAppMaster.
: java.lang.StackOverflowError
	at scala.reflect.io.ZipArchive$.scala$reflect$io$ZipArchive$$dirName(ZipArchive.scala:58)
	at scala.reflect.io.ZipArchive.ensureDir(ZipArchive.scala:114)

@kira-lin
Copy link
Collaborator

Oh, I see. Can you please be more specific on what causes this error? You are using Spark 3.0.0 when building the scala part, and 3.1.2 for pyspark, right? And when you call raydp.init_spark, you see this error message?

Besides, what java version are you using? It seems scala, spark and py4j versions are not modified, thus it should be working.

Again, thanks for contributing @mjschock

@mjschock
Copy link
Contributor Author

Oh, I see. Can you please be more specific on what causes this error? You are using Spark 3.0.0 when building the scala part, and 3.1.2 for pyspark, right? And when you call raydp.init_spark, you see this error message?

Besides, what java version are you using? It seems scala, spark and py4j versions are not modified, thus it should be working.

Again, thanks for contributing @mjschock

yes, Spark 3.0.0 in the pom.xml when i run ./build.sh. and yes, pyspark 3.1.2 when intiializing. java is 1.8, the openjdk version. the example is roughly the one from the anyscale post here: https://www.anyscale.com/blog/data-processing-support-in-ray

import ray
import raydp

ray.util.connect("ray-head:10001")

@ray.remote
class PySparkDriver:
  def __init__(self):
  self.spark = raydp.init_spark(
    app_name='RayDP example',
    num_executors=2,
    executor_cores=2,
    executor_memory='4GB')

  def foo(self):
    return self.spark.range(1000).repartition(10).count()

driver = PySparkDriver.remote()
print(ray.get(driver.foo.remote()))

@mjschock
Copy link
Contributor Author

Oh, I see. Can you please be more specific on what causes this error? You are using Spark 3.0.0 when building the scala part, and 3.1.2 for pyspark, right? And when you call raydp.init_spark, you see this error message?
Besides, what java version are you using? It seems scala, spark and py4j versions are not modified, thus it should be working.
Again, thanks for contributing @mjschock

yes, Spark 3.0.0 in the pom.xml when i run ./build.sh. and yes, pyspark 3.1.2 when intiializing. java is 1.8, the openjdk version. the example is roughly the one from the anyscale post here: https://www.anyscale.com/blog/data-processing-support-in-ray

import ray
import raydp

ray.util.connect("ray-head:10001")

@ray.remote
class PySparkDriver:
  def __init__(self):
  self.spark = raydp.init_spark(
    app_name='RayDP example',
    num_executors=2,
    executor_cores=2,
    executor_memory='4GB')

  def foo(self):
    return self.spark.range(1000).repartition(10).count()

driver = PySparkDriver.remote()
print(ray.get(driver.foo.remote()))

anyway, i think this is unrelated. i'll investigate further when i get a chance

@kira-lin
Copy link
Collaborator

sorry I cannot help here since I've never seen errors like this, but I guess it's due to the environment. Did you always have the problem? Is it OK if you use pip to install our package? If so, I can upgrade the spark version now, since it'll only affect the nightly release.

@kira-lin
Copy link
Collaborator

By the way, we are curious about in what scenario you plan to use RayDP? What feature would be useful? And if you have any other issue, feel free to ask us.

@carsonwang carsonwang merged commit ff48b7b into oap-project:master Jul 20, 2021
@carsonwang
Copy link
Collaborator

Thanks @mjschock. Merged this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants