[FEA] Make XGBoost4j-Spark to support PySpark #7578
Comments
What about including it within the xgboost python package? Obviously it should gracefully fail when the xgboost4j-spark JAR is unavailable.
From the PR code:
Could you add a way to set the Scala tracker instead of the Python one? In Java/Scala I set:
My personal preference is that we get a design doc first. I spent the last 2 days learning about Spark and was able to build a Python interface using the new Arrow serialization format in PySpark.
While you can train python xgboost on a single node and perform parallel inference with .mapInPandas (in pyspark >= 3.0) or a @pandas_udf (in pyspark >= 2.3), the problem is that the training step is not distributed.
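To make the inference side of this concrete, here is a minimal sketch of batch scoring with .mapInPandas. The SparkSession, input path, feature column names, and model path are all illustrative assumptions, not details from this thread; the fit itself still happens on a single machine, which is exactly the limitation being discussed.

```python
from typing import Iterator

import pandas as pd
import xgboost as xgb
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
feature_cols = ["f0", "f1", "f2"]          # illustrative feature names
df = spark.read.parquet("/data/to_score")  # illustrative input path

# Booster trained beforehand on a single node (path is illustrative).
booster = xgb.Booster()
booster.load_model("/models/xgb.json")
bc_booster = spark.sparkContext.broadcast(booster)

def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    bst = bc_booster.value
    for pdf in batches:
        preds = bst.predict(xgb.DMatrix(pdf[feature_cols]))
        yield pdf.assign(prediction=preds)

scored = df.mapInPandas(predict_batches, schema=df.schema.add("prediction", "double"))
```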
Thank you for the reply. Could you please be more specific about why it can't be distributed?
.mapInPandas (or @pandas_udf) applies a function to a batch/group of Spark DataFrame rows. This can easily be used to parallelize inference of python xgboost. I don't have a deep understanding of xgboost's inner workings, but if you want to distribute training you have to design a distributed training strategy. This is what xgboost4j-spark apparently does (using Spark). I'm not saying you can't do that in python/pyspark; I'm pretty sure you can, but you would basically have to reimplement the xgboost4j-spark Spark-distributed training logic in python/pyspark. It would be nice, but I think the effort would be much higher than just wrapping xgboost4j-spark.
Indeed. I don't want to reinvent the wheel either. On the other hand, I have concerns about the complexity and the lack of flexibility of building on top of the JVM package: building and distributing the package through pip and conda, debugging any issue, model compatibility between languages, and lastly the difficulty of adding new features. As for the internal workings of xgboost, I think the Spark package just repartitions the data to the number of workers and then trains on each partition. In contrast, the dask package can access partitions on each worker directly, so no repartitioning is required. I can't be entirely sure about the ray package at the moment.
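For contrast with the Spark discussion above, a minimal sketch of the existing Python-native distributed path via the dask module; the scheduler address, data path, and column names are assumptions for illustration only.

```python
import xgboost as xgb
from dask import dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")       # assumed Dask scheduler address
ddf = dd.read_parquet("/data/train.parquet")  # assumed training data path

# DaskDMatrix references the worker-resident partitions directly,
# so no explicit repartitioning step is needed.
dtrain = xgb.dask.DaskDMatrix(client, ddf[["f0", "f1", "f2"]], ddf["label"])

output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]  # the trained model, usable like any other Booster
```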
@trivialfis @wbo4958 Hey everyone, I do find it funny that people are still using my code from PR #4656 after all these years! The tracking issue from 2018 still has the remaining steps to finalize PySpark support:
@trivialfis funnily enough, the whole of core ... For reference, here are some of the ...
The main advantage of starting from python xgboost instead of the JVM package is that the JVM version is always lagging behind the python version in terms of features.
@candalfigomoro implementing a Spark package in Java/Scala is MUCH easier than Python, and since we can get Python effectively for free with Py4J (and wrapping the Java API), going through all that effort would be for nothing.
@thesuperzapper
@candalfigomoro that question seems disconnected. Running the ...
Thank you for the discussion! To me, at this early stage, a better question to ask is what users want from a PySpark XGBoost integration. For example, what motivates you to use PySpark instead of the Scala version? How can the PySpark-XGBoost interface be part of the workflow you fell in love with? Are users required to know anything about Java/Scala? How to implement the interface is a different question. Hence I argue that a design doc is needed.

As for what users want from the PySpark package, I have to assume it's the ecosystem of Python along with the features provided by Spark. The ecosystem consists of both toolchains and libraries. From this perspective, I think it's a good idea to ship the Python wrapper as a proper Python package instead of some files inside a jar. So that's the interface for installation. Adding to this, I think it will be a Python-oriented package, so users should not be asked to understand what a jar is, in the same way that xgboost4j-spark users are not required to know about the underlying native library.

Being Python-oriented also implies that the pyspark-xgboost package should have some basic interoperability with the rest of the Python world and with the XGBoost Python package (instead of the JVM package). For the former, that includes document generation with Sphinx, type hints, serialization (pickle), type hierarchy, running in a Python shell when using local mode, etc. The PySpark package itself has done a great job on this front and we should follow. For the latter, the current JVM Spark package produces models that are not compatible with other language bindings, which is disconcerting and I really want to have it addressed; at the very least, that should not happen to the PySpark XGBoost package. Another example is the callback functions and custom metrics/objectives in Python. Also, with this interoperability, we can use other libraries in the Python ecosystem like SHAP to assist analysis, treelite/FIL for faster inference, etc. These might also be what users want.

Whether it's implemented using the existing JVM package or built from scratch is an important decision for both feature development and deployment. But meeting the conventions and interoperability of the Python ecosystem is a more important aspect to me, to the degree that it might be the only value of having such an integration. If the PySpark integration doesn't have this interoperability, we might as well just use the Scala version instead.

In summary, I think it should be a higher-level, standard Python package to make the integration most useful; whether it's built on top of xgboost4j or python xgboost should be hidden as an implementation detail.
Sorry for the late response, I was on vacation. @trivialfis, I didn't test PySpark with the XGBoost Python package, and I guess it can work well for the CPU pipeline. But for the GPU pipeline, it may not fully leverage the parallelism of Spark to accelerate the whole pipeline, e.g., if I'd like to train the XGBoost model using PySpark with the XGBoost Python package on a Spark cluster (48 CPU cores, 4 GPUs, 4 workers). On the other hand, 1.6.0-SNAPSHOT has introduced XGBoost-Spark-GPU, which can typically accelerate the ETL phase, which makes XGBoost more competitive.
@wbo4958 Hey, I'm not against using the JVM package, please see #7578 (comment). PySpark MLlib is built on top of the existing Scala implementation and it's perfectly fine.
Hi @trivialfis, as you mentioned in #7578 (comment), you have successfully tried running the XGBoost Python package in a PySpark env. It looks like, for now, we don't need to do anything to support the XGBoost Python package in PySpark; that's cool, right? So, back to the FEA request: we can think of it as an extension of XGBoost-JVM, because it is just an XGBoost JVM API wrapper, and any logic in it will be routed to XGBoost JVM. So here we provide Python users with two ways to use XGBoost.
That's what I said in #7578 (comment). My usage scenario would be: train xgboost on larger-than-memory datasets by leveraging Spark-distributed computation, and integrate xgboost training within a python/pyspark code base. Potential solutions are:
That's mostly caused by me not being familiar with the JVM ecosystem. Feel free to contribute. ;-)
We are maintaining a dask module for distributed training and inference.
Quick update: after long arguments with @wbo4958, we met somewhere in the middle and might follow https://nlp.johnsnowlabs.com/docs/en/install#python as an example for package distribution. In essence, the Python package and the JVM package will be managed separately, so Python users still have their toolchains working while Spark manages the JVM dependencies. As a feature request from me, we might try to change the xgboost4j-spark package so that the model is more portable and can be used by other Python tools. Aside from these, we will try to add some small polish to the original PR from @thesuperzapper before getting it ready for another attempt.
Databricks implemented a PySpark XGBoost integration on top of Python XGBoost instead of xgboost4j-spark. You can find the API docs here. We are happy to contribute it to the official XGBoost repo if the stakeholders agree it is better for PySpark users. To assist the discussion, we open-sourced our implementation at https://github.com/mengxr/pyspark-xgboost under the Apache License. Please take a look. Note that it doesn't contain the code that can efficiently handle sparse data because we use a private method internally. Hopefully that is not a blocker for design discussions.
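For readers unfamiliar with the Pythonic estimator style being proposed, here is a rough usage sketch. The import path, class name, and parameter names below are hypothetical placeholders rather than the actual API of the linked repo; they only illustrate that such an integration plugs into pyspark.ml like any other estimator.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

# Hypothetical import path and class name; see the linked repo / API docs
# for the real ones.
from pyspark_xgboost import XgboostClassifier

spark = SparkSession.builder.getOrCreate()
train_df = spark.read.parquet("/data/train.parquet")  # illustrative path

assembler = VectorAssembler(inputCols=["f0", "f1", "f2"], outputCol="features")
clf = XgboostClassifier(featuresCol="features", labelCol="label", numWorkers=4)  # placeholder params

model = Pipeline(stages=[assembler, clf]).fit(train_df)
scored = model.transform(train_df)  # predictions come back as DataFrame columns
```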
Hi @mengxr. Could you please share some insight into why you chose the Python approach instead of the JVM approach? I understand that the Python approach can lead to a better user experience for Python users. But for the longer term, we settled on the JVM approach due to memory consumption and the potential performance penalty from data serialization (Arrow & pickle). Also, it would be easier for us to integrate GPUs into the user pipeline. We would like to learn more about the rationale behind your decision. I was informed that you think the JVM approach doesn't meet the requirements of Databricks; could you please elaborate on this part? I'm open to suggestions and not biased toward any approach (as suggested in previous comments in this thread).
We ran some benchmarks and we haven't observed a performance penalty from data serialization yet. We only need the data conversion once before training, so memory consumption during training should be the same for both implementations; both call native XGBoost for distributed training. Do you have examples demonstrating memory/performance issues? Our implementation also supports GPUs. We can run benchmarks to see if there are performance gaps. Again, the only difference is the initial data conversion/transfer. Most of the benefits of going Python-native were discussed in this thread; I think they outweigh minor performance gaps, if any. I cherry-picked a few items from our requirements doc:
This is not an issue on Databricks but it would be an issue for open-source PySpark/XGBoost users.
Thx @mengxr, here are some comments about the requirements:
The JVM side passes all the parameters to XGBoost, including those not defined as a Param; see https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifier.scala#L199. So the JVM package should also automatically support new parameters without any changes.
Yeah, the native model trained on the JVM side should be xgboost-compatible, so it's not hard to convert it to an xgboost-python model (a minimal loading sketch follows after these comments).
Yeah, that issue has been fixed and, right now, the JVM package is quite stable.
Yeah, that's painful for the JVM package. But if users are trying it on Spark, they can be expected to have some knowledge of the spark-submit parameters. I also like your solution, which is really good for users. But after testing, the JVM package can train a dataset about 2x larger than your solution can before crashing. There is also no ranking support in your solution. What's more, it really feels like reinventing the wheel. And if the xgboost community accepts your solution, there will be two different ways to run XGBoost on Spark, and maintaining two different ways is really hard from the XGBoost community's point of view.
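On the model-compatibility point above, a minimal sketch of loading a natively exported JVM-side booster from Python; it assumes the Scala side exported the raw booster beforehand (for example via the model's nativeBooster), and the path and test data are illustrative.

```python
import numpy as np
import xgboost as xgb

# Assumes the Scala side exported the raw booster beforehand, e.g. with
# model.nativeBooster.saveModel("/models/xgb_native"); the path is illustrative.
bst = xgb.Booster()
bst.load_model("/models/xgb_native")

# Once loaded, it behaves like any other Python-side booster.
X = np.random.rand(10, 3)            # illustrative feature matrix
preds = bst.predict(xgb.DMatrix(X))
```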
Do you plan to provide an API for PySpark users to get the Python model object directly?
I think that underestimates the usability issues. Managing Python dependencies on PySpark is not a trivial task; see https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html. Tracking JARs on the side adds more complexity. For example, this is a common workflow:
If we go with the Scala wrapper, a user might use the latest version of XGBoost (Python) but an older version of xgboost4j-spark, where behavior changes could happen. There are issues with Scala version compatibility too: Spark has supported Scala 2.13 since the 3.2 release (7 months ago), but xgboost4j doesn't support it yet. See #6596.
Do you mind sharing the benchmark code? Both approaches just do data conversion and then call XGBoost for distributed training. Using PySpark doesn't require keeping a copy of the data. XGBoost (Python) might keep an extra copy; if that is the case, we just need a streaming input for DMatrix (see the iterator sketch after these replies).
Understood the extra maintenance cost. In the end, I think it depends on how you weigh PySpark user experience against community maintenance cost. Note that there are also many more Python developers than Scala developers who can contribute to the project.
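To illustrate the streaming-input idea mentioned above, here is a minimal sketch using XGBoost's DataIter with QuantileDMatrix (available in recent XGBoost releases; older versions expose a similar GPU-only DeviceQuantileDMatrix). The list of batches is an assumption standing in for whatever partitioned data a wrapper would receive.

```python
import numpy as np
import xgboost as xgb

class BatchIter(xgb.DataIter):
    """Feeds pre-partitioned batches into DMatrix construction so the
    whole dataset never has to be concatenated into one extra copy."""

    def __init__(self, batches):
        self._batches = batches  # list of (X, y) pairs; illustrative
        self._pos = 0
        super().__init__()

    def next(self, input_data):
        if self._pos == len(self._batches):
            return 0                     # no more batches
        X, y = self._batches[self._pos]
        input_data(data=X, label=y)      # hand one batch to XGBoost
        self._pos += 1
        return 1

    def reset(self):
        self._pos = 0

batches = [(np.random.rand(100, 3), np.random.randint(2, size=100)) for _ in range(4)]
dtrain = xgb.QuantileDMatrix(BatchIter(batches))
booster = xgb.train({"objective": "binary:logistic", "tree_method": "hist"}, dtrain)
```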
Forgot to ask: how do you plan to support callbacks with the Scala wrapper?
@mengxr Is there a clear path for the implementation of sparse data support?
Excellent! Will get back to this on Monday.
We made multiple comparisons between the Pythonic approach and the JVM approach. The result is inconclusive, but I would like to push forward instead of stalling on this much-anticipated feature. @wbo4958 would like to proceed with the JVM approach, but I would like to continue with the Pythonic approach (the one from @mengxr), even though I tried to be neutral for a long while. My preference is due to one simple reason: it should be a Python package targeting Python programmers. If we don't have that, there's little value in introducing yet another interface; the JVM package is always there. Following is a list of pros and cons of the Pythonic approach I summarized from various conversations for future reference.

Pros
I strongly agree. I gave an example in an offline discussion: if I were to write a forecasting library based on a bunch of ML libraries like XGBoost, LightGBM, PyTorch, etc., and users are expected to use my library instead of interacting with the underlying implementations directly, do they need to match those package versions themselves? With the size of the Python environment, things can get out of hand really quickly.
Following are the concerns around the Pythonic approach that I have received:
My conclusion is that neither approach is perfect, but I would like to choose one of them and push forward, and the Python approach makes better sense to me as it has fewer problems for us to solve and is more promising (I have high hopes that the Spark developers can get them right). Most importantly, it actually looks like a Python package.
If there are other concerns about user experience please feel free to raise them. Otherwise, let's move forward and get a working prototype for merge.
@mengxr Please let us know if there's anything we can help with.
@trivialfis I want to confirm the final decision. Are we doing both (@wbo4958 on the xgboost4j wrapper and others on the native Python wrapper), or are we going with the native Python wrapper only? It isn't very clear to me and I'm not familiar with the sign-off process here.
We are going with the Python wrapper approach, as there's no new concern on the user experience side. We won't support two implementations, so don't worry about that. A related question: since the Python wrapper in its current state is not quite complete, will you continue development and maintenance after the initial merge?
On top of the code we open-sourced, we will add back sparse data support. I guess we can add LTR as well, which should be straightforward. And yes, we would like to help with maintenance after the initial merge.
@mengxr, yeah, the Pythonic way is the only one. Please file the first PR. Thx
The Python-based implementation internally calls the Python XGBoost training API, so it can support it.
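As a concrete illustration of the point above, a minimal sketch of a custom callback passed through the regular Python training API, which is what a Python-based wrapper can forward; the training data here is synthetic and illustrative.

```python
import numpy as np
import xgboost as xgb

class LogEvalCallback(xgb.callback.TrainingCallback):
    """Prints the evaluation log after every boosting round."""

    def after_iteration(self, model, epoch, evals_log):
        print(epoch, evals_log)
        return False  # returning True would stop training early

X = np.random.rand(200, 4)              # illustrative data
y = np.random.randint(2, size=200)
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train(
    {"objective": "binary:logistic"},
    dtrain,
    num_boost_round=10,
    evals=[(dtrain, "train")],
    callbacks=[LogEvalCallback()],
)
```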
@trivialfis I created a draft PR #8020
Closing this issue, since @WeichenXu123 has made the PR, which is based on the xgboost Python package.
For now, neither XGBoost4j-Spark nor XGBoost4j-Spark-GPU supports running XGBoost in a PySpark environment, which may make XGBoost4j hard to use for users who know nothing about the Scala language.
There have been two PRs about how to make XGBoost support PySpark: #4656 and #5658. I personally prefer #4656, which bundles the XGBoost4j-Spark Python wrapper into the XGBoost4j-Spark.jar, so that users do not need to add extra jars when submitting XGBoost applications.
@thesuperzapper Could you continue to finish your previous PR?