Spark executor overheads #543

kondratyevd · 2021-06-07T19:30:13Z

kondratyevd
Jun 7, 2021

Follow-up on our discussion about Spark executor:
in my tests (partition size 100k), ~50% of the entire processing time is spent on this line:

coffea/coffea/processor/templates/spark.py.tmpl

Line 7 in 6d54853

    
           columns = [{% for col in cols %}awkward.Array({{col}}){{ "," if not loop.last }}{% endfor %}]

which I think is the conversion from pandas(?) to awkward.Array, done before the data is even loaded into a processor instance.

I haven't completely figured out other overheads yet, but they are less significant.
Actual useful work with Spark executor currently takes 25-30% of total processing time.

lgray · 2021-06-07T19:33:27Z

lgray
Jun 7, 2021
Maintainer

Yeah, as I was indicating, the churn through pandas is a really nasty time sink.

It's pretty easy to hack this into Spark, using Arrow entirely and skipping pandas (which makes the operation you see here vanishingly small):
lgray/spark@58a4854

But you have to hack and re-roll your own spark distribution.

1 reply

jpivarski Jun 7, 2021

If "col" is a NumPy array of numbers, that line would be okay, but if it contains Python lists (which would be the case if it's jagged data in Pandas), then it's going to be expensive. The sad part about it is that the data start as Arrow arrays, which can be zero-copy passed into Awkward Arrays (even if jagged) with ak.from_arrow, but instead spark_udf makes Pandas and we have to slowly iterate through it. THAT is the PR that Spark wouldn't accept after something like a year of debates.
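
To illustrate the cost difference described above, here is a minimal pure-Python sketch. It does not call Awkward or Arrow; it only assumes the usual flat layout for jagged data (one `content` buffer plus an `offsets` array, which is how Arrow list arrays are laid out). The names `flatten_jagged` and `get_row` are made up for this illustration.

```python
from array import array


def flatten_jagged(rows):
    """Build offsets + flat content from a list of lists.

    This is the slow path: every number is visited through Python objects.
    """
    offsets = array("q", [0])   # 64-bit integer offsets
    content = array("d")        # 64-bit float values
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content


def get_row(offsets, content, i):
    """Return row i as a view on the flat buffer, without copying the values."""
    view = memoryview(content)
    return view[offsets[i]:offsets[i + 1]]


rows = [[1.0, 2.0], [], [3.0, 4.0, 5.0]]
offsets, content = flatten_jagged(rows)
print(list(offsets))
print(get_row(offsets, content, 2).tolist())
```

Building the buffers touches every element one at a time, which is roughly what happens when jagged data arrive as Python lists inside pandas. Once the data are already in the flat form, taking a row is just a slice of existing memory, which is why handing Arrow buffers straight to Awkward can avoid the per-element work.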

kondratyevd · 2021-06-07T20:36:42Z

kondratyevd
Jun 7, 2021
Author

@lgray thanks for the link! We are going to implement the fix and see if the performance will be close to that with Dask executor.

Personally, I would not be comfortable showing performance studies for native Spark, knowing that the timing is completely dominated by a feature that can be fixed so easily. The entire Dask vs. Spark comparison in this case just boils down to the effect of that feature.

1 reply

jpivarski Jun 7, 2021

The performance studies wouldn't reflect on Spark itself, but they would reflect on the way we'd have to use a non-forked version of Spark (unless we start writing physics code in Scala, which was a direction I was considering when I started this in 2016).

kondratyevd · 2021-08-12T18:00:30Z

kondratyevd
Aug 12, 2021
Author

@lgray I tried your suggestion, but I'm not sure if it's working: timing did not improve, and if I put print() statements near the lines that we are hacking in pyspark, I don't see those printouts in the output.

Is there any way to check if these functions are actually getting called?
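
One partial check, sketched here with the standard library only, is to ask Python which file a module would be imported from. The helper name `module_origin` is made up for this illustration, and the demo uses the stdlib module `json` so that it runs anywhere; for this thread one would pass `"pyspark"` instead.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Demo with a stdlib module; replace "json" with "pyspark" in a Spark environment.
print(module_origin("json"))
```

If the reported path points inside a `.zip` archive rather than at the edited source directory, the edited files are not the ones being imported. This only answers the question for the Python process that runs the check; the executors may resolve the module differently.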

lgray · 2021-08-12T18:02:43Z

lgray
Aug 12, 2021
Maintainer

That means your spark build didn't catch the changes. How'd you roll up the spark distribution?

-L

1 reply

kondratyevd Aug 12, 2021
Author

We install Spark into user space as follows:

- extract spark-2.4.4-bin-hadoop2.7.tgz into a directory where I have full access
- create a new module with LMOD, which knows the path to the location where the spark distribution was extracted
- in the module settings, set up necessary envs like JAVA_HOME, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON
- load the module
- run the code with Spark executor

I checked that the code doesn't run before the module is loaded, but does run after it is loaded. So I'm almost sure that it accesses the correct location.

lgray · 2021-08-12T18:21:13Z

lgray
Aug 12, 2021
Maintainer

So there's a secret python zip ball in the spark distribution that you have to rebuild in order for this to take. You need to:

- edit the code
- rebuild your own spark.tgz
- then deploy the *rebuilt* spark tgz

Annoying, but it works!
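
A quick way to see whether an edit actually made it into a zip archive is to look inside the archive for a marker string. The sketch below uses only the standard library; `zip_member_contains`, the marker text, and the demo file names are all made up for this illustration, and the demo runs against a throwaway archive rather than a real Spark install.

```python
import os
import tempfile
import zipfile


def zip_member_contains(zip_path, member, marker):
    """Return True if the text `marker` appears in file `member` inside the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        text = zf.read(member).decode("utf-8", errors="replace")
    return marker in text


# Self-contained demo on a temporary archive (not a real Spark distribution).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("pkg/mod.py", "# MY_EDIT_MARKER\nx = 1\n")
    found = zip_member_contains(demo_zip, "pkg/mod.py", "MY_EDIT_MARKER")
    missing = zip_member_contains(demo_zip, "pkg/mod.py", "NOT_THERE")

print(found, missing)
```

Against a real install, `zip_path` would presumably be the Python archive under the Spark directory (binary distributions ship one at `python/lib/pyspark.zip`, as far as I know), `member` the path of the edited file inside that archive, and `marker` a comment added while editing. If the marker is missing from the archive, the edits were made only to the loose source files.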

-L

kondratyevd · 2021-08-12T19:14:59Z

kondratyevd
Aug 12, 2021
Author

@lgray

tar -cvzf spark-2.4.4-bin-hadoop2.7.tgz spark-2.4.4-bin-hadoop2.7 (version with edited files)
tar -xvzf spark-2.4.4-bin-hadoop2.7.tgz (into other directory)

this didn't change much. Am I supposed to repack it somehow differently?

lgray · 2021-08-12T19:43:55Z

lgray
Aug 12, 2021
Maintainer

Follow the instructions here: https://spark.apache.org/docs/2.4.4/building-spark.html#building-a-runnable-distribution

-L
