Spark backend #1832

johnynek · 2018-03-02T03:34:43Z

This is a very basic beginning of a Spark backend.

It is not complete, but does support map-only operations.

There is one big question: can we really just lie to Spark and say we have AnyRef everywhere? I think it may just make serialization worse (Kryo writing the class names), but maybe we can work around that later. Scalding lets configs name registered classes, so perhaps we can pass that information to Spark somehow.

cc @ianoc

johnynek · 2018-03-04T19:37:15Z

@ianoc can you take a look?

This is not finished, but what we have is testable, and in the interest of keeping the PRs small I'd like to merge this and then follow up with more parts:

(this) basic framework, map-only operation support
reduce operation support
join support
full Mode/Execution support

johnynek · 2018-03-05T22:20:12Z

The build fails because Spark is not available for Scala 2.12. I will remove the Spark 2.12 build from CI.

ianoc · 2018-03-06T14:50:18Z

scalding-spark/src/main/scala/com/twitter/scalding/spark_backend/SparkBackend.scala

+object SparkPlanner {
+
+  // TODO, this may be just inefficient, or it may be wrong
+  implicit private def fakeClassTag[A]: ClassTag[A] = ClassTag(classOf[AnyRef]).asInstanceOf[ClassTag[A]]


I think they just register these with Kryo, so I imagine this should just hit the inefficient paths. For a normal execution app, though, I guess this does cost some performance.
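For reference, a minimal sketch of what the fake ClassTag changes at runtime, in plain Scala with no Spark involved. The object name FakeClassTagDemo and the helper componentOf are made up for illustration; only fakeClassTag mirrors the diff above. The sketch compares a real ClassTag[String] with the faked one.

```scala
import scala.reflect.ClassTag

object FakeClassTagDemo {
  // Same trick as in the diff: claim that every A is AnyRef.
  def fakeClassTag[A]: ClassTag[A] =
    ClassTag(classOf[AnyRef]).asInstanceOf[ClassTag[A]]

  // Hypothetical helper: the element class of an array built from the tag.
  def componentOf[A](ct: ClassTag[A]): Class[_] =
    ct.newArray(0).getClass.getComponentType

  // A compiler-provided tag knows the real class.
  val real: ClassTag[String] = implicitly[ClassTag[String]]
  // The faked tag has the same static type but reports AnyRef at runtime.
  val fake: ClassTag[String] = fakeClassTag[String]
}
```

Here FakeClassTagDemo.real.runtimeClass is classOf[String] while FakeClassTagDemo.fake.runtimeClass is classOf[AnyRef], so any code that consults the tag sees Object rather than the real element type.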

ianoc · 2018-03-06T14:53:38Z

scalding-spark/src/main/scala/com/twitter/scalding/spark_backend/SparkBackend.scala

+        case (ForceToDisk(pipe), rec) =>
+          rec(pipe).persist(StorageLevel.DISK_ONLY)
+        case (Fork(pipe), rec) =>
+          rec(pipe).persist(StorageLevel.MEMORY_ONLY)


No need to update this here, but I'd leave both of these at DISK_ONLY until we can figure out how and when we can upgrade the storage level. (Spark also persists the output of shuffles, so we might be able to get the planner to recognize when shuffle data is already there and skip the persist in that case.)

ianoc · 2018-03-06T14:58:51Z

scalding-spark/src/main/scala/com/twitter/scalding/spark_backend/SparkBackend.scala

+          ???
+        case (slk @ SumByLocalKeys(_, _), rec) =>
+          def sum[K, V](sblk: SumByLocalKeys[K, V]): R[(K, V)] = {
+            // we can use Algebird's SummingCache https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/SummingCache.scala#L36


Since these partitions are usually on disk or can fit in memory, we might just want MapAlgebra.sumByKey here (or sort and fold, I guess).
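A minimal sketch of the in-memory option, assuming all keys of a partition fit in memory. This is a stand-in written against plain Scala collections, not Algebird's MapAlgebra.sumByKey: the object name LocalSum and the explicit plus function (used in place of a Semigroup) are assumptions for illustration.

```scala
object LocalSum {
  // Combine all values that share a key, in arrival order.
  // `plus` stands in for a Semigroup's plus; it is called as
  // plus(accumulated, next) so non-commutative operations keep their order.
  def sumByKey[K, V](pairs: Iterator[(K, V)])(plus: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      // First value for a key is stored as-is; later ones are merged in.
      acc.updated(k, acc.get(k).fold(v)(plus(_, v)))
    }
}
```

In a Spark backend something like this could presumably be applied per partition (for example through mapPartitions), which would emit one pair per distinct key in each partition before any shuffle. A bounded cache such as SummingCache would be the alternative if partitions cannot be assumed to fit in memory.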

ianoc · 2018-03-06T15:01:20Z

Some comments, but they are more of an ongoing discussion than anything else. LGTM.

johnynek · 2018-03-06T18:53:05Z

Thanks for the comments. These seem like good points to keep in mind as we optimize. All of your comments are doable for sure. I will address them in the follow-ups.

WIP: Spark backend

5d471cf

This was referenced Mar 3, 2018

Add Resolver type to clean up non-cascading backends #1835

Merged

Add a spark backend #1741

Open

Simplify the Memory platform using Dagon memoization #1836

Merged

johnynek added 2 commits March 3, 2018 14:49

remove the casts in spark platform

afe3021

add tests of current functionality

5441b2c

johnynek changed the title ~~WIP: Spark backend~~ Spark backend Mar 4, 2018

only test spark on 2.11, no 2.12

35be6e3

ianoc reviewed Mar 6, 2018

johnynek merged commit db64ad3 into develop Mar 6, 2018
