
Conversation

@jegonzal
Contributor

This change removes bytecode inspection from the triplet operations and introduces explicit join-elimination flags. The explicit flags make join elimination more robust.

@jegonzal
Contributor Author

@ankurdave and @rxin I have not updated the applications to use the new explicit flags. I will do that in this PR pending approval for the API changes.

Contributor

mapUsesSrcAttr -> mapUsesDstAttr

@SparkQA

SparkQA commented Oct 15, 2014

QA tests have started for PR 2815 at commit d32ec1c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 15, 2014

QA tests have finished for PR 2815 at commit d32ec1c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jegonzal jegonzal changed the title Remove Bytecode Inspection for Join Elimination [SPARK-3936] Remove Bytecode Inspection for Join Elimination Oct 15, 2014
@jegonzal
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Oct 15, 2014

QA tests have started for PR 2815 at commit 459fa8c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 15, 2014

QA tests have finished for PR 2815 at commit 459fa8c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 15, 2014

QA tests have started for PR 2815 at commit e8a8a8d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 15, 2014

QA tests have finished for PR 2815 at commit e8a8a8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jegonzal
Contributor Author

What is the status on this patch? I would like to merge it soon so that the Python GraphX API can support these additional flags.

Contributor

I'd create an enum for that.

@jegonzal
Contributor Author

jegonzal commented Nov 1, 2014

I added the TripletFields enum and updated all the dependent files. I can't "deprecate" the old API since the old and new methods have the same signature up to default arguments, and I don't want to require that a TripletFields always be specified.
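For context, a standalone sketch of what such an explicit triplet-fields descriptor could look like (this is an illustration of the idea, not the implementation merged into Spark; field and member names here mirror the eventual API but are assumptions at this point in the discussion):

```scala
// Hypothetical sketch of an explicit triplet-fields descriptor: each
// preset records whether an operation reads the source attribute,
// destination attribute, or edge attribute, so GraphX can skip the
// corresponding joins instead of guessing via bytecode inspection.
class TripletFields(val useSrc: Boolean, val useDst: Boolean, val useEdge: Boolean)

object TripletFields {
  val None     = new TripletFields(false, false, false)
  val EdgeOnly = new TripletFields(false, false, true)
  val Src      = new TripletFields(true,  false, true)
  val Dst      = new TripletFields(false, true,  true)
  val All      = new TripletFields(true,  true,  true)
}

// A map function that only reads srcAttr would be declared with
// TripletFields.Src, letting the dst-side join be eliminated.
println(TripletFields.Src.useDst)  // false
```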

@jegonzal
Contributor Author

jegonzal commented Nov 1, 2014

At this point I could also imagine having a separate overload for each version:

mapTriplets(f: Edge => ED2)

mapTriplets(f: SrcEdge => ED2)

mapTriplets(f: DstEdge => ED2)

mapTriplets(f: Triplet => ED2)

Though to do this would require users to annotate their functions:

g.mapTriplets( (t: SrcEdge) => t.src )

What do you all think?

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22667 has finished for PR 2815 at commit 4097ca7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 1, 2014

How about

mapSourceTriplets, mapDst ...

Contributor

I don't think this change is correct - see the comment on the line above.

@ankurdave
Contributor

This design looks good to me. The alternatives are all worse:

@jegonzal It won't work to overload on the function type like that because of type erasure:

$ scala
scala> class A { def f(g: Int => Int): Int = g(1); def f(g: String => Int): Int = g("hello") }
<console>:7: error: double definition:
def f(g: Int => Int): Int at line 7 and
def f(g: String => Int): Int at line 7
have same type after erasure: (g: Function1)Int
       class A { def f(g: Int => Int): Int = g(1); def f(g: String => Int): Int = g("hello") }
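A common workaround for this erasure collision, shown here as a standalone sketch rather than anything proposed in the PR, is to give one overload an extra implicit parameter so the erased signatures differ:

```scala
// Standalone sketch: the two overloads below would collide after erasure,
// but the extra DummyImplicit parameter gives them distinct JVM signatures.
class A {
  def f(g: Int => Int): Int = g(1)
  def f(g: String => Int)(implicit d: DummyImplicit): Int = g("hello")
}

val a = new A
println(a.f((x: Int) => x + 1))        // 2
println(a.f((s: String) => s.length))  // 5
```

Even with this trick, though, callers would still have to annotate the lambda's parameter type so overload resolution can pick a winner, which is exactly the usability cost noted above.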

@rxin That will prevent us from passing triplet field information along, as we do in Pregel.

Contributor

These comments are out of date now that we have tripletFields

@rxin
Contributor

rxin commented Nov 2, 2014

Let's not change the old API. Just deprecate it, and then add this tripletFields thing to the new API Ankur is going to submit.

ankurdave added a commit to ankurdave/spark that referenced this pull request Nov 5, 2014
aggregateMessages enables neighborhood computation similarly to
mapReduceTriplets, but it introduces two API improvements:

1. Messages are sent using an imperative interface based on EdgeContext
rather than by returning an iterator of messages. This is more
efficient, providing a 20.2% speedup on PageRank over
apache#3054 (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge
machines, sped up from 403 s to 322 s).

2. Rather than attempting bytecode inspection, the required triplet
fields must be explicitly specified by the user by passing a
TripletFields object. This fixes SPARK-3936.

Subsumes apache#2815.
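The imperative EdgeContext interface described in point 1 can be sketched in miniature (a standalone illustration, not the Spark implementation; all names here are invented for the sketch): instead of each edge returning an iterator of (vertexId, message) pairs, the send callback writes messages directly through a single reusable context, avoiding a per-edge iterator allocation.

```scala
// Standalone sketch of the EdgeContext idea: messages are written
// imperatively into an aggregation array instead of being returned
// as an iterator, so no Iterator is allocated per edge.
class MiniEdgeContext(agg: Array[Double]) {
  var srcId = 0; var dstId = 0
  var srcAttr = 0.0; var attr = 0.0
  def sendToDst(msg: Double): Unit = agg(dstId) += msg  // merge is +=
}

// Edges as (srcId, dstId, weight); vertex attributes are ranks.
val ranks = Array(1.0, 1.0, 1.0)
val edges = Seq((0, 1, 0.5), (0, 2, 0.5), (1, 2, 1.0))
val agg = new Array[Double](3)
val ctx = new MiniEdgeContext(agg)  // one reusable context, not one per edge
for ((s, d, w) <- edges) {
  ctx.srcId = s; ctx.dstId = d; ctx.srcAttr = ranks(s); ctx.attr = w
  ctx.sendToDst(ctx.srcAttr * ctx.attr)  // imperative send
}
println(agg.mkString(","))  // 0.0,0.5,1.5
```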
ankurdave added a commit to ankurdave/spark that referenced this pull request Nov 12, 2014
aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:

1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.

2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.

Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:

1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.

2. Internal iterators in aggregateMessages are inlined into a while loop.

In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).

Subsumes apache#2815. Also fixes SPARK-4173.

Author: Ankur Dave <ankurdave@gmail.com>

Closes apache#3100 from ankurdave/aggregateMessages and squashes the following commits:

f5b65d0 [Ankur Dave] Address @rxin comments on apache#3054 and apache#3100
1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
c85076d [Ankur Dave] Readability improvements
b567be2 [Ankur Dave] iter.foreach -> while loop
4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition
asfgit pushed a commit that referenced this pull request Nov 12, 2014

(cherry picked from commit faeb41d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@jegonzal jegonzal closed this Nov 12, 2014
@balduz

balduz commented Apr 20, 2015

Is the Pregel API going to be updated so that it uses aggregateMessages instead of mapReduceTriplets? Or is the whole Pregel class going to be deprecated in favor of this method?

@rxin
Contributor

rxin commented Apr 20, 2015

We should update Pregel to use aggregateMessages instead. Would you like to submit a patch for that?

@ankurdave
Contributor

@rxin @balduz We can update Pregel to use aggregateMessages without changing the API, but it won't provide any performance benefit since we'll still be handling iterators. Ultimately, the plan is to add a new version of Pregel that uses an aggregateMessages-style interface and has some other changes to expose activeness more explicitly.
