
Broadcast and DOP Issue in Flink #32

Open · carabolic opened this issue May 30, 2016 · 14 comments

@carabolic (Member) commented May 30, 2016

It seems that there is an issue with how Flink handles broadcast DataSets.

Problem

Let's assume we have a Flink cluster with N = 20 nodes and T = 2 tasks per node, hence DOP = 20 * 2 = 40. If we now have a job that reads inputSize = 5 MB of data into a single DataSet and subsequently broadcasts this DataSet to the mappers (with max DOP), the data gets broadcast to every mapper in isolation, which means broadcastSize = DOP * inputSize = 40 * 5 MB = 200 MB needs to be transferred over the network.
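For illustration, here is a minimal sketch of the pattern in question, assuming the Flink Scala DataSet API; the object name, dataset sizes, and the dummy map function are made up and not taken from any actual job:

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._

object BroadcastPattern {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(40) // DOP = N * T = 20 * 2 = 40

    val small: DataSet[Long] = env.generateSequence(1L, 650000L)    // ~5 MB of Longs
    val large: DataSet[Long] = env.generateSequence(1L, 100000000L) // the big side

    // With the behaviour described above, every one of the 40 parallel map tasks
    // receives its own copy of "small", i.e. ~40 * 5 MB instead of ~20 * 5 MB on the wire.
    val result = large
      .map(new RichMapFunction[Long, Long] {
        override def map(x: Long): Long =
          x + getRuntimeContext.getBroadcastVariable[Long]("small").size()
      })
      .withBroadcastSet(small, "small")

    println(result.count())
  }
}
```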

In our case it becomes obvious when running the LinRegDS.dml script on flink_hybrid. The second Flink job involves a MapmmFLInstruction, which broadcasts the smaller matrix to all the mappers. For a DOP of 250 this results in about 10 GB of broadcast data.

Solution

Since all tasks on a node run in the same JVM, it would be better to broadcast to the TaskManagers only, which would then pass a simple reference to the tasks they are responsible for. For the example above, this reduces the broadcast size to broadcastSize = N * inputSize = 20 * 5 MB = 100 MB.

For the LinRegDS.dml use case this fix will reduce the size of the broadcast by a factor of 16 (the number of tasks per TaskManager in that setup). Hence it will only need to broadcast 10 GB / 16 = 0.625 GB of data.

Workaround

For now this can be worked around by setting the DOP really low for jobs that include a broadcast, e.g. as sketched below.
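A hedged sketch of the workaround, reusing the hypothetical names from the sketch above (env, large, small); it simply caps the parallelism of the job or of the operator that consumes the broadcast set:

```scala
// Option 1: lower the job-wide DOP for jobs that contain a broadcast
env.setParallelism(20) // e.g. one task per node instead of one per slot

// Option 2: lower only the operator that consumes the broadcast set
val result = large
  .map(new CountingMapper())          // hypothetical RichMapFunction that reads "small"
  .withBroadcastSet(small, "small")
  .setParallelism(20)                 // fewer map tasks, hence fewer copies of "small"
```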

Follow Up

I will investigate a little bit more to see if this is a known issue for Flink and whether there are already ways to work around the problem, and maybe even open a PR against Flink.

carabolic self-assigned this May 30, 2016
@aalexandrov

@carabolic This can also explain some of the results observed by @bodegfoh with respect to the logreg benchmark of Spark vs. Flink

@akunft commented May 31, 2016

It would be nice if we could have a simple example showing this and open a JIRA issue with Flink.
@carabolic Could you do that?

@aalexandrov

@carabolic Maybe you can bootstrap a new Peel bundle, modify the wordcount job into something that better highlights the issue, execute it on one of the clusters, and provide the chart as part of the JIRA issue.

The IBM machines have 48 cores on each node, which should make the effect quite visible.

@carabolic (Member, Author) commented May 31, 2016

I'm preparing a bundle right now. The idea is to generate a vector: DataSet[Long] with |vector| = (s * 1024 * 1024) / 8 elements, where s is the desired size in MB to be broadcast. This DataSet vector is then broadcast to each mapper. The number of mappers can be controlled by a dop parameter.
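For reference, a minimal sketch of what such a job could look like, assuming the Flink Scala DataSet API; the object name, argument handling, and the dummy map function are illustrative assumptions, not the actual bundle code:

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._

object BroadcastBenchmark {
  def main(args: Array[String]): Unit = {
    val s   = args(0).toInt // desired broadcast size in MB
    val dop = args(1).toInt // number of mappers that receive the broadcast

    val env = ExecutionEnvironment.getExecutionEnvironment

    // |vector| = (s * 1024 * 1024) / 8 elements, since a Long takes 8 bytes
    val vector: DataSet[Long] = env.generateSequence(1L, (s.toLong * 1024 * 1024) / 8)

    // one input element per mapper so that exactly `dop` map tasks are started
    val mappers: DataSet[Long] = env.generateSequence(1L, dop).setParallelism(dop)

    val result = mappers
      .map(new RichMapFunction[Long, Long] {
        override def map(x: Long): Long =
          getRuntimeContext.getBroadcastVariable[Long]("vector").size().toLong
      })
      .withBroadcastSet(vector, "vector")
      .setParallelism(dop)

    println(result.count()) // force execution
  }
}
```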

@aalexandrov

Sounds good.

@aalexandrov commented May 31, 2016

To make sure that N mappers are started, you can instantiate using fromParallelCollection with a NumberSequenceIterator with N elements and subsequently set the DOP to N.

// Scala API: one element per desired mapper, pinned to parallelism N
// (NumberSequenceIterator comes from org.apache.flink.util)
environment
  .fromParallelCollection(new NumberSequenceIterator(1, N))
  .setParallelism(N)

@carabolic (Member, Author)

I've uploaded the initial version of the bundle to GitHub, and our assumption seems to be true. Here are my initial results for a 10 MB broadcast DataSet, obtained manually from the Flink web frontend:

| #TaskManagers | #Tasks (total) | Size of broadcast (actual) | Size of broadcast (expected) |
| --- | --- | --- | --- |
| 16 | 20 | 200 MB | 160 MB |
| 25 | 400 | 4000 MB | 250 MB |

And the results from running `./peel.sh query:runtimes --connection h2 broadcast.dev`:

| suite | name | min | max | median |
| --- | --- | --- | --- | --- |
| broadcast.dev | broadcast.scale.up.10 | 5309 | 8212 | 22993 |
| broadcast.dev | broadcast.scale.up.20 | 7591 | 13909 | 26392 |
| broadcast.dev | broadcast.scale.up.30 | 9976 | 14223 | 33992 |
| broadcast.dev | broadcast.scale.up.400 | 104152 | 123440 | 343497 |

So the issue seems to be real. It also seems to have a devastating impact on the runtime of the jobs.

@akunft commented May 31, 2016

Thanks for the work.

It would be nice to have a run with a stable number of TaskManagers (25) and an increasing number of slots per TaskManager (1, 2, 4, 8, 16). I think this would nicely reflect the overhead of broadcasting the vector once per slot instead of once per TaskManager.

@aalexandrov

Can we rerun with Peel rc4?

@aalexandrov

I fixed some data bugs and added an event extractor for dataflows. We should be able to generate some network utilization plots per TaskManager if you add dstat as a system dependency for your experiment.

@aalexandrov

Some preliminary results from a run on cloud-11:

| name | median time (ms) | run | run id |
| --- | --- | --- | --- |
| broadcast.01 | 10799 | 1 | 1506351146 |
| broadcast.02 | 22574 | 1 | 1506352107 |
| broadcast.04 | 31579 | 2 | 1506354030 |
| broadcast.08 | 59960 | 2 | 1506357874 |
| broadcast.16 | 119608 | 3 | 1506385744 |

@akunft commented Jun 6, 2016

I think these results show the problem and we can open a JIRA issue.
What do you think?

@aalexandrov

Alright, the WIP benchmark can be pointed to as well.

@FelixNeutatz

I have a first prototype running to solve the broadcast issue: https://github.com/FelixNeutatz/incubator-flink/commits/experimentWithBroadcast
