
Conversation

@danny0405

@danny0405 danny0405 commented Jan 13, 2017

For legacy reasons, the config TOPOLOGY-TASKS is considered first when scheduling a topology. For a component, if the user doesn't specify TOPOLOGY-TASKS, Storm simply sets it equal to the component's parallelism hint, and scheduling is then based on TOPOLOGY-TASKS.
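For context, here is a minimal sketch of how TOPOLOGY-TASKS relates to the parallelism hint, using Storm's standard TopologyBuilder API; MySpout/MyBolt are placeholder component classes and the package name assumes Storm 1.x+. When setNumTasks is not called, the task count defaults to the hint:

    import org.apache.storm.topology.TopologyBuilder;

    TopologyBuilder builder = new TopologyBuilder();
    // Parallelism hint 1, no setNumTasks(): S1 defaults to 1 task in 1 executor.
    builder.setSpout("S1", new MySpout(), 1);
    // Parallelism hint 2, no setNumTasks(): B1 defaults to 2 tasks in 2 executors (1:1).
    builder.setBolt("B1", new MyBolt(), 2).shuffleGrouping("S1");
    // Explicit setNumTasks(8): B2 runs 8 tasks packed into 4 executors (2 tasks each).
    builder.setBolt("B2", new MyBolt(), 4).setNumTasks(8).shuffleGrouping("B1");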

This works in most cases, but not for the Rebalance command. When a rebalance is performed, the StormBase :component->executors attribute in ZooKeeper is overridden; that attribute is used to partition a component's tasks into executors. As noted above, TOPOLOGY-TASKS is treated here as the real number of tasks for each component, and something odd happens:

If we rebalance a component to a larger number of executors, it simply doesn't work: the smaller TOPOLOGY-TASKS value (unchanged since the topology was first submitted) is partitioned into the larger executor count read from ZooKeeper as overridden by the Rebalance command. Rebalancing to a smaller number of executors, however, works fine.

I see that Storm now supports a command like this: [storm rebalance topology-name [-w wait-time-secs] [-n new-num-workers] [-e component=parallelism]*], which indicates that a user can freely override a component's parallelism. I think it makes more sense to actually support this, and the old restriction is meaningless.

This is the JIRA ticket: STORM-2286

@danny0405 danny0405 changed the title Strom Rebalance command should support arbitrary component parallelism STORM-2286 Strom Rebalance command should support arbitrary component parallelism Jan 13, 2017
@danny0405
Author

@erikdw Could you please help me to review this?

@erikdw
Contributor

erikdw commented Jan 16, 2017

@danny0405 : given that I personally find no value in having this separate task vs. executor distinction that Storm supports (i.e., allowing for a different ratio than 1:1 for tasks to executors), I'm probably not the best person to review this. :) But it would help if you gave some specific examples, instead of only talking abstractly about the problem. Also just FYI, you have a typo in the headline for this PR as well as the storm ticket (Strom --> Storm).

@danny0405 danny0405 changed the title STORM-2286 Strom Rebalance command should support arbitrary component parallelism STORM-2286 Storm Rebalance command should support arbitrary component parallelism Jan 17, 2017
@danny0405
Author

danny0405 commented Jan 17, 2017

@erikdw thanks for the review, I have fixed the headline and the ticket title.

With this PR, the ratio of tasks to executors for a component is still 1:1; that was not the case for the rebalance command in Storm before this PR.

The key point is that, in the old Storm, TOPOLOGY-TASKS is not changed for a component when we rebalance; we only change the number of executors. So when we rebalance a component down to fewer executors, the old, larger TOPOLOGY-TASKS is partitioned into a smaller number of executors. This works (we do end up with fewer executors), but some executors will hold more than one task. When we rebalance a component up to more executors, it doesn't work at all, because the old, smaller TOPOLOGY-TASKS is partitioned into a larger number of executors, so some executors will have no tasks assigned and will not start.

For example, in the old Storm, suppose we have a topology named example-topology with components Spout S1 [1 executor] -- Bolt B1 [2 executors] -- Bolt B2 [4 executors]. The command storm rebalance example-topology -e B1=1 will work, and the single executor for B1 will hold 2 tasks; but the command storm rebalance example-topology -e B1=4 doesn't work at all, and in the end nothing changes.
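To make the failure mode concrete, the sketch below (plain Java, illustrative only; Storm's real partitioning lives in the nimbus/scheduling code) splits a fixed task count across a requested number of executors and shows why B1=1 works while B1=4 leaves executors empty:

    import java.util.ArrayList;
    import java.util.List;

    // Partition task ids [0..numTasks) into numExecutors groups, round-robin.
    static List<List<Integer>> partitionTasks(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int t = 0; t < numTasks; t++) {
            executors.get(t % numExecutors).add(t);
        }
        return executors;
    }

    // With B1 keeping its original 2 tasks:
    // partitionTasks(2, 1) -> [[0, 1]]           (rebalance -e B1=1: one executor, two tasks)
    // partitionTasks(2, 4) -> [[0], [1], [], []] (rebalance -e B1=4: two executors get no task)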

With this PR, all the commands above work, and the ratio of tasks to executors for a component is always 1:1.

@revans2 @erikdw please help me review this again, thanks very much ^_^

@erikdw
Contributor

erikdw commented Jan 17, 2017

I'm unable to reconcile the statements about the ratio always being 1:1 with the content of your "For example" paragraph. Those examples sound like you have a different number of executors than tasks. Am I misreading them?

In any case, I'm not a clojure expert, so you probably wanna get a core contributor (e.g., Bobby Evans, Jungtaek Lim, someone else) to look at this.

However, another Q that comes to mind is: does this behavior also exist in the master branch at HEAD? i.e., the Java-based code. If so then you'll wanna do this change in Java too, right?

@arunmahadevan
Contributor

@danny0405 right now, rebalancing to a higher number of executors is meaningful only if a component was provisioned with more tasks than executors at deploy time.

Maybe there is a reason why the number of tasks is not changed with rebalance. Say the tasks maintain some local state and their inputs are keyed on a field; if a rebalance alters the number of tasks, that state is no longer valid, since the same keys will no longer map to the same tasks.

The idea of dynamically increasing the number of tasks with rebalance is good, but we need to think through how to handle the key-based partitioning use cases.
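For readers unfamiliar with the concern: fields grouping routes a tuple to a task based on a hash of the grouping fields modulo the target component's task count, so changing the task count remaps keys. A schematic illustration (not Storm's actual implementation):

    // Schematic only: key -> task index for a fields-grouped stream.
    static int targetTaskIndex(Object key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // targetTaskIndex("user-42", 4) and targetTaskIndex("user-42", 8) can differ,
    // so per-task local state keyed by "user-42" may end up on a different task
    // once the number of tasks changes.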

@danny0405
Author

danny0405 commented Jan 17, 2017

@arunmahadevan There are over 2,500 topologies on our cluster. In most cases, when we add workers to a topology we also want to increase the component executors, both to make full use of the new workers and to spread the executors more evenly among them. It's awful and time-wasting to kill and re-submit a topology just to increase the workers and executors.

As you said, there may be problems with grouping or something similar; I need to make sure. Thanks for your suggestion.

@danny0405
Author

danny0405 commented Jan 17, 2017

@HeartSaVioR @arunmahadevan @revans2

I see that, for a topology being rebalanced, Storm now marks all of its slots as free for the next round of scheduling and schedules it together with the other topologies, which means its executors may very well end up on different slots. Here is the snippet:

existing-assignments (into {} (for [tid assigned-topology-ids]
                                        ;; for the topology which wants rebalance (specified by the scratch-topology-id)
                                        ;; we exclude its assignment, meaning that all the slots occupied by its assignment
                                        ;; will be treated as free slot in the scheduler code.
                                        (when (or (nil? scratch-topology-id) (not= tid scratch-topology-id))
                                          {tid (.assignment-info storm-cluster-state tid nil)})))
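A rough Java rendering of the same logic (names such as Assignment, stormClusterState, assignedTopologyIds and scratchTopologyId are illustrative, mirroring the Clojure code above rather than quoting the Java port):

    import java.util.HashMap;
    import java.util.Map;

    // Skip the assignment of the topology being rebalanced (the "scratch" topology),
    // so every slot it currently occupies is treated as free by the scheduler.
    Map<String, Assignment> existingAssignments = new HashMap<>();
    for (String tid : assignedTopologyIds) {
        if (scratchTopologyId == null || !tid.equals(scratchTopologyId)) {
            existingAssignments.put(tid, stormClusterState.assignmentInfo(tid, null));
        }
    }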

If we change a component's executors to a smaller number, some executors will hold more than one task, which doesn't fit the current design; also, the workers containing the changed executors will be restarted.

So why not just kill all of its old workers and let them all be reassigned? It doesn't make much sense to keep a few old workers and re-launch only the others.

Personally, I think we should re-launch all the workers and let the executors be assigned evenly when rebalancing, so I have updated the PR to change the strategy.

Can you give me some suggestions? I would appreciate it very much.

d2r pushed a commit to d2r/storm that referenced this pull request Oct 16, 2018
We are closing stale Pull Requests to make the list more manageable.

Please re-open any Pull Request that has been closed in error.

Closes apache#608
Closes apache#639
Closes apache#640
Closes apache#648
Closes apache#662
Closes apache#668
Closes apache#692
Closes apache#705
Closes apache#724
Closes apache#728
Closes apache#730
Closes apache#753
Closes apache#803
Closes apache#854
Closes apache#922
Closes apache#986
Closes apache#992
Closes apache#1019
Closes apache#1040
Closes apache#1041
Closes apache#1043
Closes apache#1046
Closes apache#1051
Closes apache#1078
Closes apache#1146
Closes apache#1164
Closes apache#1165
Closes apache#1178
Closes apache#1213
Closes apache#1225
Closes apache#1258
Closes apache#1259
Closes apache#1268
Closes apache#1272
Closes apache#1277
Closes apache#1278
Closes apache#1288
Closes apache#1296
Closes apache#1328
Closes apache#1342
Closes apache#1353
Closes apache#1370
Closes apache#1376
Closes apache#1391
Closes apache#1395
Closes apache#1399
Closes apache#1406
Closes apache#1410
Closes apache#1422
Closes apache#1427
Closes apache#1443
Closes apache#1462
Closes apache#1468
Closes apache#1483
Closes apache#1506
Closes apache#1509
Closes apache#1515
Closes apache#1520
Closes apache#1521
Closes apache#1525
Closes apache#1527
Closes apache#1544
Closes apache#1550
Closes apache#1566
Closes apache#1569
Closes apache#1570
Closes apache#1575
Closes apache#1580
Closes apache#1584
Closes apache#1591
Closes apache#1600
Closes apache#1611
Closes apache#1613
Closes apache#1639
Closes apache#1703
Closes apache#1711
Closes apache#1719
Closes apache#1737
Closes apache#1760
Closes apache#1767
Closes apache#1768
Closes apache#1785
Closes apache#1799
Closes apache#1822
Closes apache#1824
Closes apache#1844
Closes apache#1874
Closes apache#1918
Closes apache#1928
Closes apache#1937
Closes apache#1942
Closes apache#1951
Closes apache#1957
Closes apache#1963
Closes apache#1964
Closes apache#1965
Closes apache#1967
Closes apache#1968
Closes apache#1971
Closes apache#1985
Closes apache#1986
Closes apache#1998
Closes apache#2031
Closes apache#2032
Closes apache#2071
Closes apache#2076
Closes apache#2108
Closes apache#2119
Closes apache#2128
Closes apache#2142
Closes apache#2174
Closes apache#2206
Closes apache#2297
Closes apache#2322
Closes apache#2332
Closes apache#2341
Closes apache#2377
Closes apache#2414
Closes apache#2469
@asfgit asfgit closed this in #2880 Oct 22, 2018