[STORM-2713] when the connection to the first zkserver is timeout,storm-kafka's kafkaspout will throw a exception #2297

liu-zhaokun · 2017-08-30T01:57:21Z

https://issues.apache.org/jira/browse/STORM-2713
when the connection to the first zkserver is timeout,storm-kafka's kafkaspout will throw a exception without attempting to connect other zkserver,even zk can also work with one node down.

…rm-kafka's kafkaspout will throw a exception

HeartSaVioR · 2017-08-31T03:23:18Z

external/storm-kafka/src/jvm/org/apache/storm/kafka/DynamicBrokersReader.java

            return children.size();
        } catch (Exception e) {
-            throw new RuntimeException(e);
+            LOG.error(e.getMessage(), e);


Which value do you think this method returns when exception is caught, which action caller should take? Without proper action for failing back, it is better to throw exception and fail-fast.

And compilation fails on the change. Please see the result of Travis build.

@HeartSaVioR
When exception is caught,we should only print the error messages,so that users can be aware of there are some problems in their cluster such as one role of zk was down.

@HeartSaVioR
For example,one role of zk is down in users cluster,storm-kafka will throw a exception,even storm also can connect another zk role and works.I will check the result of Travis build.

When RuntimeException is propagated to call hierarchy, some place will leave a error message. (it is guaranteed when worker is crashing)

Btw, I guess you may need to think and try to answer my questions above by yourself.
Please note that core principal of Storm is 'fail-fast', because in many case things are not restorable and retryable, and we had to kill the process to prevent process to go with unexpected behavior.

That may be viable and possible, but you may need to add retry logic rather than just removing throw statement.

I think it doesn't affect 'fail-fast',I just want to guarantee users' Topology can work normally.

@HeartSaVioR
I mean topology should work even we can't connect one of the zkservers,because the data is shared among zkservers. But if we throw a exception,storm will not attempt to connect other zkserver,and exit.

srdo · 2017-09-19T20:34:31Z

I think Curator is supposed to handle retry and partial connection loss to the Zookeeper cluster for us. I'd rather we use Curator for this instead of trying to add our own retry loops around it.

liu-zhaokun · 2017-09-21T02:42:09Z

@srdo
I don't think it's related to retry, the key is how storm will do when the first zkserver is down.

srdo · 2017-09-21T16:17:12Z

@liu-zhaokun Yes, that's what I'm talking about. If the first zkserver is down, I think Curator is supposed to try another server in the connection string.

liu-zhaokun · 2017-09-23T03:21:09Z

@srdo
I guess you support my PR,right?

srdo · 2017-09-23T09:50:08Z

@liu-zhaokun No, not as it is currently. I think Curator already handles failing over to another Zookeeper server for us, so if the user passes a zkStr with multiple servers here https://github.com/apache/storm/blob/master/external/storm-kafka/src/jvm/org/apache/storm/kafka/DynamicBrokersReader.java#L61 and sets the retry policy at https://github.com/apache/storm/blob/master/external/storm-kafka/src/jvm/org/apache/storm/kafka/DynamicBrokersReader.java#L64 to a few attempts, then I don't see why we need to add another layer of retries on top of that?

Plus as @HeartSaVioR points out, this doesn't compile.

liu-zhaokun · 2017-09-27T02:31:09Z

@srdo
Thanks for your reply.I mean if curator has retried several times and failed,the code where I modify will throw a exception,and it will never attempt to connect another zkserver. Could you help me solve doubts?
I will check the CI later.

We are closing stale Pull Requests to make the list more manageable. Please re-open any Pull Request that has been closed in error. Closes apache#608 Closes apache#639 Closes apache#640 Closes apache#648 Closes apache#662 Closes apache#668 Closes apache#692 Closes apache#705 Closes apache#724 Closes apache#728 Closes apache#730 Closes apache#753 Closes apache#803 Closes apache#854 Closes apache#922 Closes apache#986 Closes apache#992 Closes apache#1019 Closes apache#1040 Closes apache#1041 Closes apache#1043 Closes apache#1046 Closes apache#1051 Closes apache#1078 Closes apache#1146 Closes apache#1164 Closes apache#1165 Closes apache#1178 Closes apache#1213 Closes apache#1225 Closes apache#1258 Closes apache#1259 Closes apache#1268 Closes apache#1272 Closes apache#1277 Closes apache#1278 Closes apache#1288 Closes apache#1296 Closes apache#1328 Closes apache#1342 Closes apache#1353 Closes apache#1370 Closes apache#1376 Closes apache#1391 Closes apache#1395 Closes apache#1399 Closes apache#1406 Closes apache#1410 Closes apache#1422 Closes apache#1427 Closes apache#1443 Closes apache#1462 Closes apache#1468 Closes apache#1483 Closes apache#1506 Closes apache#1509 Closes apache#1515 Closes apache#1520 Closes apache#1521 Closes apache#1525 Closes apache#1527 Closes apache#1544 Closes apache#1550 Closes apache#1566 Closes apache#1569 Closes apache#1570 Closes apache#1575 Closes apache#1580 Closes apache#1584 Closes apache#1591 Closes apache#1600 Closes apache#1611 Closes apache#1613 Closes apache#1639 Closes apache#1703 Closes apache#1711 Closes apache#1719 Closes apache#1737 Closes apache#1760 Closes apache#1767 Closes apache#1768 Closes apache#1785 Closes apache#1799 Closes apache#1822 Closes apache#1824 Closes apache#1844 Closes apache#1874 Closes apache#1918 Closes apache#1928 Closes apache#1937 Closes apache#1942 Closes apache#1951 Closes apache#1957 Closes apache#1963 Closes apache#1964 Closes apache#1965 Closes apache#1967 Closes apache#1968 Closes apache#1971 Closes apache#1985 Closes apache#1986 Closes apache#1998 Closes apache#2031 Closes apache#2032 Closes apache#2071 Closes apache#2076 Closes apache#2108 Closes apache#2119 Closes apache#2128 Closes apache#2142 Closes apache#2174 Closes apache#2206 Closes apache#2297 Closes apache#2322 Closes apache#2332 Closes apache#2341 Closes apache#2377 Closes apache#2414 Closes apache#2469

[STORM-2713] when the connection to the first zkserver is timeout,sto…

aa46372

…rm-kafka's kafkaspout will throw a exception

HeartSaVioR reviewed Aug 31, 2017

View reviewed changes

d2r mentioned this pull request Oct 16, 2018

STORM-3250: Closes Pull Requests unchanged in 2018 #2880

Merged

asfgit closed this in #2880 Oct 22, 2018

[STORM-2713] when the connection to the first zkserver is timeout,storm-kafka's kafkaspout will throw a exception #2297

[STORM-2713] when the connection to the first zkserver is timeout,storm-kafka's kafkaspout will throw a exception #2297

Uh oh!

Conversation

liu-zhaokun commented Aug 30, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

srdo commented Sep 19, 2017

Uh oh!

liu-zhaokun commented Sep 21, 2017

Uh oh!

srdo commented Sep 21, 2017

Uh oh!

liu-zhaokun commented Sep 23, 2017

Uh oh!

srdo commented Sep 23, 2017

Uh oh!

liu-zhaokun commented Sep 27, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants