
Conversation

@HeartSaVioR
Contributor

What is this PR for?

To help users determine when the remote interpreter is not able to respond.
This is still WIP, and "how to help users" could be improved through discussion.

What type of PR is it?

Feature (Improvement?)

Todos

Is there a relevant Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-539

How should this be tested?

  1. Run a Spark paragraph to ensure the Spark remote interpreter process is running.
  2. kill -9 the Spark remote interpreter process.
  3. Run the paragraph again (it may show "broken pipe", or "connection refused" after ZEPPELIN-534 "Discard broken thrift Client instance" #575).
  4. Wait 30 seconds (or the remote interpreter connection timeout value) to let RemoteInterpreterProcess classify the process as timed out.
  5. Run the paragraph again (it shows org.apache.zeppelin.interpreter.InterpreterProcessHeartbeatFailedException to users).

Screenshots (if appropriate)

(Screenshot: 2015-12-29 7 34 04)

Questions

* Do the license files need updating? (No)
* Are there breaking changes for older versions? (No)
* Does this need documentation? (Maybe not)

@HeartSaVioR
Contributor Author

I couldn't create a unit test for this since it requires the remote interpreter process to be stopped or killed.
Please share your ideas if you have any.

@HeartSaVioR
Contributor Author

Discuss proper values for the heartbeat interval and the timeout check interval

Currently, RemoteInterpreterProcess sends a heartbeat every second and also checks for timeout every second.
Please let me know if you think that is too often.
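
For reference, below is a minimal sketch of what a 1-second heartbeat sender plus timeout check could look like; the class, field, and method names are assumptions for illustration, and only the idea of the thrift ping() call comes from this PR.

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  // Sketch only: names are illustrative, not the actual PR code.
  class HeartbeatSketch {
    // thrift client stand-in; ping() is the new thrift call proposed in this PR
    interface Client { String ping() throws Exception; }

    private volatile long lastHeartbeatTimestamp = System.currentTimeMillis();
    private final ScheduledExecutorService heartbeatExecutor =
        Executors.newSingleThreadScheduledExecutor();

    void startHeartbeat(final Client client) {
      heartbeatExecutor.scheduleAtFixedRate(new Runnable() {
        @Override
        public void run() {
          try {
            if ("pong".equals(client.ping())) {
              lastHeartbeatTimestamp = System.currentTimeMillis();
            }
          } catch (Exception e) {
            // leave the timestamp untouched; the timeout check below will notice
          }
        }
      }, 0, 1, TimeUnit.SECONDS);  // send a heartbeat every 1 sec
    }

    boolean isTimedOut(long timeoutMillis) {
      return System.currentTimeMillis() - lastHeartbeatTimestamp > timeoutMillis;
    }
  }

With something shaped like this, the send interval and the timeout threshold both live in one place, so they are easy to tune if 1 second turns out to be too often.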

Discuss how to let users know when the remote interpreter has timed out

Currently, Zeppelin can notify users by showing InterpreterProcessHeartbeatFailedException when executing any client-related work.
Please share your ideas to improve this feature.
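
Roughly, the guard before any client-related work could look like the following sketch; the field and method names are illustrative, and only InterpreterProcessHeartbeatFailedException itself is introduced by this PR.

  // Sketch only: surface a clear exception instead of a raw "broken pipe" error.
  class HeartbeatGuardSketch {
    static class InterpreterProcessHeartbeatFailedException extends RuntimeException {
      InterpreterProcessHeartbeatFailedException(String message) { super(message); }
    }

    private final HeartbeatSketch heartbeat;     // from the sketch above
    private final long connectTimeoutMillis;     // e.g. the interpreter connect timeout setting

    HeartbeatGuardSketch(HeartbeatSketch heartbeat, long connectTimeoutMillis) {
      this.heartbeat = heartbeat;
      this.connectTimeoutMillis = connectTimeoutMillis;
    }

    // call this before borrowing a thrift client for any client-related work
    void checkProcessAlive() {
      if (heartbeat.isTimedOut(connectTimeoutMillis)) {
        throw new InterpreterProcessHeartbeatFailedException(
            "Remote interpreter process did not respond to heartbeat within "
                + connectTimeoutMillis + " ms");
      }
    }
  }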

Discuss possible ways to restore the remote interpreter back to normal

At first, I thought it would be good to automatically force-kill and re-run the remote interpreter process when it times out. But the remote interpreter process is stateful, so users may not want that.
#480 could be a good way to let users restart the interpreter by hand if a problem occurs.
Please share your ideas to improve this feature.

@HeartSaVioR
Contributor Author

We can even provide a way to validate a Client instance with the new "ping".

  1. Implement validateObject() in ClientFactory.

  @Override
  public boolean validateObject(PooledObject<Client> p) {
    final Client client = p.getObject();

    try {
      // the connection is considered valid only if the ping round-trip succeeds
      return client.ping().equals("pong");
    } catch (Exception e) {
      // any thrift error means the underlying connection is broken
      return false;
    }
  }

  2. Build a strategy to exclude invalid objects by providing a GenericObjectPoolConfig to GenericObjectPool.

https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/GenericObjectPoolConfig.html

We can even set max idle, min idle, and max total, which helps control the total number of thrift connections per remote interpreter process.
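
For instance, wiring that up with commons-pool2 could look roughly like the sketch below; the class name, the pool limits, and the ClientFactory constructor arguments are placeholders rather than values from this PR.

  import org.apache.commons.pool2.impl.GenericObjectPool;
  import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

  // Sketch only: configure the pool so it calls validateObject() and evicts broken clients.
  class ClientPoolSketch {
    GenericObjectPool<Client> createClientPool(String host, int port) {
      GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
      poolConfig.setMaxTotal(10);        // cap thrift connections per remote interpreter process
      poolConfig.setMaxIdle(5);
      poolConfig.setMinIdle(1);
      poolConfig.setTestOnBorrow(true);  // run validateObject() (the ping) before lending a client
      poolConfig.setTestWhileIdle(true); // also validate idle clients from the evictor thread
      poolConfig.setTimeBetweenEvictionRunsMillis(30 * 1000L);
      return new GenericObjectPool<Client>(new ClientFactory(host, port), poolConfig);
    }
  }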

If we think it would be better to have, I'll address it in this PR, or in another PR (if we don't want to merge this one).

* introduce "ping" function to thrift
* every remote interpreter processes will have two additional threads
  * send "ping" to check that remote interpreter process is able to respond
  * check last heartbeat timestamp and determine it's timed out
* introduce InterpreterProcessHeartbeatFailedException
  * thrown when remote interpreter process is determined to timed out
@HeartSaVioR
Contributor Author

Actually, this is an adaptation of apache/storm#286 from Apache Storm.

@corneadoug
Contributor

@HeartSaVioR It's been quite some time, should we close this PR?

@HeartSaVioR
Contributor Author

OK. This has already gone stale, so I'll close it.
@corneadoug Thanks for noticing!

