
Conversation

@HeartSaVioR
Contributor

What is this PR for?

To help users determine when the remote interpreter is not able to respond.
This is still WIP, and "how to help users" could be improved through discussion.

What type of PR is it?

Feature (Improvement?)

Todos

Is there a relevant Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-539

How should this be tested?

  1. Run a Spark paragraph to ensure the Spark remote interpreter process is running.
  2. kill -9 the Spark remote interpreter process.
  3. Run the paragraph again (it may show "broken pipe", or "connection refused" after ZEPPELIN-534 "Discard broken thrift Client instance" #575).
  4. Wait 30 seconds (or the remote interpreter connection timeout value) to let RemoteInterpreterProcess classify the process as timed out.
  5. Run the paragraph again (it shows org.apache.zeppelin.interpreter.InterpreterProcessHeartbeatFailedException to users).

Screenshots (if appropriate)

(Screenshot: 2015-12-29 7 34 04)

Questions

* Do the license files need updating? (No)
* Are there breaking changes for older versions? (No)
* Does this need documentation? (Maybe not)

@HeartSaVioR
Contributor Author

I couldn't create a unit test for this since it requires the remote interpreter process to be stopped or killed.
Please share your ideas if you have any.

@HeartSaVioR
Contributor Author

Discuss proper values for the heartbeat interval and the timeout check interval

Currently, RemoteInterpreterProcess sends a heartbeat every second and also checks for timeout every second.
Please let me know if you think that is too often.
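
For reference, below is a minimal sketch of what a 1-second heartbeat sender plus timeout check could look like; the class, field, and method names are assumptions for illustration, and only the idea of the thrift ping() call comes from this PR.

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  // Sketch only: names are illustrative, not the actual PR code.
  class HeartbeatSketch {
    // thrift client stand-in; ping() is the new thrift call proposed in this PR
    interface Client { String ping() throws Exception; }

    private volatile long lastHeartbeatTimestamp = System.currentTimeMillis();
    private final ScheduledExecutorService heartbeatExecutor =
        Executors.newSingleThreadScheduledExecutor();

    void startHeartbeat(final Client client) {
      heartbeatExecutor.scheduleAtFixedRate(new Runnable() {
        @Override
        public void run() {
          try {
            if ("pong".equals(client.ping())) {
              lastHeartbeatTimestamp = System.currentTimeMillis();
            }
          } catch (Exception e) {
            // leave the timestamp untouched; the timeout check below will notice
          }
        }
      }, 0, 1, TimeUnit.SECONDS);  // send a heartbeat every 1 sec
    }

    boolean isTimedOut(long timeoutMillis) {
      return System.currentTimeMillis() - lastHeartbeatTimestamp > timeoutMillis;
    }
  }

With something shaped like this, the send interval and the timeout threshold both live in one place, so they are easy to tune if 1 second turns out to be too often.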

Discuss how to let users know when the remote interpreter has timed out

Currently, Zeppelin can notify users by showing InterpreterProcessHeartbeatFailedException when executing any client-related work.
Please share your ideas to improve this feature.
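
Roughly, the guard before any client-related work could look like the following sketch; the field and method names are illustrative, and only InterpreterProcessHeartbeatFailedException itself is introduced by this PR.

  // Sketch only: surface a clear exception instead of a raw "broken pipe" error.
  class HeartbeatGuardSketch {
    static class InterpreterProcessHeartbeatFailedException extends RuntimeException {
      InterpreterProcessHeartbeatFailedException(String message) { super(message); }
    }

    private final HeartbeatSketch heartbeat;     // from the sketch above
    private final long connectTimeoutMillis;     // e.g. the interpreter connect timeout setting

    HeartbeatGuardSketch(HeartbeatSketch heartbeat, long connectTimeoutMillis) {
      this.heartbeat = heartbeat;
      this.connectTimeoutMillis = connectTimeoutMillis;
    }

    // call this before borrowing a thrift client for any client-related work
    void checkProcessAlive() {
      if (heartbeat.isTimedOut(connectTimeoutMillis)) {
        throw new InterpreterProcessHeartbeatFailedException(
            "Remote interpreter process did not respond to heartbeat within "
                + connectTimeoutMillis + " ms");
      }
    }
  }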

Discuss possible ways to restore the remote interpreter back to normal

At first, I thought it would be good to automatically force-kill and re-run the remote interpreter process when it times out. But the remote interpreter process is stateful, so users may not want that.
#480 could be a good way to let users restart the interpreter by hand if a problem occurs.
Please share your ideas to improve this feature.

@HeartSaVioR
Contributor Author

We can even provide a way to validate a Client instance with the new "ping".

  1. Implement validateObject() in ClientFactory.

  @Override
  public boolean validateObject(PooledObject<Client> p) {
    final Client client = p.getObject();

    try {
      // the connection is considered valid only if the ping round-trip succeeds
      return client.ping().equals("pong");
    } catch (Exception e) {
      // any thrift error means the underlying connection is broken
      return false;
    }
  }

  2. Build a strategy to exclude invalid objects by providing a GenericObjectPoolConfig to GenericObjectPool.

https://commons.apache.org/proper/commons-pool/apidocs/org/apache/commons/pool2/impl/GenericObjectPoolConfig.html

We can even set max idle, min idle, and max total, which helps control the total number of thrift connections per remote interpreter process.
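
For instance, wiring that up with commons-pool2 could look roughly like the sketch below; the class name, the pool limits, and the ClientFactory constructor arguments are placeholders rather than values from this PR.

  import org.apache.commons.pool2.impl.GenericObjectPool;
  import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

  // Sketch only: configure the pool so it calls validateObject() and evicts broken clients.
  class ClientPoolSketch {
    GenericObjectPool<Client> createClientPool(String host, int port) {
      GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
      poolConfig.setMaxTotal(10);        // cap thrift connections per remote interpreter process
      poolConfig.setMaxIdle(5);
      poolConfig.setMinIdle(1);
      poolConfig.setTestOnBorrow(true);  // run validateObject() (the ping) before lending a client
      poolConfig.setTestWhileIdle(true); // also validate idle clients from the evictor thread
      poolConfig.setTimeBetweenEvictionRunsMillis(30 * 1000L);
      return new GenericObjectPool<Client>(new ClientFactory(host, port), poolConfig);
    }
  }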

If we think it would be better to have, I'll address it in this PR, or in another PR (if we don't want to merge this one).

* introduce "ping" function to thrift
* every remote interpreter processes will have two additional threads
  * send "ping" to check that remote interpreter process is able to respond
  * check last heartbeat timestamp and determine it's timed out
* introduce InterpreterProcessHeartbeatFailedException
  * thrown when remote interpreter process is determined to timed out
@HeartSaVioR
Contributor Author

Actually, this is an adaptation of apache/storm#286 from Apache Storm.

@corneadoug
Contributor

@HeartSaVioR It's been quite some time, should we close this PR?

@HeartSaVioR
Contributor Author

OK. This has already gone stale, so I'll close it.
@corneadoug Thanks for noticing!

