HADOOP-19729. [ABFS][Perf] Network Profiling for Tailing Requests and Killing Bad Connections Proactively #8043
base: trunk
💔 -1 overall
This message was automatically generated.
  }

  public boolean isTailLatencyRequestTimeoutEnabled() {
    return isTailLatencyRequestTimeoutEnabled && isTailLatencyTrackerEnabled
first check should be for isTailLatencyTrackerEnabled
    return tailLatencyAnalysisWindowInMillis;
  }

  public int getTailLatencyPercentileComputationIntervalInMillis() {
Name should be shortened
  public static final boolean DEFAULT_FS_AZURE_ENABLE_CREATE_BLOB_IDEMPOTENCY = true;

  public static final boolean DEFAULT_FS_AZURE_ENABLE_TAIL_LATENCY_TRACKER = false;
Do we not want this feature to be enabled by default?
Description of PR
Jira: https://issues.apache.org/jira/browse/HADOOP-19729
It has been observed that some requests take much longer than expected to complete and slow down the whole workload. Such requests are known as tailing requests. They can be slow for a number of reasons, the most prominent being a bad network connection. The ABFS driver caches network connections, so keeping bad connections in the cache and reusing them can hurt performance.
This effort tries to identify such connections and close them so that new, healthy connections can be established and performance improves. It has two parts:
Identifying Tailing Requests: This involves profiling all network calls and computing percentile values efficiently. By default, p99 is treated as the tail latency, and any subsequent request that takes longer than the tail latency is considered a tailing request.
Proactively Killing Socket Connections: With the Apache client, we can now kill the socket connection and fail the tailing request. Such failures are not thrown back to the user; the request is retried immediately, without any sleep, on a different socket connection.
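
The first part (identifying tailing requests) could look roughly like the sketch below. It is only an illustration, not the code in this patch: the class name `TailLatencySketch`, its methods, the count-based sample window, and the nearest-rank percentile method are all assumptions made for the example. The patch itself appears to use a time-based analysis window and a periodic computation interval, judging by the config getters in the diff.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/**
 * Hypothetical sketch of tail-latency detection; NOT the ABFS driver's classes.
 * Assumptions: a count-bounded sample window and nearest-rank percentiles.
 */
class TailLatencySketch {
  private final double percentile;  // e.g. 99.0 for p99
  private final int maxSamples;     // window size (count-based here for simplicity)
  private final Deque<Long> samples = new ArrayDeque<>();
  // Until a threshold has been computed, no request is treated as tailing.
  private long thresholdMillis = Long.MAX_VALUE;

  TailLatencySketch(double percentile, int maxSamples) {
    if (percentile <= 0 || percentile > 100 || maxSamples <= 0) {
      throw new IllegalArgumentException("invalid percentile or window size");
    }
    this.percentile = percentile;
    this.maxSamples = maxSamples;
  }

  /** Record the latency of one completed network call, evicting the oldest sample if full. */
  synchronized void record(long latencyMillis) {
    if (samples.size() == maxSamples) {
      samples.removeFirst();
    }
    samples.addLast(latencyMillis);
  }

  /** Recompute the threshold from the current window; meant to be called periodically. */
  synchronized long recomputeThreshold() {
    if (samples.isEmpty()) {
      return thresholdMillis;
    }
    long[] sorted = samples.stream().mapToLong(Long::longValue).toArray();
    Arrays.sort(sorted);
    // Nearest-rank method: smallest value with at least `percentile` percent of samples at or below it.
    int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
    thresholdMillis = sorted[Math.max(rank, 1) - 1];
    return thresholdMillis;
  }

  /** A request is considered tailing once it has taken longer than the last computed threshold. */
  synchronized boolean isTailing(long latencyMillis) {
    return latencyMillis > thresholdMillis;
  }
}
```

Sorting only when the threshold is recomputed, rather than on every request, keeps the per-request cost to a single queue append. In this sketch a caller would check `isTailing(elapsedMillis)` for an in-flight request and, if it returns true, close that connection and retry on another one.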
How was this patch tested?
New tests were added for both profiling and connection killing.