-
Notifications
You must be signed in to change notification settings - Fork 845
Return current parent max retry count for parent proxy #9620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Why is |
|
It looks like this logic is not used when next hop is determined using strategies.yaml. If the consensus is parent.config is headed for deprecation, we should consider that is deciding how much effort to put into this. |
9df3972 to
7f10f34
Compare
|
Maybe the docs need to be updated then since it seems like these configs are used for attempts before failing a request, not making the parent down. Or is that implied? It seems like for the parents, there are 2 kinds of attempt limits: 1 for the connection attempts for a specific transaction before failing it and 1 for the connection attempts on a parent before making it unavailable. And from my previous comment, it seems like |
proxy/http/HttpTransact.h
Outdated
| // If the current attempt is already a multiple of ppca, get the next | ||
| // multiple. | ||
| cur_parent_max_attempts += ppca; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you want this if statement. If ppca is 10 and cur_attempts is 20, that means you're on the last try to connect to the second origin in the next hop protection group. If there's a non-retryable error, you don' t want to raise the number of attempts to 30. That would mean you'd never try the last origin in the next hop protection group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this, when cur_attempts is already a multiple of ppca when calling this function, the calculation would always return the same cur_attempts, which leads to an assertion failure in increment() down the line.
I do see your point though and we don't want to skip parents. Question: when cur_attempts is 20, is it the last try for the second origin, or is it the first try for the last origin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call:
trafficserver/proxy/http/HttpTransact.cc
Line 3646 in 6e8933f
| s->current.attempts.increment(s->configured_connect_attempts_max_retries()); |
configured_connect_attempts_max_retries() can't be more than txn_conf->parent_connect_attempts. The test in the if statement at line 3644 ensures that an assert is impossible if execution reaches line 3644.
The only other call to the increment function is
trafficserver/proxy/http/HttpTransact.cc
Line 3865 in 6e8933f
| s->current.attempts.increment(s->configured_connect_attempts_max_retries()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call:
trafficserver/proxy/http/HttpTransact.cc
Line 3646 in 6e8933f
s->current.attempts.increment(s->configured_connect_attempts_max_retries()); can't cause an assert
To clarify, the SDK_API_HttpParentProxySet_Fail regression test failed at this assert without the if statement. The problem doesn't seem to be current.attempts being too large, but the value from configured_connect_attempts_max_retries() being too small. Consider the following case:
current.attempts: 4per_parent_connect_attempts: 2parent_proxy.total_connect_attempts: 100
in this case, without the if block, the configured_connect_attempts_max_retries() would return 4 and pass that into increment(). Assertion in the increment() as after the increment, current.attempts goes pass the input "max".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see. I think then you should remove the assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. But the same logic applies when calling maximize() with the configured_connect_attempts_max_retries():
current.attempts : 0
per_parent_connect_attempts: 2
parent_proxy.total_connect_attempts: 100
the configured_connect_attempts_max_retries() returns 0, which is passed into maximize(). As a result, current.attempts is unchanged after the call, while what we want is to "maximize" the attempt count so that the current parent is skipped. There seems to be an issue here too. Do you have any suggestions on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From looking at the code, I thought the attempt numbers are 1 to N. That is the number is 1 during the first attempt, not zero. So the attempt number would never be zero when maximize() is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see now, "retries" doesn't include the first try, so max tries = max retries + 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about?
MgmtInt
configured_connect_attempts_max_retries() const
{
if (parent_params != nullptr) {
auto ppca = txn_conf->per_parent_connect_attempts;
auto tries = current.attempts.get() + 1;
// Round g up to next multiple of ppca.
auto ru = ((tries + ppca - 1) / ppca) * ppca;
return std::min(ru, txn_conf->parent_connect_attempts) - 1;
}
return txn_conf->connect_attempts_max_retries;
}
But this would also require removing the parameter and assert in CurrentInfo::Attempts::increment(),
and changing line 3644 in HttpTransact.cc to:
if (s->current.attempts.get() < (s->txn_conf->parent_connect_attempts - 1)) {
and changing line 3649 in HttpTransact.cc to:
if (s->current.attempts.get() % s->txn_conf->per_parent_connect_attempts != 0) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep after some more digging and thoughts, I believe this is correct. I reach out to Susan and she confirms that the current HttpTransact::_CurrentInfo.attempt is really the retry attempt. Based on this understanding, what you suggested here makes a lot of sense and I think we are good to go.
|
@serrislew Thanks for the question. I share the same question as to what the This As for Based on the above, my impression is that |
| increment() | ||
| { | ||
| ++_v; | ||
| ink_assert(_v <= configured_connect_attempts_max_retries); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why wouldn't this assert apply now with the new changes? seems like increment doesn't have to change now that you check attempts - 1 before you increment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great question. The check before the increment ensures that attempts - 1 < the global parent_max_retries. The problem with this assert is that the argument passed in configured_connect_attempts_max_retries may not be the global parent max retry, but the max retry count for the current parent. Currently, the argument is set to the the output of s->configured_connect_attempts_max_retries(), which is returning:
- if connecting to the OS, returns the
txn_conf->connect_attempts_max_retries - if connecting to a parent proxy, returns the max retries for the current parent(bounded by the global max retry for all parents)
In the second case, the assert can fail as after the increment, it went over the max for the current parent(which should be allowed as we switch to the next parent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but doesn't this line auto cur_parent_max_attempts = ((cur_tries + ppca - 1) / ppca) * ppca; take into account the max retries for the parent but also in relation to the total? even when we hit the max retries for the current parent, don't your new changes in configured_connect_attempts_max_retries() account for the next parent. each increment should be protected. im trying to find the edge case when this assert fails, could you provide an example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assert can be triggered if parent_connect_attempts > (connect_attempts_max_retries + 1). If you can find checking logic in the config code that enforces this, it would be safe to restore the assert.
This was the original version of increment():
void
increment()
{
++_v;
ink_assert(_v <= configured_connect_attempts_max_retries());
}
This was CurrentInfo::configured_connect_attempts_max_retries():
inline MgmtInt
HttpTransact::CurrentInfo::Attempts::configured_connect_attempts_max_retries() const
{
return reinterpret_cast<const HttpTransact::State *>(reinterpret_cast<const char *>(this) -
offsetof(HttpTransact::State, current.attempts))
->configured_connect_attempts_max_retries();
}
Even though this code was verified in Yahoo prod, and this sort of pointer arithmetic is how inheritance is implemented in C++, this code was rejected in review. I never would have put in the assert if I thought it would require adding a parameter to increment().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even when we hit the max retries for the current parent, don't your new changes in
configured_connect_attempts_max_retries()account for the next parent?
configured_connect_attempts_max_retries() normally would be increment to the next multiple of ppca(accounts for the next parent), except for the case where the attempt/(retry + 1) is already a multiple of ppca(current parent hits max retries), in which case the calculation basically returns the same attempt/retry count as before.
im trying to find the edge case when this assert fails, could you provide an example
The above logic make sense when we call maximize() with the configured_connect_attempts_max_retries() to disable further retries, since we already hit the max retries for the current parent, we are not setting it to anything larger.
However, it will fail the assert in increment() with the configured_connect_attempts_max_retries(), since n = configured_connect_attempts_max_retries(n).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I misunderstood when the actual attempts_max_retries was being called
proxy/http/HttpTransact.cc
Outdated
| return; | ||
| } | ||
|
|
||
| if (s->current.attempts.get() < s->txn_conf->parent_connect_attempts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this an off by one error then? if s->txn_conf->parent_connect_attempts is 3 so we can try 3 times and we are on the 3rd attempt but since we don't increment yet, s->current.attempts.get() = 2 so that conditional would fail so we don't try the 3rd time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think the confusion arises from the naming. s->current.attempts.get() is actually the retry count, so when s->current.attempts.get() = 2, it has retried for 2 times already(3 attempts in total).
The naming confusion is addressed in another PR: #9655 .
f2af355 to
e0f4cc2
Compare
|
@serrislew Zhengxi has been working on this in the context of getting -R 3 regression testing working. Yahoo doesn't use parent.config. Comcast is moving off of it or has moved off of it to strategies.yaml. So I think Apple should have a free hand in determining how the relevant configs should work. You could maybe bring it up in the Monday meeting or have Leif do it. |
|
[approve ci autest] |
* asf/master: (33 commits) Add error log for invalid OCSP response (apache#9674) Add new settings to specify TLS versions to use (apache#9667) Remove flask from tests/Pipfile (apache#9688) Doc: Add example of --enable-lto build option with LLVM (apache#9654) Added Zhengxi to the asf contributors (apache#9685) Don't build native QUIC implementation (apache#9670) Stabilize autest tls_hooks17 (apache#9671) Cleanup: remove ts::buffer from HostDB. (apache#9677) Fix leak in MultiTextMod in ControlBase. (apache#9675) Cleanup: remove TsBuffer.h from MIME.cc (apache#9661) Cleanup: remove ts::Buffer from ControlBase. (apache#9664) setup pre-commit hook at cmake generation time (apache#9669) Updated parent retry attempt logic (apache#9620) Try to do less work in hot function HttpHookState::getNext (apache#9660) cmake build, fixed warning using older openssl APIs (apache#9648) rename attempts to retry_attempts (apache#9655) OCSP: Fix unitialized variable error. (apache#9662) Add support for CONNECT method on HTTP/2 connection (apache#9616) Remove deprecated debug output functions from 10 source files. (apache#9657) Remove deprecated debug output functions from 14 source files. (apache#9658) ...
As titled. This prevents the
SDK_API_HttpParentProxySet_Failregression test from crashing.This resolves #9576. (
SDK_API_HttpParentProxySet_Failis the only test that's crashing/failing)