Skip to content

Conversation

@lzx404243
Copy link
Collaborator

@lzx404243 lzx404243 commented Apr 18, 2023

As titled. This prevents the SDK_API_HttpParentProxySet_Fail regression test from crashing.

This resolves #9576. (SDK_API_HttpParentProxySet_Fail is the only test that's crashing/failing)

@lzx404243 lzx404243 marked this pull request as ready for review April 18, 2023 21:03
@bneradt bneradt added the Tests label Apr 18, 2023
@bneradt bneradt added this to the 10.0.0 milestone Apr 18, 2023
@serrislew
Copy link
Contributor

Why is proxy.config.http.parent_proxy.total_connect_attempts used instead of proxy.config.http.parent_proxy.fail_threshold? It seems like for the os case, proxy.config.http.connect_attempts_max_retries is the max connection retries before marking an origin as dead which is what the latter config fail_threshold uses before considering the parent unavailable

@serrislew serrislew self-requested a review April 18, 2023 22:56
@ywkaras
Copy link
Contributor

ywkaras commented Apr 21, 2023

It looks like this logic is not used when next hop is determined using strategies.yaml. If the consensus is parent.config is headed for deprecation, we should consider that is deciding how much effort to put into this.

@lzx404243 lzx404243 marked this pull request as draft April 21, 2023 20:40
@lzx404243 lzx404243 force-pushed the regression-r3-failure branch from 9df3972 to 7f10f34 Compare April 24, 2023 17:16
@lzx404243 lzx404243 marked this pull request as ready for review April 24, 2023 17:17
@lzx404243 lzx404243 changed the title Return parent max retry count when connecting to a parent proxy Return current parent max retry count for parent proxy Apr 24, 2023
@serrislew
Copy link
Contributor

Maybe the docs need to be updated then since it seems like these configs are used for attempts before failing a request, not making the parent down. Or is that implied?

It seems like for the parents, there are 2 kinds of attempt limits: 1 for the connection attempts for a specific transaction before failing it and 1 for the connection attempts on a parent before making it unavailable. And from my previous comment, it seems like proxy.config.http.connect_attempts_max_retries for OS is used to determine how many attempts are needed before marking this origin dead, which is the latter attempt limit. If it's implied that after proxy.config.http.parent_proxy.per_parent_connect_attempts is reached and so we are done with this particular parent that we will mark it down, why do we have this separate config proxy.config.http.parent_proxy.fail_threshold?

// If the current attempt is already a multiple of ppca, get the next
// multiple.
cur_parent_max_attempts += ppca;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you want this if statement. If ppca is 10 and cur_attempts is 20, that means you're on the last try to connect to the second origin in the next hop protection group. If there's a non-retryable error, you don' t want to raise the number of attempts to 30. That would mean you'd never try the last origin in the next hop protection group.

Copy link
Collaborator Author

@lzx404243 lzx404243 Apr 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, when cur_attempts is already a multiple of ppca when calling this function, the calculation would always return the same cur_attempts, which leads to an assertion failure in increment() down the line.

I do see your point though and we don't want to skip parents. Question: when cur_attempts is 20, is it the last try for the second origin, or is it the first try for the last origin?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call:

s->current.attempts.increment(s->configured_connect_attempts_max_retries());
can't cause an assert. configured_connect_attempts_max_retries() can't be more than txn_conf->parent_connect_attempts. The test in the if statement at line 3644 ensures that an assert is impossible if execution reaches line 3644.

The only other call to the increment function is

s->current.attempts.increment(s->configured_connect_attempts_max_retries());
. Not clear under what circumstances this call is reached. The very large HttpTransact and HttpSM classes have little internal scoping or even encapsulation. It could takes days or weeks to analyze possible code paths. As I previously said, it's very hard to work on a source bases like this without taking leaps of faith and making reasonable assumptions. So I suggest making the reasonable assumption that a retry will be blocked if the number of retries is already at the maximum. To back that up, you could write long Au test(s) exercising parent.config (as I did for strategies.yaml). But if parent.config is soon to be deprecated, it's questionable if the effort involved is worth it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call:

s->current.attempts.increment(s->configured_connect_attempts_max_retries());

can't cause an assert

To clarify, the SDK_API_HttpParentProxySet_Fail regression test failed at this assert without the if statement. The problem doesn't seem to be current.attempts being too large, but the value from configured_connect_attempts_max_retries() being too small. Consider the following case:

  • current.attempts : 4
  • per_parent_connect_attempts: 2
  • parent_proxy.total_connect_attempts: 100

in this case, without the if block, the configured_connect_attempts_max_retries() would return 4 and pass that into increment(). Assertion in the increment() as after the increment, current.attempts goes pass the input "max".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. I think then you should remove the assert.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. But the same logic applies when calling maximize() with the configured_connect_attempts_max_retries():

current.attempts : 0
per_parent_connect_attempts: 2
parent_proxy.total_connect_attempts: 100

the configured_connect_attempts_max_retries() returns 0, which is passed into maximize(). As a result, current.attempts is unchanged after the call, while what we want is to "maximize" the attempt count so that the current parent is skipped. There seems to be an issue here too. Do you have any suggestions on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From looking at the code, I thought the attempt numbers are 1 to N. That is the number is 1 during the first attempt, not zero. So the attempt number would never be zero when maximize() is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see now, "retries" doesn't include the first try, so max tries = max retries + 1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about?

    MgmtInt
    configured_connect_attempts_max_retries() const
    {
      if (parent_params != nullptr) {
        auto ppca = txn_conf->per_parent_connect_attempts;
        auto tries = current.attempts.get() + 1;
        // Round g up to next multiple of ppca.
        auto ru = ((tries + ppca - 1) / ppca) * ppca;
        return std::min(ru, txn_conf->parent_connect_attempts) - 1;
      }
      return txn_conf->connect_attempts_max_retries;
    }

But this would also require removing the parameter and assert in CurrentInfo::Attempts::increment(),
and changing line 3644 in HttpTransact.cc to:

    if (s->current.attempts.get() < (s->txn_conf->parent_connect_attempts - 1)) {

and changing line 3649 in HttpTransact.cc to:

      if (s->current.attempts.get() % s->txn_conf->per_parent_connect_attempts != 0) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep after some more digging and thoughts, I believe this is correct. I reach out to Susan and she confirms that the current HttpTransact::_CurrentInfo.attempt is really the retry attempt. Based on this understanding, what you suggested here makes a lot of sense and I think we are good to go.

@lzx404243
Copy link
Collaborator Author

@serrislew Thanks for the question. I share the same question as to what the fail_threshold does and the relationship with the other parent configs and I believe the doc can be improved to clarify that.

This configured_connect_attempts_max_retries() function is currently used when setting the attempt counts. We choose to return values based on the parent_proxy.total_connect_attempt and parent_proxy.per_parent_connect_attempts because we see the attempt count is only compared against those values, such as in https://github.com/apache/trafficserver/blob/master/proxy/http/HttpTransact.cc#L3644 .

As for parent_proxy.fail_threshold, guessing from the source code I think it might be a next-tier accounting that takes into account a timing component. From https://github.com/apache/trafficserver/blob/master/proxy/ParentSelectionStrategy.cc#L90-L93 , it seems a parent can be marked "down" several times and is only set to unavailable when it passed the fail_threshold.

Based on the above, my impression is that parent_proxy.total_connect_attempt and parent_proxy.per_parent_connect_attempts configs work with the attempt count to determine whether to use parent proxy and when to switch to the next parent. parent_proxy.fail_threshold seems to be separate from this.

increment()
{
++_v;
ink_assert(_v <= configured_connect_attempts_max_retries);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wouldn't this assert apply now with the new changes? seems like increment doesn't have to change now that you check attempts - 1 before you increment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great question. The check before the increment ensures that attempts - 1 < the global parent_max_retries. The problem with this assert is that the argument passed in configured_connect_attempts_max_retries may not be the global parent max retry, but the max retry count for the current parent. Currently, the argument is set to the the output of s->configured_connect_attempts_max_retries(), which is returning:

  • if connecting to the OS, returns the txn_conf->connect_attempts_max_retries
  • if connecting to a parent proxy, returns the max retries for the current parent(bounded by the global max retry for all parents)
    In the second case, the assert can fail as after the increment, it went over the max for the current parent(which should be allowed as we switch to the next parent).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but doesn't this line auto cur_parent_max_attempts = ((cur_tries + ppca - 1) / ppca) * ppca; take into account the max retries for the parent but also in relation to the total? even when we hit the max retries for the current parent, don't your new changes in configured_connect_attempts_max_retries() account for the next parent. each increment should be protected. im trying to find the edge case when this assert fails, could you provide an example

Copy link
Contributor

@ywkaras ywkaras May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert can be triggered if parent_connect_attempts > (connect_attempts_max_retries + 1). If you can find checking logic in the config code that enforces this, it would be safe to restore the assert.

This was the original version of increment():

      void
      increment()
      {
        ++_v;
        ink_assert(_v <= configured_connect_attempts_max_retries());
      }

This was CurrentInfo::configured_connect_attempts_max_retries():

inline MgmtInt
HttpTransact::CurrentInfo::Attempts::configured_connect_attempts_max_retries() const
{
  return reinterpret_cast<const HttpTransact::State *>(reinterpret_cast<const char *>(this) -
                                                       offsetof(HttpTransact::State, current.attempts))
    ->configured_connect_attempts_max_retries();
}

Even though this code was verified in Yahoo prod, and this sort of pointer arithmetic is how inheritance is implemented in C++, this code was rejected in review. I never would have put in the assert if I thought it would require adding a parameter to increment().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even when we hit the max retries for the current parent, don't your new changes in configured_connect_attempts_max_retries() account for the next parent?

configured_connect_attempts_max_retries() normally would be increment to the next multiple of ppca(accounts for the next parent), except for the case where the attempt/(retry + 1) is already a multiple of ppca(current parent hits max retries), in which case the calculation basically returns the same attempt/retry count as before.

im trying to find the edge case when this assert fails, could you provide an example

The above logic make sense when we call maximize() with the configured_connect_attempts_max_retries() to disable further retries, since we already hit the max retries for the current parent, we are not setting it to anything larger.

However, it will fail the assert in increment() with the configured_connect_attempts_max_retries(), since n = configured_connect_attempts_max_retries(n).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I misunderstood when the actual attempts_max_retries was being called

return;
}

if (s->current.attempts.get() < s->txn_conf->parent_connect_attempts) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this an off by one error then? if s->txn_conf->parent_connect_attempts is 3 so we can try 3 times and we are on the 3rd attempt but since we don't increment yet, s->current.attempts.get() = 2 so that conditional would fail so we don't try the 3rd time?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think the confusion arises from the naming. s->current.attempts.get() is actually the retry count, so when s->current.attempts.get() = 2, it has retried for 2 times already(3 attempts in total).

The naming confusion is addressed in another PR: #9655 .

@lzx404243 lzx404243 force-pushed the regression-r3-failure branch from f2af355 to e0f4cc2 Compare May 2, 2023 16:53
@ywkaras
Copy link
Contributor

ywkaras commented May 2, 2023

@serrislew Zhengxi has been working on this in the context of getting -R 3 regression testing working. Yahoo doesn't use parent.config. Comcast is moving off of it or has moved off of it to strategies.yaml. So I think Apple should have a free hand in determining how the relevant configs should work. You could maybe bring it up in the Monday meeting or have Leif do it.

@bneradt
Copy link
Contributor

bneradt commented May 2, 2023

[approve ci autest]

@bneradt bneradt merged commit a46778a into apache:master May 2, 2023
cmcfarlen pushed a commit to cmcfarlen/trafficserver that referenced this pull request Jun 3, 2024
* asf/master: (33 commits)
  Add error log for invalid OCSP response (apache#9674)
  Add new settings to specify TLS versions to use (apache#9667)
  Remove flask from tests/Pipfile (apache#9688)
  Doc: Add example of --enable-lto build option with LLVM (apache#9654)
  Added Zhengxi to the asf contributors (apache#9685)
  Don't build native QUIC implementation (apache#9670)
  Stabilize autest tls_hooks17 (apache#9671)
  Cleanup: remove ts::buffer from HostDB. (apache#9677)
  Fix leak in MultiTextMod in ControlBase. (apache#9675)
  Cleanup: remove TsBuffer.h from MIME.cc (apache#9661)
  Cleanup: remove ts::Buffer from ControlBase. (apache#9664)
  setup pre-commit hook at cmake generation time (apache#9669)
  Updated parent retry attempt logic (apache#9620)
  Try to do less work in hot function HttpHookState::getNext (apache#9660)
  cmake build, fixed warning using older openssl APIs (apache#9648)
  rename attempts to retry_attempts (apache#9655)
  OCSP: Fix unitialized variable error. (apache#9662)
  Add support for CONNECT method on HTTP/2 connection (apache#9616)
  Remove deprecated debug output functions from 10 source files. (apache#9657)
  Remove deprecated debug output functions from 14 source files. (apache#9658)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression tests fail with -R 3

4 participants