Skip to content

Conversation

@bneradt
Copy link
Contributor

@bneradt bneradt commented Mar 7, 2024

When I ported #8747 into HTTP/2 to origin logic, I missed setting the watermark on outgoing connections. This essentially made it so that regardless of configuration, POST requests or the like would always send every little bit of DATA frames regardless of the configured watermark. The watermark was effectively zero on the server side. This patch fixes that by appying the configured watermark.

This patch also fixes the HTTP/2 write_time_threshold configuration so it consistently wakes up ATS to transmit the frames held up behidn write_buffer_threshold. If the write_buffer_threshold is not met, then I noticed ATS never woke up to send the accumulated frames until a xmit with a flush was sent at the end of the transaction. This ensures that the frames are transmitted per the write_time_threshold by scheduling an event for the retransmission.


For Review

I've kept the parts of this review in separate commits to hopefully help the review of this.

@bneradt bneradt added the HTTP/2 label Mar 7, 2024
@bneradt bneradt added this to the 10.1.0 milestone Mar 7, 2024
@bneradt bneradt requested a review from maskit March 7, 2024 15:42
@bneradt bneradt self-assigned this Mar 7, 2024
@maskit
Copy link
Member

maskit commented Mar 7, 2024

This patch also fixes the HTTP/2 write_time_threshold configuration so it consistently wakes up ATS to transmit the frames held up behidn write_buffer_threshold. If the write_buffer_threshold is not met, then I noticed ATS never woke up to send the accumulated frames until a xmit with a flush was sent at the end of the transaction. This ensures that the frames are transmitted per the write_time_threshold by scheduling an event for the retransmission.

Is it for the case that there is long pauses during data transfer but the stream is not tunneling?

I'm not keen on adding the new event for retransmission, and wondering if we can use the flush flag instead. Specifically:

  • Add "push" flag to Http2DataFrame to indicate that the frame should not be buffered (like TCP PUSH flag)
  • Set push flag on if _reader->is_read_avail_more_than(0) gets false in write_to_buffer()
  • Set flush flag on if the push flag is on

I'm not sure if this really works. It may kill the thresholds if the origin/cache is slow and the reader always becomes empty.

The current code for the retransmission event seems to cause a crash, if the session is destroyed before the event gets processed, although the chance is low. The scheduled event probably needs to be canceled when flush() is called and when the session closes, at a minimum.

@bneradt bneradt force-pushed the fix_http2_buffering_timeout branch from 9660d90 to 82f7121 Compare March 7, 2024 18:59
@bneradt
Copy link
Contributor Author

bneradt commented Mar 7, 2024

Thank you for the thoughts.

This patch also fixes the HTTP/2 write_time_threshold configuration so it consistently wakes up ATS to transmit the frames held up behidn write_buffer_threshold. If the write_buffer_threshold is not met, then I noticed ATS never woke up to send the accumulated frames until a xmit with a flush was sent at the end of the transaction. This ensures that the frames are transmitted per the write_time_threshold by scheduling an event for the retransmission.

Is it for the case that there is long pauses during data transfer but the stream is not tunneling?

Definitely not for the tunneling case - we don't attempt any buffering when tunneling. What we saw was not actually long pauses, but kind of the opposite. An internal property had to turn the write buffering off because they had many spread out but very small DATA frames which didn't exceed the write buffer threshold, but the write timeout never appropriately triggered after 10ms to flush it out. This kind of "many small DATA frames" is what the new autest reproduces. Before this patch, the frames did eventually get flushed when transactions completed since we force flush on END_STREAM (and probably other cases), but their perceived experience was "jerky" rather than smooth. This patch makes it more certain that we wake up to send the data.

It is possible, btw, that this is an HTTP/2 to origin only issue. I believe the property was using h2 on the origin side. But I'm not sure and unfortunately it has been a few months since I helped them with the issue.

The current code for the retransmission event seems to cause a crash, if the session is destroyed before the event gets processed, although the chance is low. The scheduled event probably needs to be canceled when flush() is called and when the session closes, at a minimum.

Yes, thank you for this feedback. That's the conclusion I came to as well and changed the logic to store the event and cancel when appropriate (both when a flush happens, and when the appropriate owning objects are destroyed).

@maskit
Copy link
Member

maskit commented Mar 7, 2024

I still don't understand the problematic case. Are you saying write_time_threshold is not working? As long as WRITE_READY event is issued appropriately, the data in the buffer should be flushed even without END_STREAM flag at the end. I don't get why new event is needed. In other words, the new XMIT event could try to send data where the connection or session is not really WRITE_READY.

A possible bug is that WRITE_READY is not scheduled for some reason. Maybe reenable() is not called.

@bneradt
Copy link
Contributor Author

bneradt commented Mar 7, 2024

I still don't understand the problematic case. Are you saying write_time_threshold is not working?

Yes, sorry if I wasn't clear. To be concise: Without this patch, write_size_threshold can hold up a frame, but write_time_threshold (default of 100ms) is not functionally, or at least not consistenly, taking place.

The new autest demonstrates this by sending thousands of small DATA frames, one every ms. The server should get chunks of data at least every write_size_threshold (set to 10ms for the test). Instead, without this patch, it gets a DATA frame or a few at the beginning, then none of the rest till the end.

As long as WRITE_READY event is issued appropriately, the data in the buffer should be flushed even without END_STREAM flag at the end.

A possible bug is that WRITE_READY is not scheduled for some reason. Maybe reenable() is not called.

Calling flush in the new retransmit event will call reenable. Adding this event scheduled in write_time_threshold guarantees that reenable will be called via flush.

The new event seems like a consistent way to guarantee that the data gets flushed in the appropriate amount of time, within write_time_threshold. But if there's a better way to accomplish this without the event, I'm definitely open to that. Why is the WRITE_READY event guaranteed to be triggered presently about write_time_threshold ms after write_size_threshold holds up a packet? Currently that doesn't happen, but maybe there's something I can do to address that?

@bneradt bneradt force-pushed the fix_http2_buffering_timeout branch 3 times, most recently from a981c01 to 7f2235d Compare March 25, 2024 20:04
@bryancall bryancall self-requested a review March 25, 2024 22:38
@bneradt bneradt force-pushed the fix_http2_buffering_timeout branch from 7f2235d to 671c222 Compare March 26, 2024 01:27
@bneradt
Copy link
Contributor Author

bneradt commented Mar 28, 2024

[approve ci fedora]

@bneradt bneradt force-pushed the fix_http2_buffering_timeout branch from 671c222 to c57a58a Compare April 2, 2024 15:49
Copy link
Member

@maskit maskit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with @moonchen , and now I get why we have to do this. I can live with this change, but my view is that we have to do this because NetVC does not provide a generic way to deal with the situation. We only have the buffering stuff in H2 at this moment, but buffering is not an H2 specific thing and it can happen on any application. I'd say an ideal way to deal with the issue is to add the functionality to NetVC / iocore layer.

@bneradt
Copy link
Contributor Author

bneradt commented Apr 3, 2024

Thank you @maskit and @moonchen for the thoughtful review comments and feedback.

@bneradt bneradt merged commit 66940f5 into apache:master Apr 3, 2024
@bneradt bneradt deleted the fix_http2_buffering_timeout branch April 3, 2024 21:01
@cmcfarlen cmcfarlen modified the milestones: 10.1.0, 10.0.0 Apr 5, 2024
@cmcfarlen
Copy link
Contributor

Cherry-picked to v10.0.x

cmcfarlen pushed a commit that referenced this pull request Apr 5, 2024
When I ported #8747 into HTTP/2 to origin logic, I missed setting the watermark on outgoing connections. This essentially made it so that regardless of configuration, POST requests or the like would always send every little bit of DATA frames regardless of the configured watermark. The watermark was effectively zero on the server side. This patch fixes that by appying the configured watermark.

This patch also fixes the HTTP/2 write_time_threshold configuration so it consistently wakes up ATS to transmit the frames held up behidn write_buffer_threshold. If the write_buffer_threshold is not met, then I noticed ATS never woke up to send the accumulated frames until a xmit with a flush was sent at the end of the transaction. This ensures that the frames are transmitted per the write_time_threshold by scheduling an event for the retransmission.

Here's a list of changes made in this patch:

* Adding a write_threshold test
* Set buffer watermark for origin h2 connections
* HTTP2_SESSION_EVENT_XMIT -> HTTP2_SESSION_EVENT_PRIO
* schedule flushing event for write_time_threshold

(cherry picked from commit 66940f5)
bneradt added a commit to bneradt/trafficserver that referenced this pull request Apr 8, 2024
When working on the patch for apache#11143, I accidentally left in some Note
log messages that should have been cleaned up before I merged in the
changes. This removes those noisy Note messages.
bneradt added a commit that referenced this pull request Apr 8, 2024
When working on the patch for #11143, I accidentally left in some Note
log messages that should have been cleaned up before I merged in the
changes. This removes those noisy Note messages.
cmcfarlen pushed a commit that referenced this pull request Apr 9, 2024
When working on the patch for #11143, I accidentally left in some Note
log messages that should have been cleaned up before I merged in the
changes. This removes those noisy Note messages.

(cherry picked from commit c0a42a6)
JosiahWI pushed a commit to JosiahWI/trafficserver that referenced this pull request Jul 8, 2024
When working on the patch for apache#11143, I accidentally left in some Note
log messages that should have been cleaned up before I merged in the
changes. This removes those noisy Note messages.

(cherry picked from commit c0a42a6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: picked-10.0.0

Development

Successfully merging this pull request may close these issues.

3 participants