TS-4796 Change UnixNetHandler to always bubble up epoll errors to the VConnection #947
80156d5 to 2c76cd2
|
FreeBSD build failed! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/659/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/554/ for details. |
|
FreeBSD build failed! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/661/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/556/ for details. |
|
Seems reasonable. I can't think why the errors shouldn't bubble up. We may need to be a bit careful on handling them though. |
|
I don't think this is the right approach. First, Second, the Third, the logic in Finally, does this change mean that |
|
/cc @oknet |
|
EVENTIO_ERROR means EPOLLHUP | EPOLLERR | EPOLLPRI.
EPOLLPRI means OOB data or that TCP URG is set. You will always receive EPOLLPRI together with EPOLLIN.
EPOLLHUP & EPOLLRDHUP
EPOLLERR covers both possible non-fatal errors on the socket fd, such as EAGAIN, EINTR and EWOULDBLOCK, and fatal errors.
When you receive EPOLLERR, it means there is an error on the socket fd, and there may also be data queued before this error. Therefore we should call read() and write() to figure out the actual meaning of the error.
Currently, NetHandler tries to perform read() and write() on the socket fd first. So the current implementation of NetHandler is enough to handle all of this and doesn't need to change.
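To illustrate the point above that only an actual socket call reveals what a poll error means, here is a minimal sketch. It is not ATS code: `ProbeResult` and `probe_socket` are hypothetical names, and the probe uses `recv()` with `MSG_PEEK | MSG_DONTWAIT` so that it neither consumes data nor blocks.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical classification of what a socket is really doing after the
// poller reported an error-ish event. Not an ATS type.
enum class ProbeResult { Data, Eof, Retry, Fatal };

// Peek at one byte without consuming it and without blocking.
ProbeResult
probe_socket(int fd)
{
  char byte;
  ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
  if (n > 0) {
    return ProbeResult::Data; // data is still queued ahead of any error
  }
  if (n == 0) {
    return ProbeResult::Eof; // orderly shutdown by the peer
  }
  int err = errno;
  if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR) {
    return ProbeResult::Retry; // non-fatal, try again later
  }
  return ProbeResult::Fatal; // e.g. ECONNRESET
}
```

A caller would typically drain the socket while the result is `Data` and only tear the connection down on `Eof` or `Fatal`.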
|
The InactivityCop will handle those NetVCs that read or write disabled. To disable read and write but still handle Timeout Event: To disable read and write and ignore Timeout Event: Note: |
|
The behavior I see on master (without this patch) is that ATS doesn't close the session when it gets the RST. From digging, I found that UnixNetHandler gets an EPOLLERR and attempts to add the vc to the write_ready queue, but since the vc isn't enabled for writing, the HTTPTunnel is never called to do the write.
So from your last 2 comments it sounds like a more correct approach would be to immediately schedule a read/write on the socket, to determine whether there was in fact an error. It also sounds like you are suggesting that the inactivity cop should get these sessions? In my tests I see that these sessions are killed either by the next attempt to write to the closed socket (when the origin sends more bytes later) or when the transaction hits the max inactivity timeout. Neither of these behaviors is what we want, since I already got an RST from the client.
So it seems that we need to somehow force-schedule a read/write in these error conditions. Do you have any pointers on how to do that?
|
According to your description, the HttpTunnel transfers data from the server session to the client session, ATS received a RST from the client, and the connection between ATS and the origin server is still alive.
In your scenario, the server session is the producer of the HttpTunnel, the client session is a consumer, and the cache session is the second consumer if the cache is enabled.
The HttpTunnel will not break if one consumer fails. The producer re-enables all consumers when it receives a READ_READY event. The producer is the master and all consumers are slaves: let the master trigger the slaves and isolate the broken slave.
Adding a netvc to the write_ready queue means there was a non-fatal error (e.g. EAGAIN) on the last write() call. It means write is enabled on the netvc.
Only the errno returned from read() and write() can be trusted.
A netvc can be closed immediately if vc->read.vio._cont and vc->write.vio._cont are both NULL.
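The producer/consumer relationship described above can be sketched with a toy model. This is only an illustration under assumptions: `Consumer`, `Tunnel` and all member names are hypothetical and do not correspond to the real HttpTunnel classes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy consumer: hypothetical, not the ATS HttpTunnelConsumer.
struct Consumer {
  std::string name;
  bool alive   = true;
  bool enabled = false;
};

// Toy tunnel with one implicit producer and several consumers.
struct Tunnel {
  std::vector<Consumer> consumers;

  std::size_t
  add_consumer(const std::string &name)
  {
    Consumer c;
    c.name = name;
    consumers.push_back(c);
    return consumers.size() - 1;
  }

  // Isolate a broken consumer; the tunnel itself keeps running.
  void
  consumer_failed(std::size_t i)
  {
    consumers[i].alive   = false;
    consumers[i].enabled = false;
  }

  // On a producer READ_READY, re-enable every consumer that is still alive.
  // Returns how many consumers were re-enabled.
  std::size_t
  on_producer_read_ready()
  {
    std::size_t n = 0;
    for (Consumer &c : consumers) {
      if (c.alive) {
        c.enabled = true;
        ++n;
      }
    }
    return n;
  }

  bool
  has_live_consumer() const
  {
    for (const Consumer &c : consumers) {
      if (c.alive) {
        return true;
      }
    }
    return false;
  }
};
```

In this model a client that sent a RST is simply marked failed, while the cache consumer keeps receiving data from the producer.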
|
Comments on the code: The netvc is always put in read_ready_list if it has EVENTIO_READ or EVENTIO_ERROR. In your case, read.enabled is 0. The vc is removed from read_ready_list. Below is my suggestion: and |
|
That all sounds reasonable :) I just pushed a new commit here which does effectively what you are suggesting, just on both the read and write sides (along with the few other little changes needed to make it work). I tested it and this covers my use case (since we are still bubbling the error up when we aren't enabled) and should continue functioning the same for all the rest of the cases (since I left them alone). |
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/691/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/587/ for details. |
iocore/net/UnixNet.cc
Outdated
    }
    if (err != EAGAIN && err != EINTR)
      vc->writeSignalError(this, err);
  } else if (!vc->write.enabled) {
Reset read.error or write.error if the error is non-fatal.
Keep the read.enabled or write.enabled check.
Suggested code:
else {
  if (vc->write.error) {
    ...
    if (err && err != EAGAIN && err != EINTR)
      vc->writeSignalError(this, err);
    else
      vc->write.error = 0;
  }
  if (!vc->write.enabled) {
    ...
  }
}
The same applies to READ.
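The control flow of this suggestion can be modeled in isolation as follows. This is a sketch under assumptions: `IoState`, `Action` and `handle_side` are hypothetical stand-ins for `vc->write` (or `vc->read`) and for the calls in UnixNet.cc, and returning right after signaling a fatal error is a choice made here, since the suggestion elides that part.

```cpp
#include <cerrno>

// Hypothetical stand-in for vc->read / vc->write state.
struct IoState {
  bool enabled = false;
  bool error   = false;
};

// What the caller should do next. Hypothetical, for illustration only.
enum class Action { None, SignalError, Disable };

// err is the errno obtained for the socket (0 if none).
Action
handle_side(IoState &s, int err)
{
  if (s.error) {
    if (err != 0 && err != EAGAIN && err != EINTR) {
      return Action::SignalError; // fatal: bubble up to the continuation
    }
    s.error = false; // non-fatal: reset the flag
  }
  if (!s.enabled) {
    return Action::Disable; // keep the existing enabled check
  }
  return Action::None;
}
```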
Done :)
73933e9 to 84934a8
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/606/ for details. |
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/711/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/607/ for details. |
iocore/net/UnixNet.cc
Outdated
      err = errno;
    }
    if (err != EAGAIN && err != EINTR) {
      vc->readSignalError(this, err);
Sorry, my mistake here. We should get the vio mutex locked first, before calling back to the SM. The new suggestion:
else if ((vc->read.enabled && vc->read.triggered) || vc->read.error)
Move "else if (vc->read.error) { ... }" into "vc->net_read_io()".
We cannot access any member of NetVC (e.g. vc->con.fd) before the vio mutex is locked.
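The locking rule stated here can be sketched generically with a try-lock: touch the connection's members only if the lock was acquired, otherwise back off and retry later. This sketch uses `std::mutex` instead of the ATS mutex macros, and `MockVio` and `with_vio_lock` are hypothetical names.

```cpp
#include <mutex>

// Hypothetical stand-in for a VIO with its mutex and the fd it protects.
struct MockVio {
  std::mutex mutex;
  int fd = -1;
};

// Run fn(vio) only while holding the vio mutex.
// Returns false if the lock could not be taken; the caller should retry later
// and must not touch vio members in that case.
template <typename F>
bool
with_vio_lock(MockVio &vio, F &&fn)
{
  std::unique_lock<std::mutex> lock(vio.mutex, std::try_to_lock);
  if (!lock.owns_lock()) {
    return false;
  }
  fn(vio);
  return true; // lock is released when it goes out of scope
}
```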
static void
read_from_net(NetHandler *nh, UnixNetVConnection *vc, EThread *thread)
{
  ...
  if (vc->closed) {
    close_UnixNetVConnection(vc, thread);
    return;
  }
+ // if it is not enabled and got an error from polling
+ if (!s->enabled && s->error) {
+   int err = 0;
+   socklen_t errlen = sizeof(err);
+   if (getsockopt(vc->con.fd, SOL_SOCKET, SO_ERROR, &err, &errlen) == -1) {
+     err = errno;
+   }
+   // pass nh, not this: read_from_net() is a static function
+   if (err != EAGAIN && err != EINTR) {
+     read_signal_error(nh, vc, err);
+   }
+   return;
+ }
  // if it is not enabled.
  if (!s->enabled || s->vio.op != VIO::READ) {
    read_disable(nh, vc);
    return;
  }
  ...
}
That makes sense :) I've cleaned that up (left it as a separate commit for review, although I'll squash it before merging). I did test that it's still fixing my bug, and it is, so I think we are set :)
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/1089/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/981/ for details. |
… VConnection
Previously, if the vcon wasn't read- or write-enabled, errors would be swallowed. This leads to a variety of issues where socket errors aren't dealt with immediately.
|
FreeBSD build successful! See https://ci.trafficserver.apache.org/job/Github-FreeBSD/1090/ for details. |
|
Linux build successful! See https://ci.trafficserver.apache.org/job/Github-Linux/982/ for details. |
|
An update to summarize the changes from today:
After looking into it, the core issue with the crash I was seeing is that the read and write sides of the VIOs were being called regardless of which side the error came in on. Really, we should handle the error on the appropriate side (in or out). So instead of allowing us to do the read (for example) when we get read OR error, this PR changes it so that we only call that routine if the event was on the read side. Then within that read handler we can check whether there was an error.
Secondly, instead of trying to unset the error state in the handler, I'm simply setting it to the appropriate value every time we get into the read/write blocks.
So, the PR is now updated (with the patch I've tested) and ready for merge! |
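The per-side dispatch described in this summary can be sketched as follows. It is a simplified model and not the merged code: the `EV_*` constants, `Side`, `MockVc` and `dispatch` are hypothetical stand-ins for the EVENTIO_* flags and the NetHandler loop, and an event that carries only the error flag is deliberately left unhandled here.

```cpp
#include <cstdint>

// Hypothetical stand-ins for EVENTIO_READ / EVENTIO_WRITE / EVENTIO_ERROR.
constexpr std::uint32_t EV_READ  = 1u << 0;
constexpr std::uint32_t EV_WRITE = 1u << 1;
constexpr std::uint32_t EV_ERROR = 1u << 2;

struct Side {
  bool triggered = false;
  bool error     = false;
};

struct MockVc {
  Side read;
  Side write;
};

// Only touch the side the event arrived on, and overwrite that side's error
// flag on every pass instead of trying to clear it later in the handler.
void
dispatch(MockVc &vc, std::uint32_t flags)
{
  const bool has_error = (flags & EV_ERROR) != 0;
  if (flags & EV_READ) {
    vc.read.triggered = true;
    vc.read.error     = has_error;
  }
  if (flags & EV_WRITE) {
    vc.write.triggered = true;
    vc.write.error     = has_error;
  }
}
```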
|
Looks good to me! |
After apache#947 (c1ac5f) and apache#1522 (a128d5), the EVENT_ERROR caused by EPOLLERR will be sent to read.vio._cont first and then to write.vio._cont. The reader SM could close or shutdown(WRITE) the VC, but we do not check for these operations before we call back write.vio._cont. The SM would receive EVENT_ERROR twice if write.vio._cont == read.vio._cont.
After #947 (c1ac5f) and #1522 (a128d5), the EVENT_ERROR caused by EPOLLERR will be sent to read.vio._cont first and then to write.vio._cont. The reader SM could close or shutdown(WRITE) the VC, but we do not check for these operations before we call back write.vio._cont. The SM would receive EVENT_ERROR twice if write.vio._cont == read.vio._cont. (cherry picked from commit aee3f3b)
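The double-delivery problem described in this commit message can be illustrated with one possible guard. This is a hypothetical sketch, not the actual follow-up fix: `MockVc`, `Handler` and `signal_error_both_sides` are invented names, and continuations are modeled as opaque pointers.

```cpp
#include <functional>

// Hypothetical stand-in for a NetVC with its two VIO continuations.
struct MockVc {
  bool closed            = false;
  bool write_shutdown    = false;
  const void *read_cont  = nullptr;
  const void *write_cont = nullptr;
};

// Callback invoked to deliver the error event to a continuation.
using Handler = std::function<void(MockVc &, const void *)>;

// Deliver the error to the read side first, then to the write side only if
// the reader did not close or shut down the VC and the write continuation is
// a different object. Returns the number of callbacks made.
int
signal_error_both_sides(MockVc &vc, const Handler &on_error)
{
  int calls = 0;
  if (vc.read_cont != nullptr) {
    on_error(vc, vc.read_cont);
    ++calls;
  }
  if (vc.closed || vc.write_shutdown) {
    return calls; // the reader already tore the VC down
  }
  if (vc.write_cont != nullptr && vc.write_cont != vc.read_cont) {
    on_error(vc, vc.write_cont);
    ++calls;
  }
  return calls;
}
```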
…s to the VConnection" this reverts PRs apache#1559, apache#1522 and apache#947 This reverts commit c1ac5f8.
This reverts PRs apache#1559, apache#1522 and apache#947. PR apache#947 made the HTTP state machine unstable and led to crashes in production, like apache#1930, apache#1559, apache#1522, apache#1531 and apache#1629. This reverts commit c1ac5f8.
Previously, if the vcon wasn't read- or write-enabled, errors would be swallowed. This leads to a variety of issues where socket errors aren't dealt with immediately.