
attempt to make asynchttpserver better; fixes #15925; [backport:1.0] #15957

Merged: 6 commits merged from araq-async-madness into devel on Nov 13, 2020

Conversation

@Araq (Member) commented Nov 13, 2020

No description provided.

@Araq requested a review from dom96 on November 13, 2020, 13:51
@disruptek (Contributor)

And just like that, #12325 was fixed. 👍

@Araq (Member Author) commented Nov 13, 2020

@disruptek Yeah, I had a vague memory of your PR. Good we're finally sorting it out.

@ringabout (Member)

There is an issue related to processClient: #3324

@dom96 (Contributor) left a comment

If I understand the problem this is solving then I think you can simplify this and make it clearer to the users:

  • maxClients argument to AsyncHttpServer
  • increment a clientsCount counter in processClient
  • decrement after connection is closed
  • don't accept unless clientsCount < maxClients.
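A minimal sketch of that counter-based alternative, using plain asyncnet and hypothetical maxClients/clientsCount names (illustrative only, not the API this PR adds):

import asyncnet, asyncdispatch, net

const maxClients = 100            # would be a newAsyncHttpServer argument
var clientsCount = 0              # connections currently being served

proc processClient(client: AsyncSocket) {.async.} =
  inc clientsCount                # one more open connection
  try:
    await client.send("HTTP/1.1 200\r\ncontent-length: 0\r\n\r\n")
  finally:
    client.close()
    dec clientsCount              # freed once the connection is closed

proc serveLoop() {.async.} =
  let server = newAsyncSocket()
  server.setSockOpt(OptReuseAddr, true)
  server.bindAddr(Port(8080))
  server.listen()
  while true:
    if clientsCount < maxClients:
      asyncCheck processClient(await server.accept())
    else:
      await sleepAsync(10)        # back off until a client disconnects

waitFor serveLoop()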

Comment on lines +36 to +41
## while true:
## if server.shouldAcceptRequest(5):
## var (address, client) = await server.socket.acceptAddr()
## asyncCheck processClient(server, client, address, cb)
## else:
## poll()
Contributor:

This example is under "basic usage", you shouldn't complicate it.

Member Author:

People copy&paste this section, it needs to be reasonably complete rather than simple. In fact, I should probably add some basic error handling there too.
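For instance, a sketch of that loop with minimal error handling, written against the shouldAcceptRequest/acceptRequest procs this PR adds (the unused port argument matches the signature as it currently stands; the exception handling is illustrative, not a proposed final form):

import asynchttpserver, asyncdispatch

proc cb(req: Request) {.async.} =
  await req.respond(Http200, "Hello World")

proc main() {.async.} =
  var server = newAsyncHttpServer()
  server.listen(Port(8080))
  while true:
    if server.shouldAcceptRequest(assumedDescriptorsPerRequest = 5):
      try:
        await server.acceptRequest(Port(8080), cb)   # port is currently unused, see below
      except OSError:
        echo "accept failed, retrying"               # log and keep serving
    else:
      await sleepAsync(500)       # descriptors are scarce; wait and retry

waitFor main()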

proc serve*(server: AsyncHttpServer, port: Port,
            callback: proc (request: Request): Future[void] {.closure, gcsafe.},
            address = "";
            assumedDescriptorsPerRequest = 5) {.async.} =
Contributor:

This is very arbitrary and therefore likely error prone.

Member Author:

It's much better than no limit and limits are always arbitrary.

##
## You should prefer to call `acceptRequest` instead with a custom server
## loop so that you're in control over the error handling and logging.
listen server, port, address
Contributor:

This shouldn't be here, unless I'm missing something.

@Araq (Member Author), Nov 13, 2020:

It has to be there thanks to bad API design (it should be part of newAsyncHttpServer IMHO but this would be a breaking change).

@dom96 (Contributor), Nov 13, 2020:

Won't this literally break code? You'll have code in the wild calling listen twice now.

Edit: oh, I see, didn't realise you introduced the listen.

@Araq (Member Author) commented Nov 13, 2020

If I understand the problem this is solving then I think you can simplify this and make it clearer to the users:

This wouldn't work as well because serving a single async request can require multiple different FDs, say if you load data from a DB before serving the request. It's much more robust to ask the event loop instead.
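Concretely, "asking the event loop" means comparing how many descriptors the dispatcher is tracking against the process-wide limit, roughly what shouldAcceptRequest does with the activeDescriptors/maxDescriptors helpers this PR adds to asyncdispatch (a sketch, not the exact stdlib body):

import asyncdispatch

proc roomForRequest(assumedDescriptorsPerRequest = 5): bool =
  # Enough spare descriptors to cover one worst-case request
  # (client socket plus any DB connections, files, etc. it may open)?
  maxDescriptors() - activeDescriptors() >= assumedDescriptorsPerRequest

echo roomForRequest()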

@c-blake (Contributor) commented Nov 13, 2020

I think I agree with @Araq here. At first my reaction was kind of like @dom96's, but then I thought about it a little more. The current way and its mathematical complement (a clients limit plus maxDescriptors() subtraction) give basically equivalent control, but a user is probably more likely to be able to estimate/bound how many new descriptors a worst-case request might need. So, if it were the max-clients way, most users wanting to be safe would be doing that same subtraction to go back to the complementary sense anyway. It is also easier to default "spare fds"/show in example code this way.

There might be a "silent underprovisioning" argument to be made. The lowest max fds I've ever seen in 30 years of using Unix is 64, on HP-UX in the 1990s, IIRC. So, std(in|out|err) + a couple of overhead fds + 5 headroom still leaves (probably) ~54 concurrent connections in the crazy worst case. That's actually still a lot of concurrent connections, TBH. Except for debugging/testing, as for this problem, people never lower such small fd limits. So the "spare fds way" doesn't really risk underprovisioning, and the complex guesstimation above kind of exhibits what happens to careful users in the "complementary coordinates". Such guesstimation seems ugly & more error-prone. You may be a forked process inheriting a bunch of fds, too. So, I think specifying from the spares/headroom side is best.

I do think having a 5-line API call to crank up the soft max to the hard max (somewhere) would be helpful, basically just:

import posix # add more error checks, of course
var fdLim: RLimit
discard getrlimit(RLIMIT_NOFILE, fdLim)
fdLim.rlim_cur = fdLim.rlim_max
discard setrlimit(RLIMIT_NOFILE, fdLim)

Maybe call it softToHardDescriptorLimit or something like that. (In truth the only errors possible here are EFAULT for a bad fdLim addr and EINVAL for a bad RLIMIT_NOFILE, but it's good to check all one's errors.) It may even make sense to layer it, like softToHardLimit(resource = RLIMIT_NOFILE), so it could work for any of the RLIMIT_*.
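A sketch of that layered helper under the names proposed above (nothing like it exists in the stdlib; POSIX-only):

import os, posix

proc softToHardLimit*(resource: cint = RLIMIT_NOFILE) =
  ## Raise `resource`'s soft limit up to its hard limit.
  var lim: RLimit
  if getrlimit(resource, lim) != 0:
    raiseOSError(osLastError())
  lim.rlim_cur = lim.rlim_max
  if setrlimit(resource, lim) != 0:
    raiseOSError(osLastError())

softToHardLimit()                  # e.g. lift RLIMIT_NOFILE before serving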

@Araq merged commit 562c627 into devel on Nov 13, 2020
@Araq deleted the araq-async-madness branch on November 13, 2020, 19:57
proc serve*(server: AsyncHttpServer, port: Port,
            callback: proc (request: Request): Future[void] {.closure, gcsafe.},
            address = "";
            assumedDescriptorsPerRequest = 5) {.async.} =
Member:

factor the assumedDescriptorsPerRequest = 5 default into a constant shared with the same default used in

proc shouldAcceptRequest*(server: AsyncHttpServer;
                          assumedDescriptorsPerRequest = 5): bool {.inline.} =
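One illustrative way to do that factoring, with a hypothetical constant name so the two defaults cannot drift apart (simplified signatures, placeholder bodies):

const defaultAssumedDescriptorsPerRequest = 5

proc shouldAcceptRequest(assumedDescriptorsPerRequest = defaultAssumedDescriptorsPerRequest): bool =
  true                             # placeholder body

proc serve(assumedDescriptorsPerRequest = defaultAssumedDescriptorsPerRequest) =
  discard shouldAcceptRequest(assumedDescriptorsPerRequest)

serve()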

@timotheecour (Member)

@Araq good PR, but please always squash and merge. This PR will break git bisect workflows, which I and others rely on a lot to find regressions/accidental bugfixes etc., because it keeps intermediate broken commits.


Can we do the proposal in #8664? It would prevent such issues in the future:

in nim-lang/Nim/settings, disable "Allow merge commits", as follows: [...]

(and even force squash and merge); in the rare cases where a squash and merge isn't desirable, the other merge strategies could be enabled temporarily

@dom96 (Contributor) left a comment

This wouldn't work as well because serving a single async request can require multiple different FDs, say if you load data from a DB before serving the request. It's much more robust to ask the event loop instead.

Then perhaps it should be part of asyncdispatch itself?


@Araq (Member Author) commented Nov 13, 2020

Won't this literally break code? You'll have code in the wild calling listen twice now.

But previously listen wasn't exported so I don't see how this could happen.

@Araq (Member Author) commented Nov 13, 2020

Then perhaps it should be part of asyncdispatch itself?

The APIs that are added are part of asyncdispatch, yes.

@c-blake (Contributor) commented Nov 13, 2020

The best place to limit fd creation for a network service is just before accept(2). Then the usual listen(2) backlog will just keep the requester clients in limbo until fds have been freed up and then accept(2) as usual. I believe that's what @Araq did.

@Araq (Member Author) commented Nov 13, 2020

(and even force squash and merge); in the rare cases where a squash and merge isn't desirable, the other merge strategies could be enabled temporarily

I know, it was a simple mistake. Sometimes GitHub defaults to the wrong button for some reason.

@dom96 (Contributor) commented Nov 13, 2020

The APIs that are added are part of asyncdispatch, yes.

No, I mean the handling of FDs. Wouldn't it be better to have it handled in accept automatically?

@c-blake (Contributor) commented Nov 13, 2020

There could be a safeAccept, I suppose, that no-ops if there aren't enough spare fds, but probably the event loop then also needs to be aware of a no-op return, as opposed to just successful or an error case to raise.
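A hypothetical safeAccept along those lines (illustrative only; it reports the no-op back to the caller via Option so the loop can decide to wait and retry), assuming asyncdispatch's activeDescriptors/maxDescriptors helpers:

import asyncnet, asyncdispatch, options

proc safeAccept(server: AsyncSocket,
                assumedDescriptorsPerRequest = 5): Future[Option[AsyncSocket]] {.async.} =
  # No-op when the event loop says descriptors are scarce; otherwise accept.
  if maxDescriptors() - activeDescriptors() < assumedDescriptorsPerRequest:
    return none(AsyncSocket)
  return some(await server.accept())

The caller would treat none as "poll()/sleep and try again later", which is the event-loop awareness mentioned above.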

@Araq (Member Author) commented Nov 13, 2020

Maybe, but this would be an even more invasive change. Also: people requested an API like mine for asynchttpserver before; nobody requested this feature for accept.

@c-blake (Contributor) commented Nov 13, 2020

Technically, this same problem exists for literally every system call returning a file descriptor. So, some kind of safeOpen template-y/-ish system might be warranted someday/somehow, but I think that could live side-by-side with @Araq's API.

@dom96 (Contributor) commented Nov 13, 2020

Then the usual listen(2) backlog will just keep the requester clients in limbo until fds have been freed up and then accept(2) as usual.

Huh, this implies that the OS should handle this for us then. If so why do we need a separate mechanism in the stdlib?

@c-blake (Contributor) commented Nov 13, 2020

The OS still limits your open fds. What I meant is that the clients will not see an ECONNREFUSED-type situation until the backlog buffer gets filled up. So, just not doing the accept syscall is more graceful, deferring failure if there is just a very transient spike in connections.

@Araq (Member Author) commented Nov 13, 2020

Technically, this same problem exists for literally every system call returning a file descriptor. So, some kind of safeOpen template-y/-ish system might be warranted someday/somehow, but I think that could live side-by-side with @Araq's API.

That's true but please consider:

This PR with an API change will be backported to version 1.0.x. I consider it a critical omission -- more critical than other bugfixes that we backported -- but at the same time I tried my best to keep the changes to a minimum.

@c-blake (Contributor) commented Nov 13, 2020

Oh, sure. I was just responding to the @dom96 idea of safeAccept. Re: the listen backlog/overall dynamics, it is also true that few logical situations allow you to just punt to a no-op and try again later. So, utility of safeOpen may be pretty limited.

@dom96 (Contributor) commented Nov 13, 2020

Does anyone have a real repro for issue #15925? All I see are synthetic code samples that don't seem realistic to me. I'm quite confused about how this would crop up in real code without it being due to a real FD leak.

@Araq (Member Author) commented Nov 13, 2020

Does anyone have a real repro for issue #15925?

As far as I know, this affects our very own forum software. (Still investigating, though.)

@c-blake (Contributor) commented Nov 13, 2020

The original case was in the Forum thread: the wrk HTTP workhorse tester with 1024 parallel connections (-c 1024). That is a lot more than is typical organically, unless you are under a DoS attack or are a Big Tech company. (EDIT: but our stuff should still handle this scenario.)

@dom96 (Contributor) commented Nov 13, 2020

Well... we just tested his exploit and it doesn't do anything. Furthermore, I checked the server and it has been running for 2 months without crashes.

@dom96 (Contributor) commented Nov 13, 2020

So my recommendation is to change this slightly:

  • It shouldn't be the default
  • The docs should be reverted to the basic example
  • Docs should be provided showing how to use this new API

Comment on lines +336 to +337
proc acceptRequest*(server: AsyncHttpServer, port: Port,
                    callback: proc (request: Request): Future[void] {.closure, gcsafe.}) {.async.} =
Contributor:

also, the port here is unused

@Araq (Member Author), Nov 13, 2020:

Ah, good one. Will do a follow-up PR later.

@c-blake (Contributor) commented Nov 13, 2020

I'm not sure how the Forum server process never died, but you also said "server", not "server process", so maybe it's ambiguous? I didn't look at the Forum code, anyway.

I do think it is not great example code/library behavior to just exhaust file descriptors. Reserving at least a few for the user code just makes very basic sense to me. I'm not sure why anyone would be against it. You need at least 1 free for the accept itself to even work.

My simpler test in #15925 (comment) behaved as one would expect with that wrk HTTP test program, judging from the straces that I did. I haven't yet tried it with the recent @Araq change. Could be wrinkles.

@Araq (Member Author) commented Nov 14, 2020

I'm not sure how the Forum server process never died, but you also said server not server process. So, maybe ambiguous? I didn't look at the Forum code, anyway.

The Forum process itself never died.

@c-blake (Contributor) commented Nov 14, 2020

I don't know what's going on with the Forum recovery from overload, but this simple program:

import asynchttpserver, asyncdispatch, asyncnet, posix

# Artificially lower the soft fd limit to 7 to provoke descriptor exhaustion.
var fdLim = RLimit(rlim_cur: 7, rlim_max: 1024)
discard setrlimit(RLIMIT_NOFILE, fdLim)
const
  s = "HTTP/1.1 200\r\ncontent-type: text/html\r\ncontent-length: 3\r\n\r\nHi\n"
proc svc(req: Request, staticDir="") {.async.} = await req.client.send(s)
proc cb(req: Request) {.async.} = await req.svc("static")

var server = newAsyncHttpServer()
waitFor server.serve(Port(8080), cb)

dies immediately with the old code even for wrk --latency -d 1s -t 1 -c 3 http://localhost:8080/ yet runs fine with wrk --latency -d 1s -t 1 -c 1024 http://localhost:8080/ with the new code. Raising an avoidable exception under overload (without an easy plan to clear the condition) seems bad while gracefully handling massive overload seems good. So, I think the new code is both more robust & a better example of managing a scarce (on Unix) resource. Like all code, it may be imperfect.
