a performance issue for epoll idle #4

lihuiba · 2017-04-23T06:49:06Z

I came across a performance issue in epoll mode, when there were thousands concurrent connections. Profiling shows that _st_epoll_dispatch() consumed a lot of CPU.

After reviewing the function, I think I've found the reason: there's a loop that enumerates ALL threads in the I/O queue.

for (q = _ST_IOQ.next; q != &_ST_IOQ; q = q->next) {

As I'm using one thread per connection model, I believe this loop make epoll mode degraded effectively to select mode.

The text was updated successfully, but these errors were encountered:

winlinvip · 2017-04-25T03:58:01Z

I find the code in

state-threads/event.c

Line 1281 in d04dcb5

for (q = _ST_IOQ.next; q != &_ST_IOQ; q = q->next) {

When there are some epoll events(the nfd >0), it will iterate the _ST_IOQ, a loop link list. The _ST_IOQ is the waiting queue, fd will push in queue in st_poll in

state-threads/sched.c

Line 81 in d04dcb5

_ST_ADD_IOQ(pq);

I don't know why need to iterate all blocked fds? How to fix this?

winlinvip · 2017-04-25T04:00:03Z

It seems to find the thread which is waiting on fd, and switch to the thread.

lihuiba · 2017-04-25T07:56:18Z

Yes it does, but I don't think it's necessary.

lihuiba · 2017-04-25T07:59:32Z

I believe we can store thread info in _st_epoll_data[], and resume the threads without iterating the IOQ.

xiaosuo · 2017-06-02T13:15:06Z

I do think there aren't many users now, so nobody cares.

lihuiba · 2017-06-03T02:53:54Z

I'm a heavy user of ST. Please make sure there is a problem, and I'll arrange a patch.

xiaosuo · 2017-06-03T04:40:44Z

Yes. It is a issue. But I am afraid that your way doesn't work. There maybe more threads waiting on a fd, and there isn't any reference to the monitoring fds. I think you'd better extend _epoll_fd_data, and record _st_pollq in _epoll_fd_data when inserting it into IOQ.

BTW: Maybe EPOLLONESHOT and EPOLLET can be used to get more better performance.

winlinvip · 2017-06-03T09:21:02Z

For performance issue, please write a test example and do some benchmarks, we SHOULD NEVER guess about it.

xiaosuo · 2017-06-03T10:12:19Z

FYI: I just cooked a patch for this issue, and will submit a PR after testing.

@winlinvip Is there any unit testing case?

xiaosuo · 2017-06-03T16:31:39Z

See #5 @lihuiba @winlinvip

winlinvip · 2017-06-04T03:12:36Z

@xiaosuo Thanks for your work and PR, that's great. And can you write a example code, which illustrate the problem? For example, we can write a program, it use 90% CPU or there is a performance report indicates the bottleneck. Then we patch ST with your PR, it use less CPU or the performance report finger it out.

Please write a benchmark to prove the PR really fixed the problem.

xiaosuo · 2017-06-04T03:52:08Z

@lihuiba reported this issue. I think he will test the PR with his workload.

I tested the server in the example directory, and ab reported slight improvements on both QPS and latency.

lihuiba · 2017-06-04T06:01:20Z

@winlinvip @xiaosuo I found the issue in my production server, where there were lots of idle connections. The server consumed several times more CPU resource than expected. I believe the situation could be easily reproduced, and I'll try the patch in a test environment, before launching it to my production server.

winlinvip · 2017-06-04T07:16:16Z

@lihuiba Please test it, thanks~ If you can write a simple test example, it will be better, because it's necessary for any performance issue.

xiaosuo · 2017-06-08T13:58:01Z

@lihuiba what is the result？

lihuiba · 2017-06-09T05:26:58Z

@xiaosuo I'm on sth else these days, and I think I'll do the tests next week.

winlinvip · 2017-06-10T00:39:11Z

@lihuiba 👍 #5

lihuiba · 2017-06-14T07:36:30Z

the patched ST is much slower than before, due to the added functions, as the figure below:

lihuiba · 2017-06-14T07:37:32Z

there were 10,000 idle connections (from a custom tool) in the test, and only 1 active connection from ab.

xiaosuo · 2017-06-14T08:42:07Z

@lihuiba Are many coroutines monitoring a FD?

Is a coroutine monitoring lots of FDs?

lihuiba · 2017-06-15T03:29:31Z

@xiaosuo No. One coroutine is responsible for a single connection.

xiaosuo · 2017-06-15T03:58:34Z

@lihuiba Would you provide the testing codes?

lihuiba · 2017-06-15T06:21:32Z

1. diff of server.c:

@@ -942,7 +942,7 @@ void handle_session(long srv_socket_index, st_netfd_t cli_nfd)
    struct in_addr *from = st_netfd_getspecific(cli_nfd);
  
 -  if (st_read(cli_nfd, buf, sizeof(buf), SEC2USEC(REQUEST_TIMEOUT)) < 0) {
 +  if (st_read(cli_nfd, buf, sizeof(buf), ST_UTIME_NO_TIMEOUT) < 0) {
      err_sys_report(errfd, "WARN: can't read request from %s: st_read",
  		   inet_ntoa(*from));
      return;

2. idle connections:

#include <sys/types.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/time.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>

int Socket(const char *host, int clientPort)
{
    int sock;
    unsigned long inaddr;
    struct sockaddr_in ad;
    struct hostent *hp;
    
    memset(&ad, 0, sizeof(ad));
    ad.sin_family = AF_INET;

    inaddr = inet_addr(host);
    if (inaddr != INADDR_NONE)
        memcpy(&ad.sin_addr, &inaddr, sizeof(inaddr));
    else
    {
        hp = gethostbyname(host);
        if (hp == NULL)
            return -1;
        memcpy(&ad.sin_addr, hp->h_addr, hp->h_length);
    }
    ad.sin_port = htons(clientPort);
    
    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return sock;
    if (connect(sock, (struct sockaddr *)&ad, sizeof(ad)) < 0)
        return -1;
    return sock;
}

int  main(int argc, char **argv)
{   
    int idleTime = atoi(argv[4]);
    int numberOfConnection = atoi(argv[3]);
    int clientPort = atoi(argv[2]);
    for (int i = 0; i < numberOfConnection; ++i) {
        int s = Socket(argv[1], clientPort);
        if (s < 0)
            printf("error connect to %d\n", clientPort);
    }
    sleep(idleTime);
    exit(0);
}

3. ab:

ab -n 1000  -c 1  localhost:8080

xiaosuo · 2017-06-15T08:22:38Z

It is strange. I ran your code, and got an opposite conclusion.

./server -t 1 -p 1 -b 127.0.0.1:8888 -l . -i

Could you run your code again?

lihuiba · 2017-06-15T11:37:40Z

I found it quite slow to create large number of connections to the examples/server.c, about 1 second per connection. So I wrote a minimal server myself:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <fcntl.h>
#include <signal.h>
#include <pwd.h>
#include "st.h"

void* handle_connection(void* s_)
{
  st_netfd_t s = s_;

  char buf[512];
  st_read(s, buf, sizeof(buf), ST_UTIME_NO_TIMEOUT);

  static char resp[] =
    "HTTP/1.0 200 OK\r\n"
    "Content-type: text/html\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<H2>It worked!</H2>\n";
  st_write(s, resp, sizeof(resp) - 1, ST_UTIME_NO_TIMEOUT);

  st_netfd_close(s);
}

int main()
{
  st_set_eventsys(ST_EVENTSYS_ALT);
  st_init();
  st_randomize_stacks(1);

  int ret, sock = socket(PF_INET, SOCK_STREAM, 0);

  struct sockaddr_in serv_addr;
  memset(&serv_addr, 0, sizeof(serv_addr));
  serv_addr.sin_family = AF_INET;
  serv_addr.sin_addr.s_addr = INADDR_ANY;
  serv_addr.sin_port = htons(8086);
  ret = bind(sock, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
  ret = listen(sock, 16);

  st_netfd_t server = st_netfd_open_socket(sock);

  while(1)
  {
    st_netfd_t c = st_accept(server, NULL, 0, ST_UTIME_NO_TIMEOUT);
    st_thread_create(handle_connection, c, 0, 0);
  }
}

I created 10,000 idle connections in the test. and I run ab as:

ab -n 10000  -c 1  http://localhost:8086/

And I got a very promising result:

181161 root      20   0  765968  56244    408 R  84.2  0.1   0:21.25 server
Time taken for tests:   4.679 seconds

and

184781 root      20   0  766140  56460    408 S  45.7  0.1   0:02.50 server-patched
Time taken for tests:   0.561 seconds

The result showed that the server with pathed ST was 8.3 times (4.679/0.561) faster than before, and consumed much less CPU resource.

Greate job!

lihuiba · 2017-06-15T11:44:23Z

perf top of the patched:

lihuiba · 2017-06-15T11:45:38Z

perf top of the original one:

xiaosuo · 2017-06-15T11:46:57Z

Cheers!

winlinvip · 2017-06-28T03:00:06Z

👍 So the patch works well when there are lots of idle connections, right? Could you please write more use scenarios, such as:

Please use large connections, like 10k or 30k connections, with only a small amount alive connections. It's like your example, but please use more connections.
All connections are active, for example, there are 10k connections. And please test about the timeout and cond feature, for example, some connections are sleeping while others are busy.

Maybe you should let ST to allocate memory in heap instead of stack for large amount of corotines:

make EXTRA_CFLAGS="-DMALLOC_STACK" linux-debug

winlinvip · 2017-06-28T03:13:54Z

👍 @xiaosuo @lihuiba
It's awesome, really great job!

lihuiba · 2017-11-01T07:10:15Z

Have you merged the PR?

winlinvip · 2018-01-02T02:59:50Z

I'm still waiting for the test for active connections 😃

winlinvip · 2018-01-02T03:30:12Z

I have merge the PR to https://github.com/ossrs/state-threads/tree/features/xiaosuo/epoll
The merged code is here: srs...features/xiaosuo/epoll
Please test it and file PR to it for updates.

winlinvip · 2021-02-24T01:53:41Z

最近在做perf分析时，发现这个切换会比较占CPU，这个问题值得解决。

winlinvip · 2021-10-01T00:40:07Z

需要补全utest，覆盖到正常和异常逻辑，才能合并，否则上线后出现偶现的问题，会对质量造成非常大的影响。

utest框架已经完成了：https://github.com/ossrs/state-threads#utest-and-coverage

xiaosuo mentioned this issue Jan 2, 2018

epoll: Don't iterate all the fds when using epoll #5

Open

winlinvip changed the title ~~a performance issue for epoll~~ a performance issue for epoll idle Jan 26, 2021

enochi mentioned this issue Jan 30, 2023

如果协程释放将内存也同时释放,为什么会崩溃 #35

Closed

lihuiba mentioned this issue Sep 17, 2024

Recommend an alternative coroutine implementation ossrs/srs#4179

Closed

a performance issue for epoll idle #4

a performance issue for epoll idle #4

Comments

lihuiba commented Apr 23, 2017

winlinvip commented Apr 25, 2017

winlinvip commented Apr 25, 2017 • edited Loading

lihuiba commented Apr 25, 2017

lihuiba commented Apr 25, 2017

xiaosuo commented Jun 2, 2017

lihuiba commented Jun 3, 2017

xiaosuo commented Jun 3, 2017

winlinvip commented Jun 3, 2017 • edited Loading

xiaosuo commented Jun 3, 2017

xiaosuo commented Jun 3, 2017

winlinvip commented Jun 4, 2017

xiaosuo commented Jun 4, 2017 • edited Loading

lihuiba commented Jun 4, 2017

winlinvip commented Jun 4, 2017

xiaosuo commented Jun 8, 2017

lihuiba commented Jun 9, 2017

winlinvip commented Jun 10, 2017 • edited Loading

lihuiba commented Jun 14, 2017

lihuiba commented Jun 14, 2017

xiaosuo commented Jun 14, 2017

lihuiba commented Jun 15, 2017

xiaosuo commented Jun 15, 2017

lihuiba commented Jun 15, 2017 • edited Loading

1. diff of server.c:

2. idle connections:

3. ab:

xiaosuo commented Jun 15, 2017

lihuiba commented Jun 15, 2017

lihuiba commented Jun 15, 2017

lihuiba commented Jun 15, 2017

xiaosuo commented Jun 15, 2017

winlinvip commented Jun 28, 2017 • edited Loading

winlinvip commented Jun 28, 2017

lihuiba commented Nov 1, 2017

winlinvip commented Jan 2, 2018

winlinvip commented Jan 2, 2018 • edited Loading

winlinvip commented Feb 24, 2021

winlinvip commented Oct 1, 2021 • edited Loading

winlinvip commented Apr 25, 2017 •

edited

Loading

winlinvip commented Jun 3, 2017 •

edited

Loading

xiaosuo commented Jun 4, 2017 •

edited

Loading

winlinvip commented Jun 10, 2017 •

edited

Loading

lihuiba commented Jun 15, 2017 •

edited

Loading

winlinvip commented Jun 28, 2017 •

edited

Loading

winlinvip commented Jan 2, 2018 •

edited

Loading

winlinvip commented Oct 1, 2021 •

edited

Loading