Degrading performance over time... #1821

Closed · vgoklani opened this issue Apr 17, 2017 · 44 comments

@vgoklani commented Apr 17, 2017

I wrote a quick AsyncScraper class below:

import logging, datetime, time
import aiohttp
import asyncio
import uvloop

# asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger.addHandler(logging.StreamHandler())

class AsyncScraper(object):
	headers = {"User-Agent" : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'}
	def __init__(self, max_connections=1000, timeout=10):
		self.max_connections = max_connections
		self.timeout = timeout
		
	async def get_response(self, url, session):
		with aiohttp.Timeout(timeout=self.timeout):
			async with session.get(url, allow_redirects=True, headers=AsyncScraper.headers, timeout=self.timeout) as response:
				try:
					content = await response.text()
					return {'error': "", 'status': response.status, 'url':url, 'content': content, 'timestamp': str(datetime.datetime.utcnow())}
				except Exception as err:
					return {'error': err, 'status': "", 'url':url, 'content': "", 'timestamp': str(datetime.datetime.utcnow())}
				finally:
					response.close()

	def get_all(self, urls):
		loop = asyncio.get_event_loop()
		with aiohttp.ClientSession(loop=loop, connector=aiohttp.TCPConnector(keepalive_timeout=10, limit=self.max_connections, verify_ssl=False)) as session:
			tasks = asyncio.gather(*[self.get_response(url, session) for url in urls], return_exceptions=True)
			results = loop.run_until_complete(tasks)
			return results

def chunks(l, n):
	for i in range(0, len(l), n):
		yield l[i:i + n]

def process_urls(urls, chunk_size=1000):
	scraper = AsyncScraper()

	results = []
	t0 = time.time()
	for i, urls_chunk in enumerate(chunks(sorted(set(urls)), chunk_size)):
		t1 = time.time()
		result = scraper.get_all(urls_chunk)
		success_size = len( [_ for _ in result if ((isinstance(_, Exception) is False) and (_['status']==200)) ] )
		results.extend(result)
		logger.debug("batch {} => success: {} => iteration time: {}s =>  total time: {}s => total processed {}".format(i+1, success_size, time.time()-t1, time.time()-t0, len(results)))
	return results

and I've run into two main issues:

  1. If I pass in a flat list of URLs, say 100k (via the get_all method), I get flooded with errors:

    2017-04-17 15:50:53,541 - asyncio - ERROR - Fatal error on SSL transport
    protocol: <asyncio.sslproto.SSLProtocol object at 0x10d5439b0>
    transport: <_SelectorSocketTransport closing fd=612 read=idle write=<idle, bufsize=0>>
    Traceback (most recent call last):
    File "/Users/vgoklani/anaconda3/lib/python3.6/asyncio/sslproto.py", line 639, in _process_write_backlog
    ssldata = self._sslpipe.shutdown(self._finalize)
    File "/Users/vgoklani/anaconda3/lib/python3.6/asyncio/sslproto.py", line 151, in shutdown
    raise RuntimeError('shutdown in progress')
    RuntimeError: shutdown in progress

  2. I then batched the URLs in chunks of 1,000 and timed each batch. The performance decay over time is clearly measurable (see below), and the number of errors also increased over time. What am I doing wrong?

    iteration 0 done in 16.991s
    iteration 1 done in 39.376s
    iteration 2 done in 35.656s
    iteration 3 done in 19.716s
    iteration 4 done in 29.331s
    iteration 5 done in 19.708s
    iteration 6 done in 19.572s
    iteration 7 done in 29.907s
    iteration 8 done in 23.379s
    iteration 9 done in 21.762s
    iteration 10 done in 22.091s
    iteration 11 done in 22.940s
    iteration 12 done in 31.285s
    iteration 13 done in 24.549s
    iteration 14 done in 26.297s
    iteration 15 done in 23.816s
    iteration 16 done in 29.094s
    iteration 17 done in 24.885s
    iteration 18 done in 26.456s
    iteration 19 done in 27.412s
    iteration 20 done in 29.969s
    iteration 21 done in 28.503s
    iteration 22 done in 28.699s
    iteration 23 done in 31.570s
    iteration 26 done in 31.898s
    iteration 27 done in 33.553s
    iteration 28 done in 34.022s
    iteration 29 done in 33.866s
    iteration 30 done in 36.351s
    iteration 31 done in 40.060s
    iteration 32 done in 35.523s
    iteration 33 done in 36.607s
    iteration 34 done in 36.325s
    iteration 35 done in 38.425s
    iteration 36 done in 39.106s
    iteration 37 done in 38.972s
    iteration 38 done in 39.845s
    iteration 39 done in 40.393s
    iteration 40 done in 40.734s
    iteration 41 done in 47.799s
    iteration 42 done in 43.070s
    iteration 43 done in 43.365s
    iteration 44 done in 42.081s
    iteration 45 done in 44.118s
    iteration 46 done in 44.955s
    iteration 47 done in 45.400s
    iteration 48 done in 45.987s
    iteration 49 done in 46.041s
    iteration 50 done in 45.899s
    iteration 51 done in 49.008s
    iteration 52 done in 49.544s
    iteration 53 done in 55.432s
    iteration 54 done in 52.590s
    iteration 55 done in 50.185s
    iteration 56 done in 52.858s
    iteration 57 done in 52.698s
    iteration 58 done in 53.048s
    iteration 59 done in 54.120s
    iteration 60 done in 54.151s
    iteration 61 done in 55.465s
    iteration 62 done in 56.889s
    iteration 63 done in 56.967s
    iteration 64 done in 57.690s
    iteration 65 done in 57.052s
    iteration 66 done in 67.214s
    iteration 67 done in 58.457s
    iteration 68 done in 60.882s
    iteration 69 done in 58.440s
    iteration 70 done in 60.755s
    iteration 71 done in 58.043s
    iteration 72 done in 65.076s
    iteration 73 done in 63.371s
    iteration 74 done in 62.800s
    iteration 75 done in 62.419s
    iteration 76 done in 61.376s
    iteration 77 done in 63.164s
    iteration 78 done in 65.443s
    iteration 79 done in 64.616s
    iteration 80 done in 69.544s
    iteration 81 done in 68.226s
    iteration 82 done in 78.050s
    iteration 83 done in 67.871s
    iteration 84 done in 69.780s
    iteration 85 done in 67.812s
    iteration 86 done in 68.895s
    iteration 87 done in 71.086s
    iteration 88 done in 68.809s
    iteration 89 done in 70.945s
    iteration 90 done in 72.760s
    iteration 91 done in 71.773s
    iteration 92 done in 72.522s

Each time here is the time taken to process one batch of 1,000 URLs. Please advise. Thanks.

@fafhrd91 (Member) commented Apr 17, 2017 via email

@vgoklani (Author)

I'm using 2.0.5. How do I check for open TCP connections?

@fafhrd91 (Member) commented Apr 17, 2017 via email

@vgoklani (Author)

I'm using Python 3.6.0.

@vgoklani (Author)

Am I making an obvious mistake in the code?

@fafhrd91 (Member) commented Apr 17, 2017 via email

@vgoklani (Author)

Is there anything specific I can do to help you test? I'm not clear on what you want me to do with netstat.

@fafhrd91 (Member)

Does the number of open connections grow?

@alex-eri (Contributor)

If you're on Linux, run this in a console while processing the first, second, middle, and last chunks of URLs:

PID=pid_of_server
netstat -npt | grep $PID/ | wc -l
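
A rough cross-platform alternative to that netstat check, as a sketch assuming the third-party psutil package (not something anyone in this thread used):

import psutil

def count_established(pid):
    # Count the TCP connections currently held open by the scraper process.
    proc = psutil.Process(pid)
    return sum(1 for conn in proc.connections(kind="tcp")
               if conn.status == psutil.CONN_ESTABLISHED)

print(count_established(12345))  # replace 12345 with the scraper's PID

Sampling this while the first, middle, and last chunks run gives the same signal as the netstat pipeline above.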

@mal commented May 23, 2017

I've just encountered this. 10k async requests using a TCPConnector with a limit of 100 initially start well, but almost immediately the number of connections seen via netstat drops away. It's a long tail, but eventually it hits 0, at which point all remaining requests throw.

Using an asyncio.Semaphore to rate-limit seems to work around this issue (netstat reports at or close to the limit for the duration), although I'm relatively new to asyncio in Python, so I'm unsure at what cost, if any.

Edit: Seen on Ubuntu 16.04 with Python 3.5.2 using:

aiohttp==2.0.7
appdirs==1.4.3
async-timeout==1.2.1
chardet==3.0.3
multidict==2.1.5
packaging==16.8
pyparsing==2.2.0
six==1.10.0
yarl==0.10.2
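
For reference, a minimal sketch of the Semaphore-based workaround described above (the names and the concurrency value are illustrative, not taken from the actual testcase):

import asyncio
import aiohttp

MAX_CONCURRENCY = 100  # illustrative; plays the role the connector limit was meant to play

async def fetch(session, semaphore, url):
    # The semaphore caps how many requests are in flight at once,
    # instead of relying on the TCPConnector limit.
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    connector = aiohttp.TCPConnector(limit=None)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, semaphore, url) for url in urls))

# loop = asyncio.get_event_loop()
# pages = loop.run_until_complete(main(["http://localhost:8080/"] * 1000))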

@argaen (Member) commented May 23, 2017

@vgoklani @mal can you try the same test with the timeout disabled? I know it's not ideal, but it's to check whether the error is caused by task cancellation or something else.

@mal commented May 24, 2017

Just ran it again with timeout=None in the session.request call; it doesn't appear to have changed the behaviour, connections still dry up. I've cleaned up my testcase a bit and made it reproducible locally. Hopefully this will help: https://gist.github.com/mal/53506cd81fa27c39afc4f7fbef7b7b5a

FWIW, it also runs through requests about 4 times faster using asyncio.Semaphore locally.

@thehesiod (Contributor)

Highly recommend trying with 3.5.2 to see if it's related to the SSL leak. Also, I use a pattern like this for running multiple tasks in parallel: https://gist.github.com/thehesiod/7081ab165b9a0d4de2e07d321cc2391d. Spawning too many async tasks makes things very slow, as the event loop has to walk through the giant list of tasks.
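
The gist itself isn't reproduced here, but a minimal sketch of that kind of worker-pool pattern (illustrative names and worker count, not the gist's actual code) looks roughly like this:

import asyncio
import aiohttp

NUM_WORKERS = 100  # illustrative: only this many tasks exist, no matter how many URLs are queued

async def worker(session, queue, results):
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as response:
                results.append((url, response.status, await response.text()))
        except Exception as err:
            results.append((url, None, err))
        finally:
            queue.task_done()

async def crawl(urls):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.ensure_future(worker(session, queue, results)) for _ in range(NUM_WORKERS)]
        await queue.join()  # wait until every queued URL has been processed
        for w in workers:
            w.cancel()  # the workers loop forever, so cancel them once the queue is drained
        await asyncio.gather(*workers, return_exceptions=True)
    return results

Because only NUM_WORKERS tasks ever exist, the event loop never has to walk a huge task list.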

@dteh commented May 25, 2017

@thehesiod I have the same issue, running 3.5.2. I ran this snippet overnight:

async def page_get(url):
    while True:
        try:
            t = time.time()
            async with aiohttp.request("GET", url) as req:
                resp_content = await req.read()
            print(time.time() - t, "seconds taken")
        except Exception as e:
            print(str(datetime.datetime.now()) + " - Connection error: {0}".format(str(e)))
        await asyncio.sleep(2)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait([page_get('http://www.google.com')]))

0.05777549743652344 seconds taken
0.05143165588378906 seconds taken
0.045487165451049805 seconds taken
0.0573725700378418 seconds taken
0.049524545669555664 seconds taken

Today when I checked it, I was getting:

1.3857007026672363 seconds taken
0.47487688064575195 seconds taken
0.23848772048950195 seconds taken
2.2507810592651367 seconds taken
0.25020551681518555 seconds taken
0.12477540969848633 seconds taken
2.4836790561676025 seconds taken
1.1492459774017334 seconds taken

@mal commented May 25, 2017

@thehesiod running the testcase I posted against 3.5.2, 3.5.3, and for good measure 3.6.1, all show the same degradation around the 7k mark.

@thehesiod (Contributor)

Mind taking snapshots of the fast and slow runs with pyvmmonitor?

@mal commented May 27, 2017

For whatever reason pyvmmonitor won't play ball (X11 issues, missing Python modules, unmentioned missing shared object files). Happy to provide cProfiles if those are helpful, but if you need the pyvmmonitor output, it looks like your best bet is running the testcase yourself, I'm afraid. Sorry I can't be of more help 🙁
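
For anyone wanting to capture similar data, a generic sketch of dumping a cProfile/pstats file for an asyncio testcase (not the exact commands used for the gist above) might look like this:

import asyncio
import cProfile
import pstats

async def main():
    # Stand-in for the testcase's top-level coroutine.
    await asyncio.sleep(0)

profiler = cProfile.Profile()
profiler.enable()
asyncio.get_event_loop().run_until_complete(main())
profiler.disable()

profiler.dump_stats("testcase.pstats")  # a pstats file that pstats itself (or tools like snakeviz) can read
pstats.Stats("testcase.pstats").sort_stats("cumulative").print_stats(20)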

@mal commented May 27, 2017

Of course as soon as I said that, I realised what was wrong. >_<

The testcase now includes fast and slow variants (fast = semaphore, slow = limit param) and the pstats file from running each.

@thehesiod (Contributor)

nice, that should help us, thanks!

@thehesiod (Contributor)

Looks like in the slow case, for some reason, ~60% of the time is spent in select.epoll; I can't even find it in the fast case yet on my laptop monitor. Also, I noticed that in slow.py you're not draining the connection (reading the body) before returning it to the pool; I forget whether this is now done automatically by ClientResponse.__aexit__. Between your fast and slow variants, you don't pass limit to fast, so I'm guessing the 'slowness' is aiohttp waiting to return connections to the pool. Any ideas @fafhrd91?

@thehesiod (Contributor)

@mal have you checked SSL vs non-SSL? It would help narrow down the problem.

@thehesiod (Contributor)

If it's SSL-only, you may want to try python/cpython#2270.

@mal commented Jun 20, 2017

Sorry for the delay, I was on vacation last week. I've not tried SSL vs non-SSL, but the testcase I was using is plain old HTTP, not HTTPS. Is there any reason to suspect that performance would be better with the additional overhead of HTTPS?

@thehesiod (Contributor)

Ah, OK. No, I just thought I'd check, since if it were HTTPS-only it could be explained by existing issues.

@asvetlov (Member)

Guys, could you test against the new aiohttp 2.2.0 release?

@mal commented Jun 20, 2017

Just re-ran the testcase using 2.2.0; the issue still exists. 😞

@mal commented Dec 14, 2017

Just re-ran the testcase against 2.3.6 and can confirm that the issue is still present 🙁 In this latest run, using the internal limit resulted in a total time for 10,000 requests that was ~4x slower (20.13s) than the manually implemented semaphore approach (5.83s).

asvetlov added this to the 3.0 milestone on Dec 14, 2017
@asvetlov (Member)

I'll try to find time, if nobody from @aio-libs/aiohttp-committers picks up the issue first.

asvetlov modified the milestones: 3.0, 3.1 on Feb 12, 2018
asvetlov modified the milestones: 3.1, 3.2 on Mar 22, 2018
@cecton (Contributor) commented Apr 11, 2018

It seems that @dionhaefner fixed it in DHI-GRAS/itsybitsy#1; maybe we can get some info here?

@dionhaefner

Not exactly fixed: I just used @mal's workaround of a Semaphore for rate-limiting rather than the limit keyword:

sem = asyncio.Semaphore(max_connections)
connector = aiohttp.TCPConnector(limit=None)
session = aiohttp.ClientSession(connector=connector)

async def get_page(url):
    async with sem:
        async with session.get(url):
            ...

This performs well for me.

@thehesiod (Contributor)

@dteh it's hard to use connections to google.com as a testcase given that its response times can vary widely; I suggest trying again with a local server.

@thehesiod (Contributor)

@mal I'll let @asvetlov comment on the non-semaphore case, but my understanding is that adding a lot of tasks will make everything slow, as the event loop has to walk through all the tasks on each blocking call. The semaphore / worker-pool approach (https://gist.github.com/thehesiod/7081ab165b9a0d4de2e07d321cc2391d) is the better strategy: only have as many worker tasks active as you are able to process.

@pfreixes (Contributor)

The implementations of asyncio.Semaphore, asyncio.Queue, and aiohttp.BaseConnector.connect (which applies back pressure to limit the number of concurrent connections) share the same pattern and are nearly identical. So either:

  1. aiohttp.BaseConnector.connect has a bug in how it implements the connection limit.
  2. There is a bug between the .get and .connect calls that only shows up when connections are limited and is avoided by an outer semaphore (because the back pressure is applied at the beginning).
  3. Something else.

@mal commented Apr 12, 2018

@thehesiod it's been a while since I looked at this, but it sounds like you're suggesting that having to use a semaphore or queue to get it performant is expected? If that is the case, I'd question what the limit parameter is meant to achieve, because (from memory) its description suggests it would support the identified use case. If it's not designed to do that, then maybe that's a feature that could be considered?

@cecton (Contributor) commented Apr 12, 2018

To me it seems that this issue is blocking enough to consider shipping a solution that works even if it isn't perfect. I know it has been around for a year now and there haven't been many complaints, but imagine people who want to run aiohttp in production; they can't because of an issue like this.

@asvetlov I propose we double-check and confirm this issue, then apply this manual semaphore mechanism directly to aiohttp on all maintained versions (as it seems to be the best way to solve it). We can still improve or replace it later if needed. What do you think?

@asvetlov (Member)

As @pfreixes commented, if aiohttp has a bug, it should be fixed.
We need to discover what's going on first.

@thehesiod (Contributor)

@mal the limit parameter is there to limit the number of concurrent connections, not to apply back pressure in general. Just as a note, I've used aiohttp to make millions of requests every day across multiple services in a production environment for years and haven't run into this issue, by using things like worker pools and semaphores.

@mal commented Apr 13, 2018

Maybe the issue is one of expectation? The crux of this seems to be "is this a bug, or is another method meant to be used to support this use case?", and no one really seems to be sure. I'd argue that right now it's at best unexpected behaviour that looks like a bug to the user, which results in responses ranging from mild frustration to simply not using the library.

@thehesiod (Contributor)

@mal after further reading of your code, I hadn't realized how similar you made the two codepaths: in both cases you have the same number of outstanding tasks; in one case they're waiting on a semaphore to unblock them, and in the other on the TCPConnector's limiting. I made another change so that the fast case uses a limit as well, to ensure both cases are comparable, and it had no effect on the fast case's performance, which points to an issue in aiohttp.

So this is rather interesting. We know from the performance trace that the selectors.py:select method is what makes things run slowly (it doesn't even show up in the fast trace), so it sounds like aiohttp is limiting connections sub-optimally, as you suggested. I now think this is definitely a performance issue in aiohttp. I'm guessing it's due to the waiters mechanism in BaseConnector: you can envision that at the start LIMIT connections get created, and then LIMIT * 99 tasks end up as "waiters". I'm wondering if this can be replaced with a simple semaphore or condition.

Apologies for making an assumption about your example. Thanks again for pushing this issue! I think I'll have some time to experiment a bit with this today, but hopefully @asvetlov can pursue this separately. I'm not a contributor, but am an interested party :)

@thehesiod (Contributor)

So I created a PR which greatly speeds up both cases. What it does is avoid iterating through the list of waiters in non-exceptional cases.

@thehesiod (Contributor)

On my machine it went from ~13s to 3s in the slow case.

@thehesiod (Contributor) commented Apr 19, 2018

Anyone want to verify the results from my PR?

@asvetlov (Member) commented May 4, 2018

Should be fixed by #2937 and #2964

asvetlov closed this as completed on May 4, 2018
lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs.
If you feel there are important points made in this discussion, please include those excerpts in that [new issue].
[new issue]: https://github.com/aio-libs/aiohttp/issues/new

lock bot added the outdated label on Oct 28, 2019
lock bot locked as resolved and limited conversation to collaborators on Oct 28, 2019