
Possible performance regression in the latest versions of locust #2690

Closed
morrisonli76 opened this issue Apr 26, 2024 · 17 comments

Labels
bug, stale (Issue had no activity. Might still be worth fixing, but don't expect someone else to fix it)

Comments

@morrisonli76

Prerequisites

Description

I used to use Amazon Linux 2 as the base OS for my load tests. Because the Python available on that OS is 3.7, the latest Locust I could get was 2.17.0. With 5 c5n.xlarge EC2 instances (each with 4 vCPUs) as workers, I could spawn 1200 users. The wait_time for the test was set to constant_throughput(1) so that a total load of 1200 RPS could be achieved.

Recently, I updated the base OS to Amazon Linux 2023. The Python version became 3.11, so I could use the latest version of Locust, 2.26.0. However, the above setup (5 c5n.xlarge EC2 instances) could not provide the desired load. It could only spawn about 830 users in total, and the total RPS was only around 330 even though the wait_time was still constant_throughput(1). I noticed that the CPU usage of each worker process was already close to 100%.

The server being tested did not change and the same locustfile was used for the tests, yet the performance of the two Locust setups was night-and-day different. This looks like a regression.

Here is the package list of the Python 3.11 environment:
Package Version

blinker 1.7.0
Brotli 1.1.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
ConfigArgParse 1.7
Flask 3.0.3
Flask-Cors 4.0.0
Flask-Login 0.6.3
gevent 24.2.1
geventhttpclient 2.2.1
greenlet 3.0.3
idna 3.7
itsdangerous 2.2.0
Jinja2 3.1.3
locust 2.26.0
MarkupSafe 2.1.5
msgpack 1.0.8
pip 22.3.1
psutil 5.9.8
pyzmq 26.0.2
requests 2.31.0
roundrobin 0.0.4
setuptools 65.5.1
urllib3 2.2.1
Werkzeug 3.0.2
zope.event 5.0
zope.interface 6.3

Command line

master side: locust -f /opt/locustfile.py --master
worker side: locust -f - --worker --master-host <master_ip> --processes -1

Locustfile contents

# Imports assumed from the rest of the original locustfile (only the user class was pasted);
# generate_event_id() and the custom command-line options are defined elsewhere in that file.
import random

from locust import HttpUser, task, constant_throughput


class QuickstartUser(HttpUser):
    wait_time = constant_throughput(2)

    def on_start(self):
        self.pixel_ids = self.environment.parsed_options.pixel_ids.split(",")
        self.client.verify = self.environment.parsed_options.verify_cert.lower() == "true"

    @task
    def cloudbridge(self):
        pixel_id = random.choice(self.pixel_ids)
        event_body = {
            "fb.pixel_id": pixel_id,
            "event_id": generate_event_id(),  # helper defined elsewhere in the original locustfile
            "event_name": self.environment.parsed_options.event_name,
            "conversion_value": {
                "value": "9",
                "currency": "USD",
            },
        }
        self.client.post(self.environment.parsed_options.path, json=event_body, name="event")
        # Closing the session discards the connection, so every task iteration
        # opens a new TCP connection (and a new TLS handshake for https).
        self.client.close()

Python version

3.11

Locust version

2.26.0

Operating system

Amazon Linux 2023

@cyberw
Collaborator

cyberw commented Apr 26, 2024

Hmm... there IS a known performance regression in OpenSSL 3.x (which usually comes with Python 3.12, but maybe your Python build is different somehow?), see #2555

The issue hits tests that close/reopen the connection especially hard (as the slowdown arises during SSL negotiation).

Can you check which SSL version you are running?
python -c "import ssl; print(ssl.OPENSSL_VERSION)"

As a workaround, see if you can run another Python version or keep connections alive (I know, not as realistic, but better than nothing).
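
For illustration, a minimal sketch of the "keep connections alive" workaround, assuming a task similar to the one in the locustfile above (the class name, path, and payload here are placeholders, not from the original test):

from locust import HttpUser, task, constant_throughput

class ReuseConnectionUser(HttpUser):
    wait_time = constant_throughput(1)

    @task
    def post_event(self):
        # No self.client.close() here: requests keeps the underlying connection
        # alive, so later iterations reuse the TCP/TLS session instead of paying
        # for a new SSL negotiation every time.
        self.client.post("/event", json={"value": "9"}, name="event")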

@morrisonli76
Author

Hi, I used Ubuntu 20.04 for the Amazon EC2 instances. I managed to install Python 3.10 and the latest Locust.

The CPU usage became low. However, the throughput did not follow the constant_throughput(1) spec: 1500 users only gave me less than 800 RPS.

Here is my python env:

(locust_env) ubuntu@ip-172-31-10-204:$ locust -V
locust 2.26.0 from /opt/locust_env/lib/python3.10/site-packages/locust (python 3.10.14)
(locust_env) ubuntu@ip-172-31-10-204:$ python3.10 -m pip list
Package Version


blinker 1.8.1
Brotli 1.1.0
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
ConfigArgParse 1.7
Flask 3.0.3
Flask-Cors 4.0.0
Flask-Login 0.6.3
gevent 24.2.1
geventhttpclient 2.2.1
greenlet 3.0.3
idna 3.7
itsdangerous 2.2.0
Jinja2 3.1.3
locust 2.26.0
MarkupSafe 2.1.5
msgpack 1.0.8
pip 24.0
psutil 5.9.8
pyzmq 26.0.2
requests 2.31.0
roundrobin 0.0.4
setuptools 69.5.1
tomli 2.0.1
urllib3 2.2.1
Werkzeug 3.0.2
wheel 0.43.0
zope.event 5.0
zope.interface 6.3

@cyberw
Collaborator

cyberw commented May 10, 2024

Hi! Did you check your ssl version?

python -c "import ssl; print(ssl.OPENSSL_VERSION)"

@morrisonli76
Author

Yes, I did that. In fact, I used Ubuntu 20.04, which uses OpenSSL 1.1.1f. I also updated Python to 3.10. With this setup the CPU usage was lower; however, I found that even if I set wait_time = constant_throughput(1) for the test user, 1500 users only gave me less than 800 RPS (I already mentioned this in my previous reply). I did not see this issue when I used Locust 2.17.0.

@cyberw
Collaborator

cyberw commented May 11, 2024

What are your response times like? Wait times can only limit throughput, not increase it, so if a task takes more than 1 s to complete you won't get 1 request/user/s.
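
As a rough back-of-the-envelope check (the helper and the numbers below are illustrative, not measurements from this thread), each user is capped both by its wait_time target and by how fast its synchronous requests actually complete:

def max_rps(users: int, target_per_user: float, avg_response_s: float) -> float:
    # Each user can do at most target_per_user requests/s (wait_time cap),
    # but also at most 1/avg_response_s requests/s (it waits for each response).
    return users * min(target_per_user, 1.0 / avg_response_s)

print(max_rps(1200, 1.0, 0.7))  # ~1200 rps while responses stay under 1 s
print(max_rps(1200, 1.0, 1.5))  # ~800 rps once responses take 1.5 s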

@morrisonli76
Author

The average response time is less than 700 ms. Also, when I used an older version of Locust (e.g. 2.17.0), I did not have this issue.

@cyberw
Collaborator

cyberw commented May 13, 2024

Hmm... the only thing I can think of is if Amazon is throttling somehow. What if you skip closing the session/connection? Can you see how many DNS lookups are made (using tcpdump or something else)? If you close the session, then maybe there is a new DNS lookup for each task iteration?
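
If it helps, one rough client-side way to spot-check DNS behaviour (separate from tcpdump; the hostname below is a placeholder) is to time a few lookups from a worker machine:

import socket
import time

host = "example.com"  # replace with the host of the system under test
for _ in range(5):
    start = time.perf_counter()
    socket.getaddrinfo(host, 443)
    print(f"DNS lookup took {(time.perf_counter() - start) * 1000:.1f} ms")

Consistently slow lookups here would point at name resolution rather than at Locust itself.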

@morrisonli76
Author

I can take a look to see if there is a new DNS lookup. However, with the same target server and the same tests, why did Locust 2.17.0 not have the issue? Was there any major change to the connection logic?

@cyberw
Collaborator

cyberw commented May 13, 2024

Not that I can think of :-/ But does 2.17.0 not exhibit this problem on python 3.11/Amazon Linux 2023?

@morrisonli76
Author

Just reporting back. I changed my system combination. Right now, I am using Amazon Linux 2 with Python 3.10. The SSL version is 1.1.1g. I also followed the instructions at https://repost.aws/knowledge-center/dns-resolution-failures-ec2-linux to enable the local DNS cache. With this setup, the latency is much lower and the CPU usage per worker is at a low level as well.

However, even with this setup, the RPS does not hold. I ran a test with 1200 users, each with a constant_throughput(1) request rate, and the RPS was quite far from 1200. It stopped around 800 and then started to drop on its own.

@cyberw
Collaborator

cyberw commented Jun 21, 2024

What are the response times? If a task takes more than the constant_pacing time, you’ll get falling throughput.

@morrisonli76
Author

I tried to run Locust 2.17 on the exact same OS (Amazon Linux 2 with Python 3.10). It also showed the same issue. I think the issue is on the load-test side, because the server being tested is the same. I suspect there could be something in the OS environment that slows down the connections.

However, one thing I don't understand is that when the number of users reaches the desired count, the RPS cannot reach the expected number; it starts to drop and eventually falls to a very low number. It seems Locust loses control of creating new connections.

I have enabled the local DNS cache. Is there anything else you would suggest I try?

Thanks

@cyberw
Collaborator

cyberw commented Jun 24, 2024

The main thing I would like to investigate is on the receiving end. Is there some throttling going on? How many Locust workers are you using? Are they spread out over multiple machines? Are they passing through a NAT?

> However, one thing I don't understand is that when the number of users reaches the desired count, the RPS cannot reach the expected number; it starts to drop and eventually falls to a very low number. It seems Locust loses control of creating new connections.

Again I ask: what are your response times? If response times increase enough, you'll get falling RPS. This is nothing to do with Locust, it is just math: if you have a certain number of concurrent users and response times go up, you'll get falling throughput.


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions bot added the "stale" label Aug 24, 2024
@morrisonli76
Author

I just got the latest Locust, 2.31. Everything else was the same, and the above issue was resolved. Was there any major improvement in 2.31?

@cyberw
Collaborator

cyberw commented Aug 26, 2024

There was a performance fix in requests 2.32.0, but it should really only be needed for OpenSSL 3.x, which you didn't have :) https://github.com/psf/requests/releases/tag/v2.32.0

But it's nice that it works for you now :) OK to close?

@cyberw
Collaborator

cyberw commented Aug 26, 2024

Or maybe what you were experiencing was a version of this: #2812? That was fixed in Locust 2.31.
