Large amount of triggers and executions lead to performance issues #6024

Open
jgoguette opened this issue Nov 20, 2024 · 7 comments · May be fixed by #6050
Labels: area/backend (Needs backend code changes), bug (Something isn't working)

Comments

@jgoguette

jgoguette commented Nov 20, 2024

Describe the issue

When a large number of triggers/executions occur, the application slows down immensely.
Executions are created, but no tasks actually start, even when the execution is in the Running state.
And if tasks do start, they take a very long time to actually execute.

I tried setting up concurrency; with that I just get a huge backlog of executions to process.
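
This is roughly what that concurrency block looked like on each flow (a sketch only; the flow id is hypothetical and the exact limits are detailed further down in this thread):

id: webhook_to_kafka          # hypothetical flow id, for illustration only
namespace: staging

concurrency:
  limit: 125                  # maximum executions of this flow running at once
  behavior: QUEUE             # additional executions wait in the queue instead of failing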

This is a screenshot showing the current situation:
[Screenshot: 2024-11-20, 10:42 AM]

Examples showing an execution created hours ago that is still in the Running state, yet no tasks have been created:
[Screenshots: 2024-11-20, 11:14 AM]

My current setup for the triggers includes:

  • 3 Real time debezium mysql triggers
    -- 1 constant load (200+ per minute)
    -- 1 hourly bulk load (30K+ per hour)
    -- 1 minimal load (50+ per hour)
  • 1 Webhook
    -- Constant load (Roughly 1500+ per minute)

Using the pg_stat_statements PostgreSQL extension with the following query:

SELECT
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    max_exec_time
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 10;

I was able to identify the top 10 slowest queries, which you can find in the Google Sheet linked HERE.

Screenshot for quick glance:
[Screenshot: 2024-11-20, 1:42 PM]

Just in case this is needed, I've included the database details:

kestra=> SELECT
    n.nspname || '.' || c.relname AS full_table_name,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
    pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
    pg_size_pretty(pg_total_relation_size(c.oid) - pg_relation_size(c.oid)) AS indexes_size
FROM
    pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE
    c.relkind = 'r'  -- Only base tables
    AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY
    pg_total_relation_size(c.oid) DESC
LIMIT 10;
      full_table_name      | total_size | table_size | indexes_size
---------------------------+------------+------------+--------------
 public.queues             | 28 GB      | 7288 MB    | 21 GB
 public.executions         | 4191 MB    | 1050 MB    | 3141 MB
 public.execution_queued   | 1460 MB    | 702 MB     | 758 MB
 public.logs               | 570 MB     | 412 MB     | 158 MB
 public.metrics            | 294 MB     | 233 MB     | 60 MB
 public.executorstate      | 265 MB     | 214 MB     | 51 MB
 public.service_instance   | 1032 kB    | 976 kB     | 56 kB
 public.worker_job_running | 432 kB     | 8192 bytes | 424 kB
 public.flows              | 336 kB     | 48 kB      | 288 kB
 public.triggers           | 160 kB     | 80 kB      | 80 kB
(10 rows)

Environment

  • Kestra Version: latest

I'm using AWS:

  • EC2: m6i.4xlarge
  • RDS: db.r6g.4xlarge
  • Docker Swarm:
volumes:
  kestra-data:
    driver: local

services:
  kestra-webserver:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: &env
      - .env_encoded
    user: "root"
    command: server webserver
    volumes: &volumes
      - kestra-data:/app/storage
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/kestra-wd:/tmp/kestra-wd
    environment: &configuration
      KESTRA_CONFIGURATION: |
        datasources:
          postgres:
            url: jdbc:postgresql://endpoint:5432/kestra
            driverClassName: org.postgresql.Driver
            username: username
            password: password
        kestra:
          jdbc:
            cleaner:
              initialDelay: 1h
              fixedDelay: 1h
              retention: 3d
          tutorialFlows:
            enabled: false
          server:
            basic-auth:
              enabled: false
              username: "admin@kestra.io" # it must be a valid email address
              password: kestra
          repository:
            type: postgres
          storage:
            type: s3
            s3:
              accessKey: "accessKey"
              secretKey: "secretKey"
              region: "region"
              bucket: "bucket"
          queue:
            type: postgres
          tasks:
            tmp-dir:
              path: /tmp/kestra-wd/tmp
          url: https://url.com
          variables:
            globals:
              env: staging
          environment:
            name: Staging
            color: "#FCB37C"
    ports:
      - "80:8080"
      - "8081:8081"
    networks:
      - shared-overlay-network

  kestra-worker-1:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: *env
    user: "root"
    command: server worker --thread=256
    volumes: *volumes
    environment: *configuration
    depends_on:
    - kestra-webserver
    networks:
      - shared-overlay-network

  kestra-executor-1:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: *env
    user: "root"
    command: server executor
    volumes: *volumes
    environment: *configuration
    depends_on:
    - kestra-webserver
    networks:
      - shared-overlay-network

  kestra-worker-2:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: *env
    user: "root"
    command: server worker --thread=256
    volumes: *volumes
    environment: *configuration
    depends_on:
    - kestra-webserver
    networks:
      - shared-overlay-network

  kestra-executor-2:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: *env
    user: "root"
    command: server executor
    volumes: *volumes
    environment: *configuration
    depends_on:
    - kestra-webserver
    networks:
      - shared-overlay-network

  kestra-worker-3:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: *env
    user: "root"
    command: server worker --thread=256
    volumes: *volumes
    environment: *configuration
    depends_on:
    - kestra-webserver
    networks:
      - shared-overlay-network

  kestra-executor-3:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: *env
    user: "root"
    command: server executor
    volumes: *volumes
    environment: *configuration
    depends_on:
    - kestra-webserver
    networks:
      - shared-overlay-network

  kestra-scheduler-1:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: *env
    user: "root"
    command: server scheduler
    volumes: *volumes
    environment: *configuration
    depends_on:
    - kestra-webserver
    networks:
      - shared-overlay-network

networks:
  shared-overlay-network:
    external: true
    name: shared-overlay-network
@jgoguette jgoguette added area/backend Needs backend code changes area/frontend Needs frontend code changes bug Something isn't working labels Nov 20, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Issues Nov 20, 2024
@jgoguette jgoguette changed the title Poor performance when there is a large amount of triggers/executions to process. Extremely Poor performance when there is a large amount of triggers/executions to process. Nov 20, 2024
@MilosPaunovic MilosPaunovic removed the area/frontend Needs frontend code changes label Nov 20, 2024
@loicmathieu
Member

Hi,
Thanks for the detailed report. Some of the query performance numbers are surprising, so I need to dig in to understand what's going on.

As I understand it, you have almost 2k executions created per minute, which is a lot, but still something Kestra should be able to handle. Do you know how many tasks you have per execution?

All services are started on one big node, and you have a big database; regarding the database spec, it should be enough, so I need to look at the query performance.

I also see that there are queries on the execution_queued table, so you must have a flow with a concurrency limit. Is it the one triggered by the webhook?

@loicmathieu
Member

Four of the slow queries, the ones with an execution time of more than 1s, come from the dashboards (home, flow, execution). If you have a lot of executions, it is normal that those queries are slow; the only option today is to purge executions from the database more often.

Reducing the dashboard period would also make things better, but we don't offer a global setting for that yet.

The first query pops a queued execution from the list of queued executions; we use a lock to be able to select and then delete the record from the database. I checked, and it correctly uses the index.
I suspect you have a lot of concurrent executions, so it contends on the database lock. We'll discuss internally, but we may be able to improve that a bit.

Anyway, a database with more resources would also help, as it seems the database is not able to cope with the high number of executions you are creating.

@jgoguette
Author

As I understand it, you have almost 2k executions created per minute, which is a lot, but still something Kestra should be able to handle. Do you know how many tasks you have per execution?

Each flow has only 2 tasks: one to log which topic was hit, and another to produce a Kafka message.

I also see that there are queries on the execution_queued table, so you must have a flow with a concurrency limit. Is it the one triggered by the webhook?

All 4 of my flows had a concurrency limit set in the screenshot above.

I suspect you have a lot of concurrent executions, so it contends on the database lock. We'll discuss internally, but we may be able to improve that a bit.

I set the concurrency limit on each of the 4 flows to about 125 (500 total)

Anyway, a database with more resources would also help, as it seems the database is not able to cope with the high number of executions you are creating.

So is the bottleneck here only the database, and the fix simply to increase its specs? I'd imagine db.r6g.4xlarge would be sufficient for the work.

@loicmathieu
Member

Can you try to create an index to see if it provides any improvements?

create index execution_queued__flow_date on execution_queued(tenant_id, namespace, flow_id, "date");

I'd imagine db.r6g.4xlarge would be sufficient for the work.

It depends on how many executions you keep in the database. We have nice dashboards that show an execution overview for the last 30 days; those can consume a lot of database resources if they are displayed frequently. That's why I talked about purging executions.
If you don't already purge executions and have a high number of executions in the database, this may explain some of the slowness.
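
Purging is typically done with a scheduled flow built around the core Purge task. Here is a rough sketch; the task type, properties, and expressions below are from memory, so treat them as assumptions and check the Purge task documentation for your exact version:

id: purge_old_executions    # hypothetical flow, for illustration only
namespace: staging

tasks:
  - id: purge
    type: io.kestra.plugin.core.storage.Purge    # verify the exact task type in the plugin docs
    endDate: "{{ now() | dateAdd(-7, 'DAYS') }}" # drop everything older than 7 days
    purgeExecution: true                         # remove execution rows
    purgeLog: true                               # remove their logs as well

triggers:
  - id: nightly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 3 * * *"                            # run every night at 03:00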

I set the concurrency limit on each of the 4 flows to about 125 (500 total)

I would not expect that!
The concurrency limit is not meant to limit the number of parallel task executions; that is bounded by the worker thread pool. The concurrency limit is there to avoid running too many executions of a flow in parallel, to protect the target system, or when you want only one execution at a time. So usually you would not add it to all your flows, and you would use a much lower number.
It uses a database lock under the cover, so here you have a lot of executions concurrently locking the database, which can explain a lot!
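
To illustrate the intended use: a flow that must not overload a downstream system would typically look like this minimal sketch (the flow id is hypothetical):

id: load_target_system      # hypothetical flow, for illustration only
namespace: staging

concurrency:
  limit: 1          # only one execution at a time
  behavior: QUEUE   # later executions wait for the running one to finish

With high-throughput flows like yours, it is usually better to leave concurrency unset and let the worker thread pool regulate parallelism.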

@jgoguette
Author

This is what I'm working with right now without the concurrency.
[Screenshot: 2024-11-21, 2:24 PM]

Once I removed the concurrency limit on the flows, it seemed to perform a lot better.

Since I cleared my db:

create index execution_queued__flow_date on execution_queued(tenant_id, namespace, flow_id, "date");

I won't be able to test this out with my previous load from the screenshot :(

I would not expect that!
The concurrency limit is not meant to limit the number of parallel task executions; that is bounded by the worker thread pool. The concurrency limit is there to avoid running too many executions of a flow in parallel, to protect the target system, or when you want only one execution at a time. So usually you would not add it to all your flows, and you would use a much lower number.
It uses a database lock under the cover, so here you have a lot of executions concurrently locking the database, which can explain a lot!

This makes it clear to me. Thanks for the explanation.

Does Kestra have any recommendations for the number of worker threads to set depending on which EC2 instance is used?

I'm currently using:

EC2: 
- Type: m6i.4xlarge
- CPU: 16
- Memory: 64GB

What would be the appropriate number of worker threads for this setup?

@anna-geller anna-geller changed the title Extremely Poor performance when there is a large amount of triggers/executions to process. Large amount of triggers and executions lead to performance issues Nov 22, 2024
@loicmathieu
Member

Does Kestra have any recommendations for the number of worker threads to set depending on which EC2 instance is used?

By default, it will use 4 times the number of CPU cores, which is a sane default.
As CPU utilization will depend on the type of tasks you run, we cannot do anything smarter.
For example, if you run a lot of I/O-bound tasks, you can use more threads, as those will most likely be waiting on I/O; but if you run more CPU-intensive tasks (calculation or mapping), the default may already be a lot.

If you see low CPU utilization under load, you can increase the number of threads, but the default configuration should be a good compromise.
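
For your m6i.4xlarge (16 vCPUs), that default works out to 16 x 4 = 64 worker threads. Setting it explicitly in your Compose file would look like this sketch (same shape as your existing worker services, only the --thread value differs):

  kestra-worker-1:
    image: kestra/kestra:v0.19.9
    # 16 vCPUs x 4 = 64 threads, i.e. the documented default;
    # raise this only if CPU stays mostly idle under load
    command: server worker --thread=64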

@loicmathieu
Member

Anyway, since you validated that the concurrency limit is what caused the issue, it helps us identify some performance improvements to make in this area!
I will also try to improve the documentation.

Thanks for your detailed feedback, it helps us ;)

@loicmathieu loicmathieu linked a pull request Nov 22, 2024 that will close this issue