
Conversation

lhoguin (Contributor) commented Mar 29, 2021

The default value of ?SEGMENT_ENTRY_COUNT is 16384. Due to
the index file format, the entire segment file has to be loaded
into memory whenever messages from that segment must be accessed.

This can result in hundreds of kilobytes being read, processed
and converted to Erlang terms. This creates a lot of garbage in
memory, and that garbage unfortunately ends up in the old heap,
so it cannot be reclaimed before a full GC. Even when forcing
a full GC on every run (fullsweep_after=0) the process ends up
allocating a lot of memory to read segment files, and this can
create issues in some scenarios.

While this is not a problem when there are only a small number
of queues, it becomes a showstopper when the number of queues
is larger (hundreds or thousands of queues). This only
applies to classic/lazy queues.

This commit makes the segment file entry count configurable
so that both the size of the file and the memory footprint of
reading it can be greatly reduced.

Experiments using a segment entry count of 1024 show no
noticeable downside other than the natural increase in the
number of segment files.

The segment entry count can only be set on nodes that have
no messages in their queue index. This is because the index
uses this value to calculate which segment file specific
messages are sitting in.
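
To make the last point concrete, here is a rough sketch of how the index maps a message's sequence id to a segment file. The parameterised form below is only illustrative; the actual arithmetic lives in rabbit_queue_index:

%% Illustrative sketch only: the real implementation in
%% rabbit_queue_index uses the (now configurable) segment entry
%% count in exactly this kind of arithmetic.
seq_id_to_seg_and_rel_seq_id(SeqId, SegmentEntryCount) ->
    {SeqId div SegmentEntryCount, SeqId rem SegmentEntryCount}.

%% Example: with the default count of 16384, SeqId 20000 maps to
%% segment 1 (offset 3616); with a count of 1024 the same SeqId
%% would be looked up in segment 19, so segment files written
%% under the old count would no longer be found.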

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

Further Comments

If this is a relatively large or complex change, kick off the discussion by
explaining why you chose the solution you did and what alternatives you
considered, etc.

lhoguin (Contributor, Author) commented Mar 29, 2021

This is what the memory looks like running over the weekend with a reduced segment entry count (1024 instead of 16384):

[screenshot: node memory over the weekend, 2021-03-29 12:20]

As you can see, RSS is stable at 1.5G (1.2-1.4G according to management). This is with 800 lazy queues. Current RabbitMQ master would not cope with even 500 lazy queues, as that would reach the memory_high_watermark at 2.2G.

There are about 20 million messages in the queues in total, with an expiration of 3 hours.

Process memory is typically in the 1-2MB range:

[screenshot: per-process memory, 2021-03-29 12:24]

lhoguin changed the title from "WIP (not tested yet): Add queue_index_segment_entry_count configuration" to "WIP: Add queue_index_segment_entry_count configuration" on Mar 29, 2021
lhoguin (Contributor, Author) commented Mar 29, 2021

I have tested the patch under Kubernetes.

@gerhard Do you want to try it with the customer payload?

@pjk25 I get "Permission denied" when I try to view the details of the Bazel check.

lhoguin (Contributor, Author) commented Mar 30, 2021

Patch running fine for 22 hours so far:

[screenshot: node memory, 2021-03-30 12:15]

gerhard (Contributor) commented Mar 30, 2021

I'm picking this up now. Do not merge before the results are back in.

gerhard (Contributor) commented Mar 30, 2021

@pjk25 do we have an example where we cache the make generic-unix-package result?

I would like to speed up the current feedback loop by ~4 minutes:

[screenshot]

HoloRin (Contributor) commented Mar 30, 2021

@gerhard there are some caching examples here https://github.com/rabbitmq/rabbitmq-server/blob/rabbitmq-server-2877/workflow_sources/test/tests.lib.yml, but I suspect they aren't really what you are looking for. The main workflow used some caching for the secondary umbrellas some time ago in https://github.com/rabbitmq/rabbitmq-server/blob/eefb8c2792a85c281416d8cb1a2092d305380db6/.github/workflows/test-erlang-otp-23.1.yaml

gerhard force-pushed the configurable-segment-entry-count branch from fc7b556 to 54f0ca0 on April 1, 2021 at 11:42
gerhard (Contributor) commented Apr 6, 2021

I am adding my findings to this thread. As soon as I am done, I will merge. This feels safe to backport to v3.8.x. Let me know if you disagree, @michaelklishin.

Is now a good time to remove the WIP from the title?

gerhard (Contributor) commented Apr 6, 2021

@pjk25 our tests are not looking healthy. Test / test (23) in GitHub Actions is failing & bazel test is Disconnected (Permission denied).

We can either remove these checks or fix them. I don't have enough current context to understand the feasibility of either option. As long as we don't allow ourselves to be OK with failing tests, I don't mind which option we choose.

What do you think @pjk25?

HoloRin (Contributor) commented Apr 6, 2021

@gerhard it looks to me like the build fails because I made some breaking changes to https://github.com/rabbitmq/bazel-erlang since this branch was created. Rebasing or merging master should fix it.

lhoguin changed the title from "WIP: Add queue_index_segment_entry_count configuration" to "Add queue_index_segment_entry_count configuration" on Apr 6, 2021
Loïc Hoguin added 2 commits April 6, 2021 12:25
lhoguin force-pushed the configurable-segment-entry-count branch from dfc360c to 3cab7d5 on April 6, 2021 at 10:25
lhoguin (Contributor, Author) commented Apr 6, 2021

Rebased to master.

HoloRin (Contributor) commented Apr 6, 2021

After the rebase it appears that:

  • //deps/rabbit:quorum_queue_SUITE-clustered
  • //deps/rabbit:rabbit_stream_queue_SUITE-cluster_size_3

have failed. I believe these two groups from their corresponding suites have been known to flake. I will retrigger the actions - it should only run the flaky tests this time.

gerhard (Contributor) commented Apr 6, 2021

Key take-aways

  • Nodes use 25%-35% less memory (14GB vs 22GB) with this patch and {queue_index_segment_entry_count, 1024} (see the config sketch after this list).
  • In a large fanout scenario - 1 msg to 50k queues - the number of messages routed to queues is 50% higher (4.9k vs 3.3k).
  • Disk space available shrinks and grows as expected. Before this patch, disk space would only shrink. It looks to me as if this change fixes a bug in scenarios where messages are only incoming (no consumers, only publishers), queues reach their message limit and start discarding old messages (default behaviour). This is a nice surprise, but I haven't looked too closely into it, other than what the Disk space available before publishers blocked graph is telling us (see screenshot below).
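
For anyone running outside Kubernetes: the same setting goes into the node's advanced.config, using the exact Erlang term shown in the manifest's advancedConfig below. As discussed above, only apply it to nodes whose queue indices are empty:

%% advanced.config
%% Lowers the classic queue index segment entry count from the
%% 16384 default to 1024. Only safe on nodes with no messages
%% in their queue indices.
[
  {rabbit, [
    {queue_index_segment_entry_count, 1024}
  ]}
].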

Test setup

We used the #2785 workload as the starting point.

Erlang 23.2.4 was built via the recently introduced OCI GitHub Action and deployed on GKE with the Cluster Operator. Expand the attached K8S manifest if you're interested in all the details:

K8S rabbitmq.yml
apiVersion: v1
kind: Namespace
metadata:
  name: vesc-1002
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: pr-2939-ca
  namespace: vesc-1002
spec:
  ca:
    secretName: pr-2939-ca
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: pr-2939-cert
  namespace: vesc-1002
spec:
  secretName: pr-2939-nodes-tls
  duration: 2160h
  renewBefore: 360h
  subject:
    organizations:
    - RabbitMQ
  commonName: pr-2939
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
  - server auth
  - client auth
  dnsNames:
  - '*.pr-2939-nodes.vesc-1002'
  - '*.pr-2939.vesc-1002'
  issuerRef:
    name: pr-2939-ca
    kind: Issuer
    group: cert-manager.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pr-2939-tls-config
  namespace: vesc-1002
data:
  inter_node_tls.config: |
    [
      {server, [
        {cacertfile, "/etc/rabbitmq/certs/ca.crt"},
        {certfile,   "/etc/rabbitmq/certs/tls.crt"},
        {keyfile,    "/etc/rabbitmq/certs/tls.key"},
        {secure_renegotiate, true},
        {fail_if_no_peer_cert, true},
        {verify, verify_peer},
        {customize_hostname_check, [
          {match_fun, public_key:pkix_verify_hostname_match_fun(https)}
        ]}
      ]},
      {client, [
        {cacertfile, "/etc/rabbitmq/certs/ca.crt"},
        {certfile,   "/etc/rabbitmq/certs/tls.crt"},
        {keyfile,    "/etc/rabbitmq/certs/tls.key"},
        {secure_renegotiate, true},
        {fail_if_no_peer_cert, true},
        {verify, verify_peer},
        {customize_hostname_check, [
          {match_fun, public_key:pkix_verify_hostname_match_fun(https)}
        ]}
      ]}
    ].
---
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: pr-2939
  namespace: vesc-1002
spec:
  replicas: 3
  image: pivotalrabbitmq/rabbitmq:3cab7d59a64ceb1f10eb0153cb001b17d71474b4
  resources:
    requests:
      cpu: 6
      memory: 40Gi
    limits:
      cpu: 6
      memory: 40Gi
  persistence:
    storageClassName: ssd
    storage: 100Gi
  rabbitmq:
    envConfig: |
      SERVER_ADDITIONAL_ERL_ARGS="-pa /usr/local/lib/erlang/lib/ssl-10.1/ebin -proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter-node-tls.config"
      RABBITMQ_CTL_ERL_ARGS="-pa /usr/local/lib/erlang/lib/ssl-10.1/ebin -proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter-node-tls.config"
    additionalConfig: |
      disk_free_limit.relative = 1.0
      vm_memory_high_watermark.relative = 0.7
      # management metrics can be disabled altogether
      # management_agent.disable_metrics_collector = true
      cluster_partition_handling = autoheal
      cluster_name = pr-2939
    advancedConfig: |
      [
        {rabbit, [
          {queue_index_segment_entry_count, 1024}
        ]}
      ].
    additionalPlugins:
    - rabbitmq_amqp1_0
  service:
    type: LoadBalancer
  tls:
    secretName: pr-2939-nodes-tls
    disableNonTLSListeners: true
  tolerations:
  - key: dedicated
    operator: Equal
    value: vesc-1002
    effect: NoSchedule
  override:
    statefulSet:
      spec:
        template:
          spec:
            volumes:
            - configMap:
                defaultMode: 420
                name: pr-2939-tls-config
              name: inter-node-config
            - name: pr-2939-nodes-tls
              secret:
                secretName: pr-2939-nodes-tls
                items:
                - key: ca.crt
                  mode: 416
                  path: ca.crt
                - key: tls.crt
                  mode: 416
                  path: tls.crt
                - key: tls.key
                  mode: 416
                  path: tls.key
            containers:
            - name: rabbitmq
              volumeMounts:
              - mountPath: /etc/rabbitmq/certs
                name: pr-2939-nodes-tls
              - mountPath: /etc/rabbitmq/inter-node-tls.config
                name: inter-node-config
                subPath: inter_node_tls.config
              securityContext:
                privileged: true
                runAsUser: 0
            nodeSelector:
              com.rabbitmq: vesc-1002
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rabbitmq
spec:
  podMetricsEndpoints:
  - interval: 15s
    scheme: https
    port: "15691"
  selector:
    matchLabels:
      app.kubernetes.io/component: rabbitmq
  namespaceSelector:
    any: false

We ended up with:

  • a 3-node cluster with 17k classic lazy queues (no replication) per node, for a total of 50k queues
  • 75 million messages @ 4KB each, evenly distributed across all 50k queues (each queue limited to 1.5k messages)

RabbitMQ Management Overview

Before:
[screenshot: Management overview, before the patch]

After:
[screenshot: Management overview, after the patch]

RabbitMQ Overview Dashboard

This captures all nodes across both clusters. Notice the more stable and more abundant memory with this patch. Also notice the disk space usage: with this patch it gets freed up periodically, whereas before it would only shrink. The rate of messages routed to queues is consistently 50% higher with this patch (the node with this patch is the red one in that graph).

[screenshot: RabbitMQ Overview dashboard, both clusters]

Erlang Allocators Dashboard

Before:
[screenshot: Erlang allocators, before the patch]

After:
[screenshot: Erlang allocators, after the patch]

In conclusion

This patch enables nodes with many classic lazy queues, such as the one described in #2785, to use less memory by keeping fewer messages in memory. The 16k default remains unchanged, but setting this value to 1k results in at least 25% lower memory usage and a more predictable utilisation pattern, as captured above. Great job @lhoguin, let's ship it 🛳️ 🚢 🤠

Disclaimer: by setting this advanced config property on existing nodes, with queues and messages, you take full responsibility for what happens when nodes reboot. Don't do it. It's only meant to be used on fresh deployments, and never changed. If you want to know why, see @lhoguin's comments above, or go straight to the code. More power to you!

cc @motmot80 via #2785
cc @Gsantomaggio @carlhoerberg via #1284

gerhard merged commit 98f33fc into master on Apr 6, 2021
gerhard deleted the configurable-segment-entry-count branch on April 6, 2021 at 12:03
gerhard (Contributor) commented Apr 6, 2021

I am backporting this to v3.8.x. Do we have Bazel there @pjk25?

This makes me think that we don't:

CONFLICT (modify/delete): deps/rabbit/BUILD.bazel deleted in HEAD and modified in 98f33fcc3d (Merge pull request #2939 from rabbitmq/configurable-segment-entry-count). Version 98f33fcc3d (Merge pull request #2939 from rabbitmq/configurable-segment-entry-count) of deps/rabbit/BUILD.bazel left in tree.

michaelklishin (Collaborator) commented Apr 6, 2021

We don't, but this may be a good moment to backport most of the Bazel work done up to this point.

gerhard (Contributor) commented Apr 6, 2021

Thanks @michaelklishin, leaving that to @pjk25 👍🏻

gerhard added a commit that referenced this pull request Apr 6, 2021
Add queue_index_segment_entry_count configuration

(cherry picked from commit 98f33fc)

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard (Contributor) commented Apr 6, 2021

Backported to v3.8.x via 08b286c

lhoguin (Contributor, Author) commented Apr 6, 2021

Mind blown by #1284 having much of the problem already figured out.
