
Conversation

lhoguin (Contributor) commented Mar 29, 2021

The default value of ?SEGMENT_ENTRY_COUNT is 16384. Due to
the index file format, the entire segment file has to be loaded
into memory whenever messages from that segment must be accessed.

This can result in hundreds of kilobytes being read, processed
and converted to Erlang terms. This creates a lot of garbage in
memory, and that garbage unfortunately ends up in the old heap,
so it cannot be reclaimed before a full GC. Even when forcing
a full GC on every run (fullsweep_after=0) the process ends up
allocating a lot of memory to read segment files, and this can
create issues in some scenarios.

While this is not a problem when there are only a small number
of queues, it becomes a showstopper when the number of queues
is larger (hundreds or thousands of queues). This only
applies to classic/lazy queues.

This commit makes the segment file entry count configurable
so that both the size of the file and the memory footprint of
reading it can be greatly reduced.

Experiments using a segment entry count of 1024 show no
noticeable downside other than the natural increase in the
number of segment files.

The segment entry count can only be set on nodes that have
no messages in their queue index. This is because the index
uses this value to calculate which segment file specific
messages are sitting in.
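
To make the last point concrete, here is a rough sketch of how the index maps a message's sequence id to a segment file. The parameterised form below is only illustrative; the actual arithmetic lives in rabbit_queue_index:

%% Illustrative sketch only: the real implementation in
%% rabbit_queue_index uses the (now configurable) segment entry
%% count in exactly this kind of arithmetic.
seq_id_to_seg_and_rel_seq_id(SeqId, SegmentEntryCount) ->
    {SeqId div SegmentEntryCount, SeqId rem SegmentEntryCount}.

%% Example: with the default count of 16384, SeqId 20000 maps to
%% segment 1 (offset 3616); with a count of 1024 the same SeqId
%% would be looked up in segment 19, so segment files written
%% under the old count would no longer be found.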

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

Further Comments

If this is a relatively large or complex change, kick off the discussion by
explaining why you chose the solution you did and what alternatives you
considered, etc.

lhoguin (Contributor, Author) commented Mar 29, 2021

This is what the memory looks like running over the weekend with a reduced segment entry count (1024 instead of 16384):

[screenshot: node memory over the weekend, 2021-03-29 12:20]

As you can see, RSS is stable at 1.5G (1.2-1.4G according to management). This is with 800 lazy queues. Current RabbitMQ master would not cope with even 500 lazy queues, as that would reach the memory_high_watermark at 2.2G.

There are about 20 million messages in the queues in total, with an expiration of 3 hours.

Process memory is typically in the 1-2MB range:

[screenshot: per-process memory, 2021-03-29 12:24]

lhoguin changed the title from "WIP (not tested yet): Add queue_index_segment_entry_count configuration" to "WIP: Add queue_index_segment_entry_count configuration" on Mar 29, 2021
lhoguin (Contributor, Author) commented Mar 29, 2021

I have tested the patch under Kubernetes.

@gerhard Do you want to try it with the customer payload?

@pjk25 I get "Permission denied" when I try to view the details of the Bazel check.

lhoguin (Contributor, Author) commented Mar 30, 2021

Patch running fine for 22 hours so far:

[screenshot: node memory, 2021-03-30 12:15]

gerhard (Contributor) commented Mar 30, 2021

I'm picking this up now. Do not merge before the results are back in.

gerhard (Contributor) commented Mar 30, 2021

@pjk25 do we have an example where we cache the make generic-unix-package result?

I would like to speed up the current feedback loop by ~4 minutes:

[screenshot]

HoloRin (Contributor) commented Mar 30, 2021

@gerhard there are some caching examples here https://github.com/rabbitmq/rabbitmq-server/blob/rabbitmq-server-2877/workflow_sources/test/tests.lib.yml, but I suspect they aren't really what you are looking for. The main workflow used some caching for the secondary umbrellas some time ago in https://github.com/rabbitmq/rabbitmq-server/blob/eefb8c2792a85c281416d8cb1a2092d305380db6/.github/workflows/test-erlang-otp-23.1.yaml

gerhard force-pushed the configurable-segment-entry-count branch from fc7b556 to 54f0ca0 on April 1, 2021 at 11:42
gerhard (Contributor) commented Apr 6, 2021

I am adding my findings to this thread. As soon as I am done, I will merge. This feels safe to backport to v3.8.x. Let me know if you disagree, @michaelklishin.

Is now a good time to remove the WIP from the title?

gerhard (Contributor) commented Apr 6, 2021

@pjk25 our tests are not looking healthy. Test / test (23) in GitHub Actions is failing & bazel test is Disconnected (Permission denied).

We can either remove these checks or fix them. I don't have enough current context to understand the feasibility of either option. As long as we don't allow ourselves to be OK with failing tests, I don't mind which option we choose.

What do you think @pjk25?

HoloRin (Contributor) commented Apr 6, 2021

@gerhard it looks to me like the build fails because I made some breaking changes to https://github.com/rabbitmq/bazel-erlang since this branch was created. Rebasing or merging master should fix it.

lhoguin changed the title from "WIP: Add queue_index_segment_entry_count configuration" to "Add queue_index_segment_entry_count configuration" on Apr 6, 2021
Loïc Hoguin added 2 commits April 6, 2021 12:25
lhoguin force-pushed the configurable-segment-entry-count branch from dfc360c to 3cab7d5 on April 6, 2021 at 10:25
lhoguin (Contributor, Author) commented Apr 6, 2021

Rebased to master.

HoloRin (Contributor) commented Apr 6, 2021

After the rebase it appears that:

  • //deps/rabbit:quorum_queue_SUITE-clustered
  • //deps/rabbit:rabbit_stream_queue_SUITE-cluster_size_3

have failed. I believe these two groups from their corresponding suites have been known to flake. I will retrigger the actions - it should only run the flaky tests this time.

gerhard (Contributor) commented Apr 6, 2021

Key take-aways

  • Nodes use 25%-35% less memory (14GB vs 22GB) with this patch and {queue_index_segment_entry_count, 1024} (see the config sketch after this list).
  • In a large fanout scenario - 1 msg to 50k queues - the number of messages routed to queues is 50% higher (4.9k vs 3.3k).
  • Disk space available shrinks and grows as expected. Before this patch, disk space would only shrink. It looks to me as if this change fixes a bug in scenarios where messages are only incoming (no consumers, only publishers), queues reach their message limit and start discarding old messages (default behaviour). This is a nice surprise, but I haven't looked too closely into it, other than what the Disk space available before publishers blocked graph is telling us (see screenshot below).
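
For anyone running outside Kubernetes: the same setting goes into the node's advanced.config, using the exact Erlang term shown in the manifest's advancedConfig below. As discussed above, only apply it to nodes whose queue indices are empty:

%% advanced.config
%% Lowers the classic queue index segment entry count from the
%% 16384 default to 1024. Only safe on nodes with no messages
%% in their queue indices.
[
  {rabbit, [
    {queue_index_segment_entry_count, 1024}
  ]}
].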

Test setup

We used the #2785 workload as the starting point.

Erlang 23.2.4 was built via the recently introduced OCI GitHub Action and deployed on GKE with the Cluster Operator. Expand the attached K8S manifest if you're interested in all the details:

K8S rabbitmq.yml
apiVersion: v1
kind: Namespace
metadata:
  name: vesc-1002
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: pr-2939-ca
  namespace: vesc-1002
spec:
  ca:
    secretName: pr-2939-ca
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: pr-2939-cert
  namespace: vesc-1002
spec:
  secretName: pr-2939-nodes-tls
  duration: 2160h
  renewBefore: 360h
  subject:
    organizations:
    - RabbitMQ
  commonName: pr-2939
  isCA: false
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
  - server auth
  - client auth
  dnsNames:
  - '*.pr-2939-nodes.vesc-1002'
  - '*.pr-2939.vesc-1002'
  issuerRef:
    name: pr-2939-ca
    kind: Issuer
    group: cert-manager.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: pr-2939-tls-config
  namespace: vesc-1002
data:
  inter_node_tls.config: |
    [
      {server, [
        {cacertfile, "/etc/rabbitmq/certs/ca.crt"},
        {certfile,   "/etc/rabbitmq/certs/tls.crt"},
        {keyfile,    "/etc/rabbitmq/certs/tls.key"},
        {secure_renegotiate, true},
        {fail_if_no_peer_cert, true},
        {verify, verify_peer},
        {customize_hostname_check, [
          {match_fun, public_key:pkix_verify_hostname_match_fun(https)}
        ]}
      ]},
      {client, [
        {cacertfile, "/etc/rabbitmq/certs/ca.crt"},
        {certfile,   "/etc/rabbitmq/certs/tls.crt"},
        {keyfile,    "/etc/rabbitmq/certs/tls.key"},
        {secure_renegotiate, true},
        {fail_if_no_peer_cert, true},
        {verify, verify_peer},
        {customize_hostname_check, [
          {match_fun, public_key:pkix_verify_hostname_match_fun(https)}
        ]}
      ]}
    ].
---
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: pr-2939
  namespace: vesc-1002
spec:
  replicas: 3
  image: pivotalrabbitmq/rabbitmq:3cab7d59a64ceb1f10eb0153cb001b17d71474b4
  resources:
    requests:
      cpu: 6
      memory: 40Gi
    limits:
      cpu: 6
      memory: 40Gi
  persistence:
    storageClassName: ssd
    storage: 100Gi
  rabbitmq:
    envConfig: |
      SERVER_ADDITIONAL_ERL_ARGS="-pa /usr/local/lib/erlang/lib/ssl-10.1/ebin -proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter-node-tls.config"
      RABBITMQ_CTL_ERL_ARGS="-pa /usr/local/lib/erlang/lib/ssl-10.1/ebin -proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter-node-tls.config"
    additionalConfig: |
      disk_free_limit.relative = 1.0
      vm_memory_high_watermark.relative = 0.7
      # management metrics can be disabled altogether
      # management_agent.disable_metrics_collector = true
      cluster_partition_handling = autoheal
      cluster_name = pr-2939
    advancedConfig: |
      [
        {rabbit, [
          {queue_index_segment_entry_count, 1024}
        ]}
      ].
    additionalPlugins:
    - rabbitmq_amqp1_0
  service:
    type: LoadBalancer
  tls:
    secretName: pr-2939-nodes-tls
    disableNonTLSListeners: true
  tolerations:
  - key: dedicated
    operator: Equal
    value: vesc-1002
    effect: NoSchedule
  override:
    statefulSet:
      spec:
        template:
          spec:
            volumes:
            - configMap:
                defaultMode: 420
                name: pr-2939-tls-config
              name: inter-node-config
            - name: pr-2939-nodes-tls
              secret:
                secretName: pr-2939-nodes-tls
                items:
                - key: ca.crt
                  mode: 416
                  path: ca.crt
                - key: tls.crt
                  mode: 416
                  path: tls.crt
                - key: tls.key
                  mode: 416
                  path: tls.key
            containers:
            - name: rabbitmq
              volumeMounts:
              - mountPath: /etc/rabbitmq/certs
                name: pr-2939-nodes-tls
              - mountPath: /etc/rabbitmq/inter-node-tls.config
                name: inter-node-config
                subPath: inter_node_tls.config
              securityContext:
                privileged: true
                runAsUser: 0
            nodeSelector:
              com.rabbitmq: vesc-1002
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: rabbitmq
spec:
  podMetricsEndpoints:
  - interval: 15s
    scheme: https
    port: "15691"
  selector:
    matchLabels:
      app.kubernetes.io/component: rabbitmq
  namespaceSelector:
    any: false

We ended up with:

  • a 3-node cluster with 17k classic lazy queues (no replication) per node, for a total of 50k queues
  • 75 million messages @ 4KB each, evenly distributed across all 50k queues (each queue limited to 1.5k messages)

RabbitMQ Management Overview

Before:
[screenshot: Management overview, before the patch]

After:
[screenshot: Management overview, after the patch]

RabbitMQ Overview Dashboard

This captures all nodes across both clusters. Notice the more stable and more abundant memory with this patch. Also notice the disk space usage: with this patch it gets freed up periodically, whereas before it would only shrink. The rate of messages routed to queues is consistently 50% higher with this patch (the node with this patch is the red one in that graph).

[screenshot: RabbitMQ Overview dashboard, both clusters]

Erlang Allocators Dashboard

Before:
[screenshot: Erlang allocators, before the patch]

After:
[screenshot: Erlang allocators, after the patch]

In conclusion

This patch enables nodes with many classic lazy queues, such as the one described in #2785, to use less memory by keeping fewer messages in memory. The 16k default remains unchanged, but setting this value to 1k results in at least 25% lower memory usage and a more predictable utilisation pattern, as captured above. Great job @lhoguin, let's ship it 🛳️ 🚢 🤠

Disclaimer: by setting this advanced config property on existing nodes, with queues and messages, you take full responsibility for what happens when nodes reboot. Don't do it. It's only meant to be used on fresh deployments, and never changed. If you want to know why, see @lhoguin's comments above, or go straight to the code. More power to you!

cc @motmot80 via #2785
cc @Gsantomaggio @carlhoerberg via #1284

gerhard merged commit 98f33fc into master on Apr 6, 2021
gerhard deleted the configurable-segment-entry-count branch on April 6, 2021 at 12:03
gerhard (Contributor) commented Apr 6, 2021

I am backporting this to v3.8.x. Do we have Bazel there @pjk25?

This makes me think that we don't:

CONFLICT (modify/delete): deps/rabbit/BUILD.bazel deleted in HEAD and modified in 98f33fcc3d (Merge pull request #2939 from rabbitmq/configurable-segment-entry-count). Version 98f33fcc3d (Merge pull request #2939 from rabbitmq/configurable-segment-entry-count) of deps/rabbit/BUILD.bazel left in tree.

michaelklishin (Collaborator) commented Apr 6, 2021

We don't, but this may be a good moment to backport most of the Bazel work done up to this point.

gerhard (Contributor) commented Apr 6, 2021

Thanks @michaelklishin, leaving that to @pjk25 👍🏻

gerhard added a commit that referenced this pull request Apr 6, 2021
Add queue_index_segment_entry_count configuration

(cherry picked from commit 98f33fc)

Signed-off-by: Gerhard Lazu <gerhard@lazu.co.uk>
gerhard (Contributor) commented Apr 6, 2021

Backported to v3.8.x via 08b286c

lhoguin (Contributor, Author) commented Apr 6, 2021

Mind blown by #1284 having much of the problem already figured out.
