
[Doc][KubeRay]: Redis eviction suggestions when ENABLE_GCS_FT_REDIS_CLEANUP=false #40949

Merged 3 commits into ray-project:master on Nov 17, 2023

Conversation

@rueian (Contributor) commented Nov 4, 2023

Why are these changes needed?

As discussed with @kevin85421, it would be better to provide a guide, as well as a warning, in the documentation about using Redis's native eviction instead of KubeRay's Redis cleanup.

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kevin85421 (Member) left a comment:

Would you mind providing a YAML file in your PR description and sharing more details about the expected behavior?

> * `maxmemory=<your_memory_limit>`
> * `maxmemory-policy=allkeys-lru`
>
> These two options instruct Redis to delete the least recently used keys when it reaches the `maxmemory` limit.
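For illustration, these options can live in `redis.conf`, be passed as `redis-server` command-line flags (as in the YAML later in this thread), or be applied to a running server. A minimal sketch using the `redis` Python client; the `redis://localhost:6379` address is just a placeholder:

```python
import redis

# Connect to the Redis instance backing GCS fault tolerance.
r = redis.from_url("redis://localhost:6379")

# Cap memory at 60MB and evict the least recently used keys at the cap.
# Note: CONFIG SET applies at runtime only; it does not rewrite redis.conf.
r.config_set("maxmemory", "60mb")
r.config_set("maxmemory-policy", "allkeys-lru")

# Verify both settings took effect.
print(r.config_get("maxmemory*"))
```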
Comment from a Member:

I am not familiar with Redis. What's the definition of "used keys" in Redis? Will the timestamp be updated for a running RayCluster?

@rueian (Contributor, author) replied:

Yes, the access timestamp is updated whenever any Redis operation touches the key.

A RayCluster stores all its metadata in one Redis key and keeps updating it; therefore, setting `maxmemory-policy=allkeys-lru` makes a running RayCluster less likely to be evicted by Redis.
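One way to observe this from the Redis side: with an LRU `maxmemory-policy`, Redis tracks an approximate per-key idle time. A sketch, assuming the cluster's external storage namespace (and hence its key) is `raycluster-1`:

```python
import redis

r = redis.from_url("redis://localhost:6379")  # placeholder address

# Seconds since the key was last touched. A running RayCluster keeps
# writing to its key, so this stays near zero; a deleted cluster's
# leftover key only ages and becomes the preferred LRU eviction victim.
print(r.object("idletime", "raycluster-1"))
```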

@rueian (Contributor, author) commented Nov 7, 2023

> Would you mind providing a YAML file in your PR description and sharing more details about the expected behavior?

Sure. I recorded an experiment based on the example here: https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#attempt-2-autoscale-driver. I will come up with an example YAML later.

To better demonstrate the expected behavior, I first wrote a small Ray program to figure out how to fill up GCS usage.

Based on my observations on the Redis side, Ray GCS stores all its information in one hash, and among all hash members, the `<namespace>@KV:@namespace_fun:ActorClass:*` entries can take up most of the Redis memory usage, since they are pickled actor definitions.

I used the following program to verify the idea:

```python
import os

import ray
import redis

def new_actor(n):
    # Each call defines a brand-new actor class embedding n bytes of data,
    # so GCS stores another ~n-byte pickled definition.
    @ray.remote(num_cpus=0)
    class MyActor:
        data = bytes(n)

    return MyActor

if __name__ == "__main__":
    redis_address = os.getenv("RAY_REDIS_ADDRESS")  # e.g. redis://localhost:6379
    redis_client = redis.from_url(redis_address)

    ray.init()

    print(redis_client.memory_usage("default", 0))
    for _ in range(30):
        actor = new_actor(1024**2)  # a new ~1MB actor definition
        for _ in range(100):
            actor.remote()  # register one more replica of this definition
            print(redis_client.memory_usage("default", 0))
```

This program defines 30 actor classes, each taking about 1MB, and then starts 100 replicas of each. It prints the Redis memory usage after each replica is registered with `actor.remote()`. The result was:

(Plot: Redis memory usage climbs in ~1MB steps, one step per newly registered actor definition, staying flat across the 100 replicas of each definition.)

As you can see from the plot, the memory usage jumps up by 1MB whenever a new actor definition is registered. This result shows that users should take their actor definitions, not the number of replicas, into consideration when estimating how much Redis memory they need.
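To see where the memory goes, one can roughly break the GCS hash down by field. A sketch, assuming the storage namespace key is `default` as in the program above; summing value lengths only approximates the true allocation:

```python
import redis

r = redis.from_url("redis://localhost:6379")  # placeholder address

total = 0
actor_class_bytes = 0
# Walk every field of the GCS hash and attribute its value size.
for field, value in r.hscan_iter("default"):
    total += len(value)
    if b"ActorClass" in field:  # pickled actor definitions
        actor_class_bytes += len(value)

print(f"ActorClass entries hold {actor_class_bytes} of ~{total} bytes")
```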

Expected behavior of `maxmemory-policy=allkeys-lru`

Then, I will use the following YAML to demonstrate the expected behavior:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:latest
          command:
            - "redis-server"
            - "--bind"
            - "0.0.0.0"
            - "--port"
            - "6379"
            - "--protected-mode"
            - "no"
            - "--maxmemory"
            - "60mb"
            - "--maxmemory-policy"
            - "allkeys-lru"
          ports:
            - containerPort: 6379
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true"
    ray.io/external-storage-namespace: "raycluster-1"
  name: raycluster-1
spec:
  rayVersion: '2.7.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.7.0
            env:
              - name: RAY_REDIS_ADDRESS
                value: redis://redis:6379
            volumeMounts:
              - mountPath: /home/ray/samples
                name: ray-samples
        volumes:
          - name: ray-samples
            configMap:
              name: ray-samples
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-samples
data:
  make_actors_30MB.py: |
    import ray

    def new_actor(n):
        @ray.remote(num_cpus=0)
        class MyActor:
            data = bytes(n)
        return MyActor

    ray.init(namespace="default_namespace")
    [new_actor(1024**2).options(name=str(i), lifetime="detached").remote() for i in range(30)]
```

This YAML spins up a Redis server with `maxmemory=60mb` and `maxmemory-policy=allkeys-lru`, and a `raycluster-1` with a `make_actors_30MB.py` script, modified from the previous program, mounted under `/home/ray/samples`.

Here are the demonstration procedures:

  1. Apply the above YAML to a Kubernetes cluster whose KubeRay operator runs with `ENABLE_GCS_FT_REDIS_CLEANUP=false`.
  2. Fill Redis up by running
    `kubectl exec -it raycluster-1-head-oooxx -- python /home/ray/samples/make_actors_30MB.py`
  3. Delete the cluster by running
    `kubectl delete raycluster raycluster-1`
  4. Print the Redis memory usage of `raycluster-1`. It should be ~39MB.
    `kubectl exec -it deploy/redis -- redis-cli MEMORY USAGE raycluster-1 SAMPLES 0`
  5. Replace every `raycluster-1` in the YAML with `raycluster-2`, and apply it to the KubeRay operator again.
  6. Fill Redis up again by running the same program on the new cluster:
    `kubectl exec -it raycluster-2-head-oooxx -- python /home/ray/samples/make_actors_30MB.py`
  7. Print the Redis memory usage of `raycluster-1` again. It should now be nil, because the key was evicted by `maxmemory-policy=allkeys-lru`. (A verification sketch follows this list.)
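A sketch for double-checking step 7, assuming the Redis Service address from the YAML above: `EXISTS` confirms the old key is gone, and the `evicted_keys` counter in `INFO stats` increments whenever `maxmemory-policy` removes a key:

```python
import redis

r = redis.from_url("redis://redis:6379")  # the Service from the YAML above

# 0 means the raycluster-1 key has been evicted.
print(r.exists("raycluster-1"))

# Total number of keys Redis has evicted under maxmemory-policy.
print(r.info("stats")["evicted_keys"])
```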

@kevin85421 self-assigned this on Nov 15, 2023
@kevin85421 (Member) left a comment:
Great! Thank you for the detailed explanations!

@angelinalg (Contributor) left a comment:

Made some comments to improve clarity. Let me know if you have questions.

@@ -310,6 +310,20 @@ Refer to [this section](kuberay-external-storage-namespace-example) in the earli

* `ENABLE_GCS_FT_REDIS_CLEANUP`: The feature gate `ENABLE_GCS_FT_REDIS_CLEANUP` is true by default, and users can turn it off by setting the environment variable in [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).

```{admonition} Setup Key Eviction on Redis
If users turn `ENABLE_GCS_FT_REDIS_CLEANUP` off but still want GCS metadata to be removed automatically,
Comment from a Contributor:
Suggested change:
- If users turn `ENABLE_GCS_FT_REDIS_CLEANUP` off but still want GCS metadata to be removed automatically,
+ If you disable `ENABLE_GCS_FT_REDIS_CLEANUP` but still want Redis to remove GCS metadata automatically,

Comment from a Contributor:

Is this what you mean?

@rueian (Contributor, author) replied:

Yes, it is. Thank you for the clarification; I think removing the word “still” will be better.

@@ -310,6 +310,20 @@ Refer to [this section](kuberay-external-storage-namespace-example) in the earli

* `ENABLE_GCS_FT_REDIS_CLEANUP`: The feature gate `ENABLE_GCS_FT_REDIS_CLEANUP` is true by default, and users can turn it off by setting the environment variable in [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).
Comment from a Contributor:

Suggested change:
- * `ENABLE_GCS_FT_REDIS_CLEANUP`: The feature gate `ENABLE_GCS_FT_REDIS_CLEANUP` is true by default, and users can turn it off by setting the environment variable in [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).
+ * `ENABLE_GCS_FT_REDIS_CLEANUP`: True by default. You can turn this feature off by setting the environment variable in the [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).

Comment from a Contributor:

What is the difference between disabling this feature and turning it off by setting the environment variable in the Helm chart?

@rueian (Contributor, author) replied:

Setting the environment variable in the Helm chart is the only way to disable the feature if you deploy KubeRay with Helm.
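(For a quick experiment, the flag can also be set directly on the operator's Deployment, e.g. `kubectl set env deployment/kuberay-operator ENABLE_GCS_FT_REDIS_CLEANUP=false`, assuming the default `kuberay-operator` deployment name; note that a later `helm upgrade` would overwrite such a change.)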

…CLEANUP=false`

Signed-off-by: Rueian <rueiancsie@gmail.com>

@rueian force-pushed the doc-kuberay-redis-eviction branch from a61d2dd to 7ab0a7a on November 16, 2023
@rueian (Contributor, author) commented Nov 16, 2023

> Made some comments to improve clarity. Let me know if you have questions.

All suggestions applied. Thank you @angelinalg!

@kevin85421 (Member) commented:
The docs/readthedocs.com:anyscale-ray check failed due to a warning. The warning seems to be unrelated to this PR. @angelinalg do you have any idea? Thanks!

@architkulkarni (Contributor) left a comment:
Looks good to me; I'm fine with merging this!

Comment on lines 315 to 319:

> set these two options on Redis:
>
> * `maxmemory=<your_memory_limit>`
> * `maxmemory-policy=allkeys-lru`

Comment from a Contributor:

Nit: Where exactly do you write these options? (Is it redis.conf?) It might be good to be totally explicit, for beginners.

Comment from a Member:

Maybe @rueian can add a link to #40949 (comment) in the PR?

Comment from a Contributor:

Thanks for the info! I think this is important to include then. (A nontrivial fraction of users may get stuck or confused without it.)

@architkulkarni (Contributor) commented:
> The docs/readthedocs.com:anyscale-ray check failed due to a warning. The warning seems to be unrelated to this PR. @angelinalg do you have any idea? Thanks!

I guess this is the warning you're talking about: https://buildkite.com/ray-project/premerge/builds/11949#018bd7f5-dc11-4198-8ba4-e9e9b175a5ad/6-102

```
WARNING: failed to reach any of the inventories with the following issues:

intersphinx inventory 'https://docs.scipy.org/doc/scipy/objects.inv' not fetchable due to <class 'requests.exceptions.SSLError'>: HTTPSConnectionPool(host='docs.scipy.org', port=443): Max retries exceeded with url: /doc/scipy/objects.inv (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)')))
```

I'm guessing it's a transient error, because premerge seems to be passing on master. I'll merge master to restart CI

@rueian force-pushed the doc-kuberay-redis-eviction branch from bc07b1f to 63723db on November 16, 2023
Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian force-pushed the doc-kuberay-redis-eviction branch from 63723db to 4fa56b9 on November 17, 2023
@rueian (Contributor, author) commented Nov 17, 2023

> I'm guessing it's a transient error, because premerge seems to be passing on master. I'll merge master to restart CI

Hi @architkulkarni, thank you for merging the master branch. Your suggestion is also applied.

@architkulkarni merged commit 9a34839 into ray-project:master on Nov 17, 2023
2 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
…CLEANUP=false` (ray-project#40949)

As discussed with @kevin85421, it would be better if we could provide a guide as well as a warning in the documentation about using Redis native eviction instead of KubeRay's Redis cleanup.

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>