
[Doc][KubeRay]: Redis eviction suggestions when ENABLE_GCS_FT_REDIS_CLEANUP=false #40949

Merged 3 commits into ray-project:master on Nov 17, 2023

Conversation

@rueian (Contributor) commented Nov 4, 2023

Why are these changes needed?

As discussed with @kevin85421, it would be better to provide a guide, as well as a warning, in the documentation about using Redis's native eviction instead of KubeRay's Redis cleanup.

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kevin85421 (Member) left a comment:

Would you mind providing a YAML file in your PR description and sharing more details about the expected behavior?

> * `maxmemory=<your_memory_limit>`
> * `maxmemory-policy=allkeys-lru`
>
> These two options instruct Redis to delete the least recently used keys when it reaches the `maxmemory` limit.
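For illustration, these options can live in `redis.conf`, be passed as `redis-server` command-line flags (as in the YAML later in this thread), or be applied to a running server. A minimal sketch using the `redis` Python client; the `redis://localhost:6379` address is just a placeholder:

```python
import redis

# Connect to the Redis instance backing GCS fault tolerance.
r = redis.from_url("redis://localhost:6379")

# Cap memory at 60MB and evict the least recently used keys at the cap.
# Note: CONFIG SET applies at runtime only; it does not rewrite redis.conf.
r.config_set("maxmemory", "60mb")
r.config_set("maxmemory-policy", "allkeys-lru")

# Verify both settings took effect.
print(r.config_get("maxmemory*"))
```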
Comment from a Member:

I am not familiar with Redis. What's the definition of "used keys" in Redis? Will the timestamp be updated for a running RayCluster?

@rueian (Contributor, author) replied:

Yes, the access timestamp is updated whenever any Redis operation touches the key.

A RayCluster stores all its metadata in one Redis key and keeps updating it; therefore, setting `maxmemory-policy=allkeys-lru` makes a running RayCluster less likely to be evicted by Redis.
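One way to observe this from the Redis side: with an LRU `maxmemory-policy`, Redis tracks an approximate per-key idle time. A sketch, assuming the cluster's external storage namespace (and hence its key) is `raycluster-1`:

```python
import redis

r = redis.from_url("redis://localhost:6379")  # placeholder address

# Seconds since the key was last touched. A running RayCluster keeps
# writing to its key, so this stays near zero; a deleted cluster's
# leftover key only ages and becomes the preferred LRU eviction victim.
print(r.object("idletime", "raycluster-1"))
```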

@rueian (Contributor, author) commented Nov 7, 2023

> Would you mind providing a YAML file in your PR description and sharing more details about the expected behavior?

Sure. I recorded an experiment based on the example here: https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#attempt-2-autoscale-driver. I will come up with an example YAML later.

To better demonstrate the expected behavior, I first wrote a small Ray program to figure out how to fill up GCS usage.

Based on my observations on the Redis side, Ray GCS stores all its information in one hash, and among all hash members, the `<namespace>@KV:@namespace_fun:ActorClass:*` entries can take up most of the Redis memory usage, since they are pickled actor definitions.

I used the following program to verify the idea:

```python
import os

import ray
import redis

def new_actor(n):
    # Each call defines a brand-new actor class embedding n bytes of data,
    # so GCS stores another ~n-byte pickled definition.
    @ray.remote(num_cpus=0)
    class MyActor:
        data = bytes(n)

    return MyActor

if __name__ == "__main__":
    redis_address = os.getenv("RAY_REDIS_ADDRESS")  # e.g. redis://localhost:6379
    redis_client = redis.from_url(redis_address)

    ray.init()

    print(redis_client.memory_usage("default", 0))
    for _ in range(30):
        actor = new_actor(1024**2)  # a new ~1MB actor definition
        for _ in range(100):
            actor.remote()  # register one more replica of this definition
            print(redis_client.memory_usage("default", 0))
```

This program defines 30 actor classes, each taking about 1MB, and then starts 100 replicas of each. It prints the Redis memory usage after each replica is registered with `actor.remote()`. The result was:

(Plot: Redis memory usage climbs in ~1MB steps, one step per newly registered actor definition, staying flat across the 100 replicas of each definition.)

As you can see from the plot, the memory usage jumps up by 1MB whenever a new actor definition is registered. This result shows that users should take their actor definitions, not the number of replicas, into consideration when estimating how much Redis memory they need.
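To see where the memory goes, one can roughly break the GCS hash down by field. A sketch, assuming the storage namespace key is `default` as in the program above; summing value lengths only approximates the true allocation:

```python
import redis

r = redis.from_url("redis://localhost:6379")  # placeholder address

total = 0
actor_class_bytes = 0
# Walk every field of the GCS hash and attribute its value size.
for field, value in r.hscan_iter("default"):
    total += len(value)
    if b"ActorClass" in field:  # pickled actor definitions
        actor_class_bytes += len(value)

print(f"ActorClass entries hold {actor_class_bytes} of ~{total} bytes")
```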

Expected behavior of `maxmemory-policy=allkeys-lru`

Then, I will use the following YAML to demonstrate the expected behavior:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:latest
          command:
            - "redis-server"
            - "--bind"
            - "0.0.0.0"
            - "--port"
            - "6379"
            - "--protected-mode"
            - "no"
            - "--maxmemory"
            - "60mb"
            - "--maxmemory-policy"
            - "allkeys-lru"
          ports:
            - containerPort: 6379
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true"
    ray.io/external-storage-namespace: "raycluster-1"
  name: raycluster-1
spec:
  rayVersion: '2.7.0'
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.7.0
            env:
              - name: RAY_REDIS_ADDRESS
                value: redis://redis:6379
            volumeMounts:
              - mountPath: /home/ray/samples
                name: ray-samples
        volumes:
          - name: ray-samples
            configMap:
              name: ray-samples
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-samples
data:
  make_actors_30MB.py: |
    import ray

    def new_actor(n):
        @ray.remote(num_cpus=0)
        class MyActor:
            data = bytes(n)
        return MyActor

    ray.init(namespace="default_namespace")
    [new_actor(1024**2).options(name=str(i), lifetime="detached").remote() for i in range(30)]
```

This YAML spins up a Redis server with `maxmemory=60mb` and `maxmemory-policy=allkeys-lru`, and a `raycluster-1` with a `make_actors_30MB.py` script, modified from the previous program, mounted under `/home/ray/samples`.

Here are the demonstration procedures:

  1. Apply the above YAML to a Kubernetes cluster whose KubeRay operator runs with `ENABLE_GCS_FT_REDIS_CLEANUP=false`.
  2. Fill Redis up by running
    `kubectl exec -it raycluster-1-head-oooxx -- python /home/ray/samples/make_actors_30MB.py`
  3. Delete the cluster by running
    `kubectl delete raycluster raycluster-1`
  4. Print the Redis memory usage of `raycluster-1`. It should be ~39MB.
    `kubectl exec -it deploy/redis -- redis-cli MEMORY USAGE raycluster-1 SAMPLES 0`
  5. Replace every `raycluster-1` in the YAML with `raycluster-2`, and apply it to the KubeRay operator again.
  6. Fill Redis up again by running the same program on the new cluster:
    `kubectl exec -it raycluster-2-head-oooxx -- python /home/ray/samples/make_actors_30MB.py`
  7. Print the Redis memory usage of `raycluster-1` again. It should now be nil, because the key was evicted by `maxmemory-policy=allkeys-lru`. (A verification sketch follows this list.)
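A sketch for double-checking step 7, assuming the Redis Service address from the YAML above: `EXISTS` confirms the old key is gone, and the `evicted_keys` counter in `INFO stats` increments whenever `maxmemory-policy` removes a key:

```python
import redis

r = redis.from_url("redis://redis:6379")  # the Service from the YAML above

# 0 means the raycluster-1 key has been evicted.
print(r.exists("raycluster-1"))

# Total number of keys Redis has evicted under maxmemory-policy.
print(r.info("stats")["evicted_keys"])
```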

@kevin85421 self-assigned this on Nov 15, 2023
@kevin85421 (Member) left a comment:
Great! Thank you for the detailed explanations!

@angelinalg (Contributor) left a comment:

Made some comments to improve clarity. Let me know if you have questions.

@@ -310,6 +310,20 @@ Refer to [this section](kuberay-external-storage-namespace-example) in the earli

* `ENABLE_GCS_FT_REDIS_CLEANUP`: The feature gate `ENABLE_GCS_FT_REDIS_CLEANUP` is true by default, and users can turn it off by setting the environment variable in [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).

```{admonition} Setup Key Eviction on Redis
If users turn `ENABLE_GCS_FT_REDIS_CLEANUP` off but still want GCS metadata to be removed automatically,
Comment from a Contributor:
Suggested change:
- If users turn `ENABLE_GCS_FT_REDIS_CLEANUP` off but still want GCS metadata to be removed automatically,
+ If you disable `ENABLE_GCS_FT_REDIS_CLEANUP` but still want Redis to remove GCS metadata automatically,

Comment from a Contributor:

Is this what you mean?

@rueian (Contributor, author) replied:

Yes, it is. Thank you for the clarification; I think removing the word “still” will be better.

@@ -310,6 +310,20 @@ Refer to [this section](kuberay-external-storage-namespace-example) in the earli

* `ENABLE_GCS_FT_REDIS_CLEANUP`: The feature gate `ENABLE_GCS_FT_REDIS_CLEANUP` is true by default, and users can turn it off by setting the environment variable in [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).
Comment from a Contributor:

Suggested change:
- * `ENABLE_GCS_FT_REDIS_CLEANUP`: The feature gate `ENABLE_GCS_FT_REDIS_CLEANUP` is true by default, and users can turn it off by setting the environment variable in [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).
+ * `ENABLE_GCS_FT_REDIS_CLEANUP`: True by default. You can turn this feature off by setting the environment variable in the [KubeRay operator's Helm chart](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/values.yaml).

Comment from a Contributor:

What is the difference between disabling this feature and turning it off by setting the environment variable in the Helm chart?

@rueian (Contributor, author) replied:

Setting the environment variable in the Helm chart is the only way to disable the feature if you deploy KubeRay with Helm.
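(For a quick experiment, the flag can also be set directly on the operator's Deployment, e.g. `kubectl set env deployment/kuberay-operator ENABLE_GCS_FT_REDIS_CLEANUP=false`, assuming the default `kuberay-operator` deployment name; note that a later `helm upgrade` would overwrite such a change.)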

…CLEANUP=false`

Signed-off-by: Rueian <rueiancsie@gmail.com>

@rueian force-pushed the doc-kuberay-redis-eviction branch from a61d2dd to 7ab0a7a on November 16, 2023
@rueian (Contributor, author) commented Nov 16, 2023

> Made some comments to improve clarity. Let me know if you have questions.

All suggestions applied. Thank you @angelinalg!

@kevin85421 (Member) commented:
The docs/readthedocs.com:anyscale-ray check failed due to a warning. The warning seems to be unrelated to this PR. @angelinalg do you have any idea? Thanks!

@architkulkarni (Contributor) left a comment:
Looks good to me; I'm fine with merging this!

Comment on lines 315 to 319:

> set these two options on Redis:
>
> * `maxmemory=<your_memory_limit>`
> * `maxmemory-policy=allkeys-lru`

Comment from a Contributor:

Nit: Where exactly do you write these options? (Is it redis.conf?) It might be good to be totally explicit, for beginners.

Comment from a Member:

Maybe @rueian can add a link to #40949 (comment) in the PR?

Comment from a Contributor:

Thanks for the info! I think this is important to include then. (A nontrivial fraction of users may get stuck or confused without it.)

@architkulkarni (Contributor) commented:
> The docs/readthedocs.com:anyscale-ray check failed due to a warning. The warning seems to be unrelated to this PR. @angelinalg do you have any idea? Thanks!

I guess this is the warning you're talking about: https://buildkite.com/ray-project/premerge/builds/11949#018bd7f5-dc11-4198-8ba4-e9e9b175a5ad/6-102

```
WARNING: failed to reach any of the inventories with the following issues:

intersphinx inventory 'https://docs.scipy.org/doc/scipy/objects.inv' not fetchable due to <class 'requests.exceptions.SSLError'>: HTTPSConnectionPool(host='docs.scipy.org', port=443): Max retries exceeded with url: /doc/scipy/objects.inv (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)')))
```

I'm guessing it's a transient error, because premerge seems to be passing on master. I'll merge master to restart CI

@rueian force-pushed the doc-kuberay-redis-eviction branch from bc07b1f to 63723db on November 16, 2023
Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian force-pushed the doc-kuberay-redis-eviction branch from 63723db to 4fa56b9 on November 17, 2023
@rueian (Contributor, author) commented Nov 17, 2023

> I'm guessing it's a transient error, because premerge seems to be passing on master. I'll merge master to restart CI

Hi @architkulkarni, thank you for merging the master branch. Your suggestion is also applied.

@architkulkarni merged commit 9a34839 into ray-project:master on Nov 17, 2023
2 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Nov 29, 2023
…CLEANUP=false` (ray-project#40949)

As discussed with @kevin85421, it would be better if we could provide a guide as well as a warning in the documentation about using Redis native eviction instead of KubeRay's Redis cleanup.

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>