
Remove dev comments + reset logging levels #13

Closed · wants to merge 118 commits

Conversation


@glennmoy glennmoy commented Oct 3, 2023

No description provided.

ArturNiederfahrenhorst and others added 30 commits May 16, 2023 18:14
Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
* Update version in dask on ray guide for 2.5.0 release (ray-project#35458)

As part of the 2.5.0 release, the dask on ray version in the guide needs to be updated.

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [Dask on Ray] Attempt to fix line in dask doc (ray-project#35479)

ray-project#35458 introduced an issue with the table not being displayed.

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
…test (ray-project#35465) (ray-project#35489)

The release tests failed due to an incompatible urllib3 version. Pin urllib3 < 1.27 to fix the ml_user_ray_lightning_user_test_(master|latest).aws release test.

Signed-off-by: woshiyyya <yunxuanx@anyscale.com>
* [serve] Shutdown http proxy state (ray-project#35395)

Shutdown http proxy state so that it won't run anything in its update loop once a shutdown signal is received.
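The pattern described above can be sketched as follows. This is a minimal toy model with hypothetical names, not the actual Serve implementation: an update loop that checks a shutdown event each iteration and stops doing work once the signal is set.

```python
import asyncio

class ProxyState:
    """Toy model (hypothetical name) of an update loop that honors a shutdown signal."""

    def __init__(self):
        self._shutdown = asyncio.Event()
        self.updates = 0

    def shutdown(self):
        # Once this is set, the update loop does no further work.
        self._shutdown.set()

    async def update_loop(self):
        while not self._shutdown.is_set():
            self.updates += 1          # reconcile proxy state here
            await asyncio.sleep(0)     # yield control back to the event loop

async def main():
    state = ProxyState()
    task = asyncio.create_task(state.update_loop())
    await asyncio.sleep(0)  # let the loop run at least one iteration
    state.shutdown()
    await task              # loop exits promptly once the signal is set
    return state.updates

updates = asyncio.run(main())
```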

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

* [serve] Remove print statement + fix lint (ray-project#35439)

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

---------

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
…ect#35515)

Missing import.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
… (ray-project#35488)

- Organize HuggingFace integrations together.
   - Additional HuggingFace integrations should have a logical place to be added.
- Make it simple for users to import and use integrations.
   - Imports should not be excessively long
   - Naming should be intuitive

Signed-off-by: Matthew Deng <matt@anyscale.com>
…roject#35399) (ray-project#35464)

- Support handle.options(multiplexed_model_id="")
- Http proxy to extract model id on the fly
- Choose correct replica based on information.
- nit: Move handle metrics pusher to router.

## Why are these changes needed?
After ray-project#35123, requests in Ray are serialized, so we should be able to retry failed Redis requests.

This PR refactors the RedisContext a bit by removing the CallbackItem and introducing the RequestContext. Inside the RequestContext, a failed request is retried automatically.

If it still fails in the end, the process will just crash.

## Related issue number
ray-project#34014
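The retry behavior described above can be sketched like this. All names here (`RequestContext`, `flaky_send`) are hypothetical stand-ins, not the real C++ implementation: because the request is serialized up front, it can simply be re-sent until retries are exhausted, at which point the error surfaces.

```python
class RequestContext:
    """Toy sketch (hypothetical names): retry a serialized request automatically."""

    def __init__(self, send, max_retries=3):
        self._send = send              # callable that issues the request
        self._max_retries = max_retries

    def run(self, request):
        for attempt in range(self._max_retries + 1):
            try:
                return self._send(request)
            except ConnectionError:
                if attempt == self._max_retries:
                    raise  # still failing in the end: surface the failure
                # The request is serialized, so re-sending it is safe.

calls = {"n": 0}

def flaky_send(req):
    # Fails twice, then succeeds -- standing in for a briefly unavailable Redis.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("redis unavailable")
    return f"OK:{req}"

result = RequestContext(flaky_send).run("GET key")
```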
…ay-project#34726) (ray-project#35542)

* [core] Graceful handling of returning bundles when node is removed (ray-project#34726)

When a node dies, we do a series of cleanups for it. There were failures where the node had already been removed from the internal data structures (i.e. ClusterResourceManager.nodes_) by the time we tried to clean up bundles from it.
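The graceful-handling idea can be sketched in a few lines. This is a toy model, not the real ClusterResourceManager: the cleanup path checks whether the node still exists and silently ignores the request if it was already removed, instead of failing.

```python
# Toy model: returning a placement-group bundle to a node that may already
# have been removed from the cluster's internal data structures.
nodes = {"node-1": {"bundle-a"}}   # node id -> bundles currently on it

def return_bundle(node_id, bundle_id):
    node = nodes.get(node_id)
    if node is None:
        # The node was already cleaned up; ignore instead of crashing.
        return False
    node.discard(bundle_id)
    return True

returned = return_bundle("node-1", "bundle-a")
ignored = return_bundle("node-gone", "bundle-b")
```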

* lint

Signed-off-by: rickyyx <rickyx@anyscale.com>

---------

Signed-off-by: rickyyx <rickyx@anyscale.com>
…ay-project#35525)

From ray-project#35143, we found a map() case that is not covered in our numpy support test cases.
…5414) (ray-project#35483)

* [Data/Train] Fix ipython representation (ray-project#35414)

Fixes a bug where `repr` would fail in the IPython shell.

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

* format

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
…ject#35552)

Looking at test_torch_predictor: it is a suite of 35 test cases and takes roughly a minute to run.
Most of the time these tests finish fine, which is why the suite is merely flaky rather than consistently failing.
Make it a medium test to de-flake it.

Signed-off-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: Jun Gong <jungong@anyscale.com>
…35534) (ray-project#35554)

test_multiprocessing_client_mode is very flaky on Windows. This PR skips it on Windows.

Related issue number
Closes ray-project#35526

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
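Skipping a test on one platform is typically done with a skip decorator. The sketch below uses the stdlib `unittest.skipIf` rather than the pytest marker the Ray test suite actually uses, and the class/test names are hypothetical; the platform check is the same idea either way.

```python
import io
import sys
import unittest

class TestMultiprocessingClientMode(unittest.TestCase):
    # Skip on Windows only; everywhere else the test still runs.
    @unittest.skipIf(sys.platform == "win32", "flaky on Windows (ray-project#35526)")
    def test_client_mode(self):
        self.assertEqual(1 + 1, 2)

suite = unittest.TestLoader().loadTestsFromTestCase(TestMultiprocessingClientMode)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

A skipped test is still reported as a success, so CI stays green on Windows without losing coverage elsewhere.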
…zation. (ray-project#35494) (ray-project#35532)

When an async actor is used, we always increase the max recursion limit before posting the function to the event loop, because when there are lots of pending async tasks, Python thinks there is runaway recursion due to the large parallel callstacks (a consequence of the fibers used to implement async actors).

Running an async task has 3 steps:

1. Run a deserialization function in the event loop (ray/python/ray/_raylet.pyx, line 866 at bfec451: `args = core_worker.run_async_func_in_event_loop(`)
2. Increase the recursion limit (ray/python/ray/_raylet.pyx, line 831 at bfec451: `increase_recursion_limit()`)
3. Run the main function in the event loop.

The problem is that the limit is always increased "after" the object has been deserialized in the event loop. This means that while deserialization happens, the recursion limit is still low, which can raise the exception.

When there are lots of async tasks with high input-deserialization overhead, this can occur (we hit the max recursion error while deserializing the object, before the limit is increased). That is exactly what the 1:1 async-actor calls-with-args test does, and where the microbenchmark failed with a recursion error.

This PR fixes the issue by moving increase_recursion_limit inside run_async_func_in_event_loop, so that whenever we post a new async task, the recursion limit is checked and increased first.

I found the same issue when I developed a generator, and I verified this fixes the issue in this PR ray-project#35425 (comment).

I am not sure how to test this. @scv119 do you have the consistent repro that can run in unit tests? Or can you verify it using this branch as well?
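The ordering fix can be illustrated with a simplified sketch. This is not the real Cython code: `run_async_func_in_event_loop` here is a plain function call standing in for posting work to the event loop, and the doubling policy in `increase_recursion_limit` is an assumption for illustration. The point is only the ordering: the limit is raised before anything, including deserialization, runs.

```python
import sys

def increase_recursion_limit(factor=2):
    # Grow the interpreter's recursion limit so deep parallel callstacks
    # from many pending async tasks don't raise RecursionError.
    sys.setrecursionlimit(sys.getrecursionlimit() * factor)

def run_async_func_in_event_loop(func, *args):
    # The fix: raise the limit *here*, so it is in effect for every function
    # posted to the loop -- including argument deserialization -- instead of
    # only before the main task body runs.
    increase_recursion_limit()
    return func(*args)  # stand-in for posting `func` to the event loop

before = sys.getrecursionlimit()
run_async_func_in_event_loop(lambda: None)
after = sys.getrecursionlimit()
```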
…s to map functions"" (ray-project#35505) (ray-project#35527)

Reverts ray-project#35504

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
…y-project#35520) (ray-project#35570)

This change adds a dropdown explaining why Ray Serve is a good fit for LLM developers.

Link: https://anyscale-ray--35520.com.readthedocs.build/en/35520/serve/index.html#how-can-serve-help-me-as-a

Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
We should build the release jar on manylinux2014 to resolve incompatibilities. Refer to https://discuss.ray.io/t/java-glibc-issue-during-ray-init-call/10407.
Co-authored-by: Guyang Song <guyang.sgy@gmail.com>

Co-authored-by: Candy Lv <90018431+XiaodongLv@users.noreply.github.com>
Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
…to be fetched in release tests and CI learning tests (ray-project#35588)

* [RLlib] Fit ES and ARS results dict to rest of RLlib, enable results to be fetched in release tests and CI learning tests (ray-project#35533)

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

* [RLlib] Fix ARS release test (ray-project#35608)

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

---------

Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
…ject#35637)

* [Data] Add batch inference object detection example (ray-project#35143)

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* Fix object detection example test

Signed-off-by: Hao Chen <chenh1024@gmail.com>

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
…ject#35291) (ray-project#35656)

This PR introduces TaskManager interfaces to enable streaming generator.
…r interface. (ray-project#35324) (ray-project#35682)

This is the second PR to support streaming generator.

The detailed design and API proposal can be found from https://docs.google.com/document/d/1hAASLe2sCoay23raqxqwJdSDiJWNMcNhlTwWJXsJOU4/edit#heading=h.w91y1fgnpu0m.
The Execution plan can be found from https://docs.google.com/document/d/1hAASLe2sCoay23raqxqwJdSDiJWNMcNhlTwWJXsJOU4/edit#heading=h.kxktymq5ihf7.
There will be 4 PRs to enable streaming generator for Ray Serve (phase 1):

1. Introduce cpp interfaces to handle intermediate task return ([1/N] Streaming Generator. Cpp interfaces and implementation, ray-project#35291)
2. Support core worker APIs + cython generator interface ([2/N] Streaming Generator. Support core worker APIs + cython generator interface, ray-project#35324) <-- this PR
3. E2e integration ([3/N] Streaming Generator. E2e integration, ray-project#35325)
4. Support async actors

This PR implements the Cython generator interface that users can use to obtain the next available object reference.
---------

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
…ray-project#35673)

## Why are these changes needed?

In the old resource broadcasting, a sequence number (seq) is used; when the seq is stale, the handler returns immediately, and this is where the leak could happen: never replying to the gRPC request eventually leaks the resource.

In the ray syncer we no longer have this, but in a bad setup a misbehaving GCS might still talk to this raylet, since there are no guards right now, and send node info to this node.

That is how the leak gets triggered.

This fix does two things to protect the code:

- If it's syncer based, it'll just reject the request.
- Also fixed the bug in the old code path.

## Related issue number

ray-project#35632
ray-project#35310
ray-project#35683) (ray-project#35718)

In order to guarantee that we put ActorTaskSpecTable before ActorTable, we should put ActorTable inside the ActorTaskSpecTable put callback. Otherwise, Redis may receive ActorTable put before ActorTaskSpecTable put. If we crash in the middle, we may end up with actor data inside ActorTable but not ActorTaskSpecTable.
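The callback-nesting trick above can be sketched with a toy asynchronous store. The `put` function and table names below are simplified stand-ins for the real GCS/Redis code: issuing the second put only from the first put's completion callback guarantees the writes reach the store in order.

```python
# Toy key-value store standing in for Redis; puts complete via callbacks.
store = {}
order = []

def put(table, key, value, callback):
    store[(table, key)] = value
    order.append(table)
    callback()

def register_actor(actor_id, task_spec, actor_data):
    # Put ActorTaskSpecTable first, and only issue the ActorTable put from
    # its completion callback, so the spec always lands before the actor row.
    def on_spec_done():
        put("ActorTable", actor_id, actor_data, lambda: None)
    put("ActorTaskSpecTable", actor_id, task_spec, on_spec_done)

register_actor("a1", "task-spec", "actor-data")
```

With this ordering, a crash between the two puts can leave a spec without an actor row, but never an actor row without its spec.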

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…oject#35638) (ray-project#35703)

Closes ray-project#35586
See ray-project#35586 (comment)

NumPy treats variable-length byte data as zero-terminated bytes, so any zero bytes encoded into the bytestring itself will be discarded.

Instead, per recommendation in apache/arrow#26470, it seems that variable length bytes should be treated as python objects.
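The difference is easy to demonstrate directly (a small illustration, not Ray's actual conversion code): with the fixed-width `S` dtype, trailing zero bytes are treated as terminators and dropped on retrieval, while `dtype=object` keeps the bytes intact as Python objects.

```python
import numpy as np

# Fixed-width bytes ("S") dtype treats trailing zero bytes as terminators,
# so they are silently dropped when the element is read back.
fixed = np.array([b"a\x00"])                    # dtype becomes "|S2"

# Storing the bytes as Python objects preserves the zero bytes.
preserved = np.array([b"a\x00"], dtype=object)
```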

---------

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
kleinschmidt and others added 29 commits August 2, 2023 16:52
…ptor-fix

Support JuliaFunctionDescriptor in `==` fallback
Enable debug-level logging on the Ray backend
Signed-off-by: Dave Kleinschmidt <dave.f.kleinschmidt@gmail.com>
Spawn a Julia worker via the Raylet
* Support specifying runtime env executable

* Support specifying runtime env args

* Support specifying executable/args via runtime env

* Avoid quoting default

* Switch to using command in RuntimeEnvContext

* Use separate command for Julia

* Add TODO about switching to plugin

Co-authored-by: Dave Kleinschmidt <dave.f.kleinschmidt@gmail.com>
Signed-off-by: Curtis Vogt <curtis.vogt@gmail.com>

---------

Signed-off-by: Curtis Vogt <curtis.vogt@gmail.com>
Co-authored-by: Dave Kleinschmidt <dave.f.kleinschmidt@gmail.com>
@glennmoy glennmoy closed this Oct 3, 2023