
[doc][monitoring] Using existing Grafana instances. #31633

Closed
wants to merge 109 commits into from

Conversation

tbukic
Contributor

@tbukic tbukic commented Jan 12, 2023

Why are these changes needed?

Added documentation on using existing Grafana instances to the monitoring docs, as suggested on the Slack channel.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@tbukic tbukic requested a review from a team as a code owner January 12, 2023 10:44
BukicTomislav-AVLHR and others added 29 commits January 12, 2023 12:10
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…pu_4x4 (#31402)

The benchmark currently times out, because a single run takes over 16 minutes. This is a regression compared to e.g. the 2.0.0 release, where a run took only 4 minutes.

Upon closer investigation, this seems to be related to severe underutilization of the GPU. With a small batch size, we are bound by compute/data iteration, and this seems to have been regressed in later tensorflow versions.

To achieve shorter training times, we increase the global batch size from 64 (which is tiny) to 1024. This significantly speeds up training (even though the GPUs are still underutilized, at <10% utilization).
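As a rough sketch of the arithmetic behind the change (the replica count below is an assumption for illustration, not a value from the commit), the global batch size is split evenly across data-parallel workers:

```python
# Illustrative arithmetic only; num_replicas is an assumed value, not from the commit.
num_replicas = 16            # hypothetical number of data-parallel workers
global_batch_size = 1024     # increased from 64 in this change
per_replica_batch = global_batch_size // num_replicas
# Each replica now processes 64 samples per step instead of 4, so each step
# does far more useful GPU work and far fewer steps are needed per epoch.
print(per_replica_batch)  # 64
```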

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>

Why are these changes needed?
With the new task backends, there is now a delay, since data is pushed to GCS asynchronously.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
The Cython version usually gets upgraded with new Python versions, so this shows up in PRs like #21244; removing the unnecessary pin as per #21244 (comment).

I believe this was first introduced in this PR #16469.
Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…ed (#31230)

Add plumbing for raylet to inform whether the worker that died should be retried. This is a no-op change for now, to be used in a follow up PR once we implement group-by-owner-id policy that will report task failure on deadlock.

change oom killer to expose a bit on whether to retry the task. A future policy that can detect deadlock will set this bit to false. For now, the current policy will set this bit to true and allow the task to use its retry.
if the task manager fetches this bit it will fail the task immediately, ignoring available retry.
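The plumbing described above can be sketched as follows; all names here are hypothetical stand-ins written in Python for clarity, not Ray's actual C++ raylet/task-manager code:

```python
# Hypothetical sketch of the retry bit; names are illustrative, not Ray's API.
from dataclasses import dataclass

@dataclass
class WorkerDeathInfo:
    oom_killed: bool
    should_retry: bool = True  # current policy: always allow the retry

def handle_task_failure(info: WorkerDeathInfo, retries_left: int) -> str:
    if not info.should_retry:
        # A future deadlock-detecting policy sets the bit to False:
        # fail immediately, ignoring any available retries.
        return "fail"
    return "retry" if retries_left > 0 else "fail"

print(handle_task_failure(WorkerDeathInfo(oom_killed=True), retries_left=1))  # retry
```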

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

This PR fixes a bug in Discretizer Preprocessors where all of the columns in a Dataset would be transformed instead of just the specified ones. This led to exceptions due to KeyErrors later.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…31203)

When you resume training a TensorFlow model, you may need to create an unnecessary lambda to load model weights. This is because TensorflowCheckpoint.get_model expects a model definition, but you may have already constructed your model. This PR improves the UX by letting users directly pass a model to get_model.
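The UX change can be sketched like this; `Checkpoint` and `get_model` below are simplified hypothetical stand-ins, not Ray AIR's actual implementation:

```python
# Hypothetical sketch of the described UX change; not Ray AIR's real code.
class Model:
    def __init__(self):
        self.weights = None

class Checkpoint:
    def __init__(self, weights):
        self._weights = weights

    def get_model(self, model_or_factory):
        # Old behavior: only a zero-argument factory was accepted, forcing
        # callers who already had a model to wrap it in a lambda.
        # New behavior: an already-constructed model may be passed directly.
        model = model_or_factory() if callable(model_or_factory) else model_or_factory
        model.weights = self._weights
        return model

ckpt = Checkpoint(weights=[1.0, 2.0])
m1 = ckpt.get_model(Model())           # new: pass the constructed model directly
m2 = ckpt.get_model(lambda: Model())   # old lambda style still works
```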

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>

Additional codeowners were added to ray/data/preprocessors package in #30367.

This is a followup to add the same codeowners for preprocessors tests as well.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
MultiHotEncoder doesn't work with Arrow datasets. This PR updates the MultiHotEncoder implementation to fix the issue.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
BatchPredictor._determine_preprocessor_batch_format has a special case for Chain, where it obtains the first preprocessor contained within Chain in order to determine which transformation function to use for the dataset (as Chain doesn't implement any). However, with #29706 allowing for nested Chains, this check fails if the first preprocessor within a Chain is also a Chain.

This PR removes this special case from BatchPredictor and instead subclasses _determine_transform_to_use inside Chain itself, which allows us to both deal with the nested Chain case and improve code structure.
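The fix can be illustrated with a small sketch (not Ray's code) of why moving the decision into `Chain` itself handles the nested case that the special case in `BatchPredictor` missed:

```python
# Illustrative sketch only; class and method names are simplified stand-ins.
class Preprocessor:
    def determine_transform(self):
        return "pandas"  # stand-in for the real batch-format decision

class Chain(Preprocessor):
    def __init__(self, *stages):
        self.stages = stages

    def determine_transform(self):
        # Delegate to the first stage; if that stage is itself a Chain,
        # recursion keeps unwrapping until a concrete preprocessor is found.
        return self.stages[0].determine_transform()

nested = Chain(Chain(Preprocessor()), Preprocessor())
print(nested.determine_transform())  # "pandas", even though the first stage is a Chain
```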

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Add the initial operator implementations.

This is split out from #30903

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
If a trial reports a metric value of `np.nan` or similar (which is not vanilla Python `None`), any comparison made between that value and any other value will return false, thus leading to the `nan` value being considered the best value for the purpose of determining best trial in `TuneReporterBase` if it was reported first.

```
Trial 1 -> np.nan
Trial 2 -> 1

1 > np.nan  # False, so np.nan remains the "best" trial
```

To guard against that issue, we use `pd.isnull` check which is more general than the `is None` check used previously.
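A minimal runnable sketch of the failure mode and the `pd.isnull` guard (simplified from the actual reporter logic):

```python
import numpy as np
import pandas as pd

# NaN fails every ordering comparison, so a NaN reported first can never be
# displaced by a plain `>` check.
best = np.nan        # Trial 1 reported NaN first
candidate = 1.0      # Trial 2
assert not (candidate > best)  # comparison is False: NaN would wrongly stay "best"

# The fix: treat null-like values (None, np.nan, pd.NA) as never-best.
if pd.isnull(best) or candidate > best:
    best = candidate
assert best == 1.0
```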

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
This is a small change. Our current contribution guide doesn't mention how to propose features / larger changes. I spotted the issue when processing https://discuss.ray.io/t/using-ray-over-infiniband/5883/9

Signed-off-by: Zhe Zhang <zhz@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…31423)

This PR is a follow-up to #31231 to save checkpoints to the correctly indexed directory upon restore. The "latest checkpoint ID" that's used to generate the next checkpoint directory (`checkpoint_0000<latest_checkpoint_id>`) is off by one when restoring an AIR trainer.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
The issue seems to have been caused by Ray tasks/actors sometimes being kept alive between fit calls before garbage collection kicks in to kill them. This caused the ray.available_resources() call in TunerInternal._maybe_warn_resource_contention to return fewer CPUs than the test expected. This has been fixed by explicitly calling gc.collect() between fit calls.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
… REST API (#31379)

When users submit a malformed Serve config (e.g. with invalid fields) via `PUT` request, they receive a 500 error. This change makes the REST API return a 400 (unretryable) error instead.
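A minimal sketch of the idea (not Serve's actual handler; the field names below are illustrative): validate the config up front and map a malformed request to 400 rather than letting the error surface as a 500.

```python
# Illustrative sketch only; ALLOWED_FIELDS and put_config are hypothetical.
ALLOWED_FIELDS = {"applications", "http_options"}

def put_config(raw_config):
    unknown = set(raw_config) - ALLOWED_FIELDS
    if unknown:
        # 400: the request itself is invalid, so retrying it cannot succeed.
        return 400, f"invalid fields: {sorted(unknown)}"
    return 200, "ok"

status, body = put_config({"aplications": []})  # note the typo in the field name
print(status)  # 400
```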

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…ys (#31076)

When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses ArrowTensorArray instead of ArrowVariableShapedTensorArray as the underlying schema type. This PR refactors existing logic for schema unification into a separate function, which is now called during Arrow table concatenation and schema fetching to correct type promotion involving ragged arrays.
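The unification rule can be sketched as follows (greatly simplified, not Ray's implementation): uniform element shapes keep the fixed-shape tensor type, while mixed shapes promote to the variable-shaped type.

```python
# Hedged sketch of schema unification for tensor columns; names are illustrative.
def unify_tensor_type(element_shapes):
    first = element_shapes[0]
    if all(shape == first for shape in element_shapes):
        return ("fixed_shape", first)       # stand-in for ArrowTensorArray
    return ("variable_shape", None)         # stand-in for ArrowVariableShapedTensorArray

print(unify_tensor_type([(2, 2), (2, 2)]))  # ('fixed_shape', (2, 2))
print(unify_tensor_type([(2, 2), (3, 1)]))  # ('variable_shape', None)
```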

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Closes #30403

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…not just ready ones." (#31445)

This reverts commit 77c9df3.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…y won't run (#31444)

Signed-off-by: Jun Gong <jungong@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
@rkooo567
Contributor

There are huge conflicts. Maybe you didn't add your commits on top of master?

@rkooo567 rkooo567 self-assigned this Jan 13, 2023
@tbukic
Contributor Author

tbukic commented Jan 13, 2023

@rkooo567 initially I changed only one file, but didn't commit with -s. This happened after executing the instructions proposed in the DCO tab to fix the sign-off:

git rebase HEAD~2 --signoff
git push --force-with-lease origin master

Should I maybe do everything from the start to make a merge look simpler?

@rkooo567
Contributor

Should I maybe do everything from the start to make a merge look simpler?

Yeah, that sounds better to me! You can just check out master and add your change there.

@tbukic
Contributor Author

tbukic commented Jan 13, 2023

OK, thank you @rkooo567 . I'll do this again in the correct way now that I know the procedure :)

@tbukic tbukic closed this Jan 13, 2023
rkooo567 pushed a commit that referenced this pull request Jan 23, 2023
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>

Added documentation on using existing Grafana instances to the monitoring docs, as suggested on the Slack channel.

Fixed version of PR #31633