
[doc][monitoring] Using existing Grafana instances. #31633

Closed
wants to merge 109 commits into from

Conversation

tbukic
Contributor

@tbukic tbukic commented Jan 12, 2023

Why are these changes needed?

Added documentation on using existing Grafana instances to the monitoring docs, as suggested on the Slack channel.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@tbukic tbukic requested a review from a team as a code owner January 12, 2023 10:44
BukicTomislav-AVLHR and others added 29 commits January 12, 2023 12:10
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…pu_4x4 (#31402)

The benchmark currently times out, because a single run takes over 16 minutes. This is a regression compared to e.g. the 2.0.0 release, where a run took only 4 minutes.

Upon closer investigation, this seems to be related to severe underutilization of the GPU. With a small batch size, we are bound by compute/data iteration, and this seems to have been regressed in later tensorflow versions.

To achieve shorter training times, we increase the global batch size from 64 (which is tiny) to 1024. This significantly speeds up training (even though the GPUs are still underutilized, at <10% utilization).
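As a rough sketch of the arithmetic behind the change (the replica count below is an assumption for illustration, not a value from the commit), the global batch size is split evenly across data-parallel workers:

```python
# Illustrative arithmetic only; num_replicas is an assumed value, not from the commit.
num_replicas = 16            # hypothetical number of data-parallel workers
global_batch_size = 1024     # increased from 64 in this change
per_replica_batch = global_batch_size // num_replicas
# Each replica now processes 64 samples per step instead of 4, so each step
# does far more useful GPU work and far fewer steps are needed per epoch.
print(per_replica_batch)  # 64
```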

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: rickyyx <rickyx@anyscale.com>

Why are these changes needed?
With the new task backends, there is now a delay, since data is pushed to GCS asynchronously.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
The Cython version usually gets upgraded with new Python versions, so this shows up in PRs like #21244; removing the unnecessary pin as per #21244 (comment).

I believe this was first introduced in this PR #16469.
Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…ed (#31230)

Add plumbing for raylet to inform whether the worker that died should be retried. This is a no-op change for now, to be used in a follow up PR once we implement group-by-owner-id policy that will report task failure on deadlock.

change oom killer to expose a bit on whether to retry the task. A future policy that can detect deadlock will set this bit to false. For now, the current policy will set this bit to true and allow the task to use its retry.
if the task manager fetches this bit it will fail the task immediately, ignoring available retry.
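The plumbing described above can be sketched as follows; all names here are hypothetical stand-ins written in Python for clarity, not Ray's actual C++ raylet/task-manager code:

```python
# Hypothetical sketch of the retry bit; names are illustrative, not Ray's API.
from dataclasses import dataclass

@dataclass
class WorkerDeathInfo:
    oom_killed: bool
    should_retry: bool = True  # current policy: always allow the retry

def handle_task_failure(info: WorkerDeathInfo, retries_left: int) -> str:
    if not info.should_retry:
        # A future deadlock-detecting policy sets the bit to False:
        # fail immediately, ignoring any available retries.
        return "fail"
    return "retry" if retries_left > 0 else "fail"

print(handle_task_failure(WorkerDeathInfo(oom_killed=True), retries_left=1))  # retry
```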

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>

This PR fixes a bug in Discretizer Preprocessors where all of the columns in a Dataset would be transformed instead of just the specified ones. This led to exceptions due to KeyErrors later.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…31203)

When you resume training a TensorFlow model, you may need to create an unnecessary lambda to load model weights. This is because TensorflowCheckpoint.get_model expects a model definition, but you may have already constructed your model. This PR improves the UX by letting users directly pass a model to get_model.
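The UX change can be sketched like this; `Checkpoint` and `get_model` below are simplified hypothetical stand-ins, not Ray AIR's actual implementation:

```python
# Hypothetical sketch of the described UX change; not Ray AIR's real code.
class Model:
    def __init__(self):
        self.weights = None

class Checkpoint:
    def __init__(self, weights):
        self._weights = weights

    def get_model(self, model_or_factory):
        # Old behavior: only a zero-argument factory was accepted, forcing
        # callers who already had a model to wrap it in a lambda.
        # New behavior: an already-constructed model may be passed directly.
        model = model_or_factory() if callable(model_or_factory) else model_or_factory
        model.weights = self._weights
        return model

ckpt = Checkpoint(weights=[1.0, 2.0])
m1 = ckpt.get_model(Model())           # new: pass the constructed model directly
m2 = ckpt.get_model(lambda: Model())   # old lambda style still works
```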

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: amogkam <amogkamsetty@yahoo.com>

Additional codeowners were added to ray/data/preprocessors package in #30367.

This is a followup to add the same codeowners for preprocessors tests as well.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
MultiHotEncoder doesn't work with Arrow datasets. This PR updates the MultiHotEncoder implementation to fix the issue.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
BatchPredictor._determine_preprocessor_batch_format has a special case for Chain, where it obtains the first preprocessor contained within Chain in order to determine which transformation function to use for the dataset (as Chain doesn't implement any). However, with #29706 allowing for nested Chains, this check fails if the first preprocessor within a Chain is also a Chain.

This PR removes this special case from BatchPredictor and instead subclasses _determine_transform_to_use inside Chain itself, which allows us to both deal with the nested Chain case and improve code structure.
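The fix can be illustrated with a small sketch (not Ray's code) of why moving the decision into `Chain` itself handles the nested case that the special case in `BatchPredictor` missed:

```python
# Illustrative sketch only; class and method names are simplified stand-ins.
class Preprocessor:
    def determine_transform(self):
        return "pandas"  # stand-in for the real batch-format decision

class Chain(Preprocessor):
    def __init__(self, *stages):
        self.stages = stages

    def determine_transform(self):
        # Delegate to the first stage; if that stage is itself a Chain,
        # recursion keeps unwrapping until a concrete preprocessor is found.
        return self.stages[0].determine_transform()

nested = Chain(Chain(Preprocessor()), Preprocessor())
print(nested.determine_transform())  # "pandas", even though the first stage is a Chain
```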

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Add the initial operator implementations.

This is split out from #30903

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
If a trial reports a metric value of `np.nan` or similar (which is not vanilla Python `None`), any comparison made between that value and any other value will return false, thus leading to the `nan` value being considered the best value for the purpose of determining best trial in `TuneReporterBase` if it was reported first.

```
Trial 1 -> np.nan
Trial 2 -> 1

1 > np.nan  # False, so np.nan remains the "best" trial
```

To guard against that issue, we use `pd.isnull` check which is more general than the `is None` check used previously.
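A minimal runnable sketch of the failure mode and the `pd.isnull` guard (simplified from the actual reporter logic):

```python
import numpy as np
import pandas as pd

# NaN fails every ordering comparison, so a NaN reported first can never be
# displaced by a plain `>` check.
best = np.nan        # Trial 1 reported NaN first
candidate = 1.0      # Trial 2
assert not (candidate > best)  # comparison is False: NaN would wrongly stay "best"

# The fix: treat null-like values (None, np.nan, pd.NA) as never-best.
if pd.isnull(best) or candidate > best:
    best = candidate
assert best == 1.0
```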

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
This is a small change. Our current contribution guide doesn't mention how to propose features / larger changes. I spotted the issue when processing https://discuss.ray.io/t/using-ray-over-infiniband/5883/9

Signed-off-by: Zhe Zhang <zhz@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…31423)

This PR is a follow-up to #31231 to save checkpoints to the correctly indexed directory upon restore. The "latest checkpoint ID" that's used to generate the next checkpoint directory (`checkpoint_0000<latest_checkpoint_id>`) is off by one when restoring an AIR trainer.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
The issue seems to have been caused by Ray tasks/actors sometimes being kept alive between fit calls before garbage collection kicks in to kill them. This caused the ray.available_resources() call in TunerInternal._maybe_warn_resource_contention to return fewer CPUs than the test expected. This has been fixed by explicitly calling gc.collect() between fit calls.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
… REST API (#31379)

When users submit a malformed Serve config (e.g. with invalid fields) via `PUT` request, they receive a 500 error. This change makes the REST API return a 400 (unretryable) error instead.
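A minimal sketch of the idea (not Serve's actual handler; the field names below are illustrative): validate the config up front and map a malformed request to 400 rather than letting the error surface as a 500.

```python
# Illustrative sketch only; ALLOWED_FIELDS and put_config are hypothetical.
ALLOWED_FIELDS = {"applications", "http_options"}

def put_config(raw_config):
    unknown = set(raw_config) - ALLOWED_FIELDS
    if unknown:
        # 400: the request itself is invalid, so retrying it cannot succeed.
        return 400, f"invalid fields: {sorted(unknown)}"
    return 200, "ok"

status, body = put_config({"aplications": []})  # note the typo in the field name
print(status)  # 400
```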

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…ys (#31076)

When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses ArrowTensorArray instead of ArrowVariableShapedTensorArray as the underlying schema type. This PR refactors existing logic for schema unification into a separate function, which is now called during Arrow table concatenation and schema fetching to correct type promotion involving ragged arrays.
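The unification rule can be sketched as follows (greatly simplified, not Ray's implementation): uniform element shapes keep the fixed-shape tensor type, while mixed shapes promote to the variable-shaped type.

```python
# Hedged sketch of schema unification for tensor columns; names are illustrative.
def unify_tensor_type(element_shapes):
    first = element_shapes[0]
    if all(shape == first for shape in element_shapes):
        return ("fixed_shape", first)       # stand-in for ArrowTensorArray
    return ("variable_shape", None)         # stand-in for ArrowVariableShapedTensorArray

print(unify_tensor_type([(2, 2), (2, 2)]))  # ('fixed_shape', (2, 2))
print(unify_tensor_type([(2, 2), (3, 1)]))  # ('variable_shape', None)
```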

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Closes #30403

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…not just ready ones." (#31445)

This reverts commit 77c9df3.

Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…y won't run (#31444)

Signed-off-by: Jun Gong <jungong@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
@rkooo567
Contributor

There are huge conflicts. Maybe you didn't add your commits on top of master?

@rkooo567 rkooo567 self-assigned this Jan 13, 2023
@tbukic
Contributor Author

tbukic commented Jan 13, 2023

@rkooo567 initially I changed only one file, but didn't commit with -s. This happened after executing the instructions proposed in the DCO tab to fix the sign-off:

git rebase HEAD~2 --signoff
git push --force-with-lease origin master

Should I maybe do everything from the start to make a merge look simpler?

@rkooo567
Contributor

Should I maybe do everything from the start to make a merge look simpler?

Yeah, that sounds better to me! You can just check out master and add your change there.

@tbukic
Contributor Author

tbukic commented Jan 13, 2023

OK, thank you @rkooo567 . I'll do this again in the correct way now that I know the procedure :)

@tbukic tbukic closed this Jan 13, 2023
rkooo567 pushed a commit that referenced this pull request Jan 23, 2023
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>

Added documentation on using existing Grafana instances to the monitoring docs, as suggested on the Slack channel.

Fixed version of PR #31633