-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[doc][monitoring] Using existing Grafana instances. #31633
Conversation
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…pu_4x4 (#31402) The benchmark currently times out, because a single run takes over 16 minutes. This is a regression compared to e.g. the 2.0.0 release, where a run took only 4 minutes. Upon closer investigation, this seems to be related to severe underutilization of the GPU. With a small batch size, we are bound by compute/data iteration, and this seems to have been regressed in later tensorflow versions. To achieve shorter training times, we increase the global batch size from 64 (which is tiny) to 1024. This severely speeds up training (even though the GPUs are still underutilized with <10% utilizaton). Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Cade Daniel <cade@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: rickyyx rickyx@anyscale.com Why are these changes needed? With the new task backends, there is now a delay since data will be pushed to GCS async. Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
The cyphon version usually gets upgraded with new python version, so this shows up in PR like #21244, removing the unnecesary pin as per #21244 (comment) I believe this was first introduced in this PR #16469. Signed-off-by: rickyyx <rickyx@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…ed (#31230) Add plumbing for raylet to inform whether the worker that died should be retried. This is a no-op change for now, to be used in a follow up PR once we implement group-by-owner-id policy that will report task failure on deadlock. change oom killer to expose a bit on whether to retry the task. A future policy that can detect deadlock will set this bit to false. For now, the current policy will set this bit to true and allow the task to use its retry. if the task manager fetches this bit it will fail the task immediately, ignoring available retry. Signed-off-by: Clarence Ng <clarence.wyng@gmail.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> This PR fixes a bug in Discretizer Preprocessors where all of the columns in a Dataset would be transformed instead of just the specified ones. This lead to exceptions due to KeyErrors later. Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Avnish <avnishnarayan@gmail.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…31203) When you resume training a TensorFlow model, you may need to create an unnecessary lambda to load model weights. This is because TensorflowCheckplint.get_model expects a model definition, but you may have already constructed your model. This PR improves the UX by letting users directly pass a model to get_model. Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: amogkam amogkamsetty@yahoo.com Additional codeowners were added to ray/data/preprocessors package in #30367. This is a followup to add the same codeowners for preprocessors tests as well. Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
MultiHotEncoder doesn't work with Arrow datasets. This PR updates the MultiHotEncoder implementation to fix the issue. Signed-off-by: Balaji Veeramani <balaji@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
BatchPredictor._determine_preprocessor_batch_format has a special case for Chain, where it obtains the first preprocessor contained within Chain in order to determine which transformation function to use for the dataset (as Chain doesn't implement any). However, with #29706 allowing for nested Chains, this check fails if the first preprocessor within a Chain is also a Chain. This PR removes this special case from BatchPredictor and instead subclasses _determine_transform_to_use inside Chain itself, which allows us to both deal with the nested Chain case and improve code structure. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Add the initial operator implementations. This is split out from #30903 Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
If a trial reports a metric value of `np.nan` or similar (which is not vanilla Python `None`), any comparison made between that value and any other value will return false, thus leading to the `nan` value being considered the best value for the purpose of determining best trial in `TuneReporterBase` if it was reported first. ``` Trial 1 -> np.nan Trial 2 -> 1 1 > np.nan == False -> np.nan is the best trial ``` To guard against that issue, we use `pd.isnull` check which is more general than the `is None` check used previously. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
This is a small change. Our current contribution guide doesn't mention how to propose features / larger changes. I spotted the issue when processing https://discuss.ray.io/t/using-ray-over-infiniband/5883/9 Signed-off-by: Zhe Zhang <zhz@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…31423) This PR is a follow-up to #31231 to save checkpoints to the correctly indexed directory upon restore. The "latest checkpoint ID" that's used to generate the next checkpoint directory (`checkpoint_0000<latest_checkpoint_id>`) is off by one when restoring an AIR trainer. Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
The issue seems to have been caused by Ray tasks / actors being sometimes kept alive between fit calls before garbage collection kicks in to kill them. This caused the ray.available_resources() call in TunerInternal._maybe_warn_resource_contention to return less CPUs available than expected by the test. This has been fixed by explicitly calling gc.collect() between fit calls. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
… REST API (#31379) When users submit a malformed Serve config (e.g. with invalid fields) via `PUT` request, they receive a 500 error. This change makes the REST API return a 400 (unretryable) error instead. Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…ys (#31076) When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses ArrowTensorArray instead of ArrowVariableShapedTensorArray as the underlying schema type. This PR refactors existing logic for schema unification into a separate function, which is now called during Arrow table concatenation and schema fetching to correct type promotion involving ragged arrays. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
…y won't run (#31444) Signed-off-by: Jun Gong <jungong@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>
There are huge conflicts. Maybe you didn't add commit on top of the mastser? |
@rkooo567 initially, I have changed only 1 file, but didn't commit with
Should I maybe do everything from the start to make a merge look simpler? |
Yeah that sounds better to me! You can just checkout to the master and add your change there. |
OK, thank you @rkooo567 . I'll do this again in the correct way now that I know the procedure :) |
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com> Added some knowledge about using existing Grafana instances to the monitoring documentation as suggested on Slack channel. Fixed version of PR #31633
Why are these changes needed?
Added some knowledge about using existing Grafana instances to the monitoring documentation as suggested on Slack channel. Documentation centers .
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.