
[wandb] Have train_loop_config logged as a config. #31790

Closed · wants to merge 28 commits

Conversation

xwjiang2010 (Contributor) commented Jan 19, 2023

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Why are these changes needed?

Have train_loop_config logged as a config in wandb UI.

This is done by surfacing Trainer._param_dict into Tuner.param_space. Currently, this is only done for train_loop_config.
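For illustration, a minimal sketch of the extraction described above, assuming a helper on the Trainer named _extract_fields_for_tuner_param_space (a paraphrase of the idea, not the exact diff, which appears later in the review thread):

import copy

# Sketch: copy only train_loop_config (when present) out of the Trainer's
# _param_dict so the Tuner can surface it in param_space and, in turn,
# the wandb callback can log it as a config.
def _extract_fields_for_tuner_param_space(self) -> dict:
    result = {}
    for key in ["train_loop_config"]:
        if key in self._param_dict:
            result[key] = copy.deepcopy(self._param_dict[key])
    return result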

Related issue number

Closes #31221

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
for key in self._param_dict.keys():
    if key in ["train_loop_config"]:
        result[key] = copy.deepcopy(self._param_dict[key])
return result
amogkam (Contributor) commented Jan 19, 2023


can this just be self._param_dict.get("train_loop_config", {})?

xwjiang2010 (Contributor, Author)

Ah, I see. I would prefer that if train_loop_config is not in the picture, we don't upload something like "train_loop_config=None or {}" to wandb.

I inverted the inner and outer loops to make it more efficient.

Also added a test to assert that train_loop_config isn't always uploaded.

Member

> we should not upload something like "train_loop_config=None or {}" to wandb.

Not sure if I'm misunderstanding this comment, but doesn't this happen with the current implementation?

Like, if your trainer looks like

TorchTrainer(
    ...
    train_loop_config=None  # Or `train_loop_config` isn't specified
)

Then "train_loop_config" in self._param_space and _extract_fields_for_tuner_param_space returns {"train_loop_config": None}?

xwjiang2010 (Contributor, Author)

Hmm, I am confused - maybe I am missing something here.

For

TorchTrainer(
    ...
    train_loop_config=None 
)

it will be uploaded as None. But I don't think that's a big deal, as the user explicitly specified it (for no good reason, IMO).

However, if it's

TorchTrainer(
    ...
)

Then it won't show up. It's not in self._param_space.

I printed self._param_space for reference:

{'train_loop_per_worker': <function test_trainer_wandb_integration.<locals>.train_func at 0x7fdb3b108a70>, 'scaling_config': ScalingConfig(num_workers=2), 'run_config': RunConfig(callbacks=[<ray.air.tests.mocked_wandb_integration.WandbTestExperimentLogger object at 0x7fdb3b103fd0>])}

Also see more in the added unit test.

Member

You're right, I misread the __new__ implementation; train_loop_config isn't included in self._param_space if it isn't explicitly specified.
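(For readers following along, a minimal sketch of the idea being discussed, using a simplified constructor; this is not Ray's actual BaseTrainer.__new__:)

class SketchTrainer:
    # Only keyword arguments the caller actually passed end up in _param_dict,
    # so an omitted train_loop_config never reaches param_space.
    def __new__(cls, **kwargs):
        obj = super().__new__(cls)
        obj._param_dict = dict(kwargs)
        return obj

assert "train_loop_config" not in SketchTrainer(scaling_config="...")._param_dict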

xwjiang2010 and others added 2 commits January 19, 2023 16:56
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
bveeramani (Member) left a comment

LGTM!

xwjiang2010 and others added 13 commits January 23, 2023 08:29
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
…tches_benchmark_single_node test (ray-project#31864)

The release test iter_tensor_batches_benchmark_single_node failed its most recent run due to the same issue discussed/addressed in ray-project#31752 and ray-project#31493 (the actual error message is: botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This PR updates one remaining test to match that convention.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
…ject#31865)

Signed-off-by: rickyyx <rickyx@anyscale.com>

We added the job id to the log prefix in ray-project#31772, breaking the test.

This fixes the test to reflect that change.
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>

Added some documentation about using existing Grafana instances to the monitoring docs, as suggested in the Slack channel.

Fixed version of PR ray-project#31633
Some Nevergrad search algorithms have required inputs, such as `budget` for the `NgOpt` search algorithm, but the NevergradSearch class currently provides no way to pass these parameters down to the search algorithm. I would propose adding something like optimizer_kwargs to NevergradSearch that gets passed to the optimizer when instantiating it.
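A hedged sketch of the proposed usage (optimizer_kwargs is the hypothetical new parameter described above; the rest follows the existing NevergradSearch API as I understand it):

import nevergrad as ng
from ray.tune.search.nevergrad import NevergradSearch

# Hypothetical: optimizer_kwargs would be forwarded to NgOpt(...) when
# NevergradSearch instantiates the optimizer, supplying its required budget.
searcher = NevergradSearch(
    optimizer=ng.optimizers.NgOpt,
    optimizer_kwargs={"budget": 100},
    metric="loss",
    mode="min",
)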

Signed-off-by: yhna <yhna@si-analytics.ai>
Signed-off-by: YH <100389977+yhna940@users.noreply.github.com>
Signed-off-by: Younghwan Na <100389977+yhna940@users.noreply.github.com>
Signed-off-by: yhna940 <yhna940@gmail.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
…ject#31858)

This reverts commit 608276b.

Looks like this breaks backward compatibility in RLlib (it is supposed to print a warning first, but it prints an info log instead).
…shed tasks (ray-project#31761)

This PR handles 2 edge cases when marking tasks as failed:

When a job finishes, tasks that are still running should be marked as failed.
Don't override a task's finished or failed timestamp when an ancestor failed.
For 1:

It adds a handler function OnJobFinished as a job-finish listener in the GcsJobManager, so when a job is marked as finished, OnJobFinished is called to mark any non-terminated tasks as failed.
For 2:

It adds an ancestor_failed_ts field to keep track of the ancestor failure time in the task tree.
This extra bit of info is necessary since we should not override the timestamps of any already failed or finished child tasks, but we also need to know whether a task subtree has already been traversed (with all non-terminated children marked as failed) without re-traversing the task tree.
When adding a new task event, if the task fails or its ancestor failed, its failed_ts or ancestor_failed_ts (respectively) is set, and we traverse into the child task tree.
During the tree traversal, when a task already has its failed_ts or ancestor_failed_ts set, its children must already have been traversed at the time that timestamp was set, so the traversal can stop there (see the sketch below).
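An illustrative Python sketch of the traversal rule described above (the real logic lives in the C++ GCS task manager; the names and structure here are assumptions):

def on_task_failed(task, failure_ts):
    # Entry point: a task just failed; record its failure and propagate to
    # descendants that have not already been handled.
    task.failed_ts = failure_ts
    for child in task.children:
        mark_ancestor_failed(child, failure_ts)

def mark_ancestor_failed(task, failure_ts):
    # Never override an already finished or failed task's own timestamps.
    if task.finished_ts is not None or task.failed_ts is not None:
        return
    # If ancestor_failed_ts is already set, this subtree was traversed when
    # that timestamp was set, so stop here.
    if task.ancestor_failed_ts is not None:
        return
    task.ancestor_failed_ts = failure_ts
    for child in task.children:
        mark_ancestor_failed(child, failure_ts)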
@@ -0,0 +1,80 @@
"""
Contributor

Can we move this into air/tests/test_integration_wandb? It's confusing to have two separate files to test the same integration.

xwjiang2010 (Contributor, Author)

Hmm, I actually find putting all tests about the wandb integration in one file kind of confusing.

My understanding of air/tests/test_integration_wandb is that it covers the logger callback as well as the mixin. Those are the fundamental primitives of the integration, thus they live in air/.

This test, on the other hand, tests the e2e story of the Train integration - i.e., whether certain fields pertaining to the Trainer actually show up in the wandb UI. (You can imagine there will be another one that tests the e2e Tune integration as well - maybe with more emphasis on the behavior of multiple trials.)

IMO, dumping all of them into a single file feels crowded.

python/ray/train/base_trainer.py (outdated review thread, resolved)
amogkam and others added 8 commits January 23, 2023 17:04
…project#31824)

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

Closes ray-project#30333.

Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize ethernet. This is to avoid this issue: anyscale/product#8310.

However, this default whitelist is not fully exhaustive, and prevents users from doing distributed GPU training over wireless: ray-project#30333.

Instead, we switch to a blacklist so that NCCL does not use the veth interface, which resolves both issues (thanks @cadedaniel for identifying this!).
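For reference, NCCL's interface selection is controlled with the NCCL_SOCKET_IFNAME environment variable, which supports an exclusion ("blacklist") syntax via a leading ^. A hedged sketch of the kind of default described above (the exact value Ray Train sets may differ):

import os

# Exclude virtual ethernet interfaces instead of whitelisting physical ones.
# "^veth" tells NCCL to skip interfaces whose names start with "veth".
os.environ.setdefault("NCCL_SOCKET_IFNAME", "^veth")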

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
…roject#31873)

We test grpcio prereleases to catch early regressions. We found that the latest prerelease is problematic (grpc/grpc#31885). To get a better signal, let's temporarily stop testing grpcio prereleases until it's fixed.
…oject#31397)

- Ray on Spark creates the Spark job using stage-level scheduling, so that the Ray cluster Spark job can use a different task resources config (spark.task.cpus / spark.task.resource.gpu.amount); otherwise it has to use the Spark application-level config, which is inconvenient on Databricks. 2 new arguments are added: num_cpus_per_node and num_gpus_per_node (a usage sketch follows after this list).

- Improve the Ray worker memory allocation computation.

- Refactor the _init_ray_cluster interface to better fit instrumentation logging patching (make arguments keyword-only, adjust some arguments, and ensure all arguments are validated values).
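A usage sketch under stated assumptions: the function names setup_ray_cluster / shutdown_ray_cluster and the exact argument names are assumptions based on the description above, not confirmed signatures.

from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# Request per-node resources directly; with stage-level scheduling these no
# longer have to match the Spark application-level task config.
setup_ray_cluster(
    num_worker_nodes=2,
    num_cpus_per_node=4,
    num_gpus_per_node=1,
)
# ... run Ray workloads against the cluster ...
shutdown_ray_cluster()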

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Python adds the current working directory as the first entry of `sys.path`. In some Python versions, this leads stdlib libraries to import modules from the working directory instead of other stdlib modules (python/cpython#101210).

In our case, the conflict is `types.py`. Because `types.py` contains a PublicAPI, we don't want to just rename it without proper deprecation. Instead, we fix `sys.path` in `setup-dev.py`, as this is the only location where the problem predictably comes up.
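A minimal sketch of the kind of guard described (not the exact patch; the filtering condition is an assumption):

import os
import sys

# Drop the implicit current-working-directory entry from sys.path so stdlib
# imports cannot be shadowed by local modules such as types.py.
cwd = os.getcwd()
sys.path = [p for p in sys.path if os.path.abspath(p or cwd) != cwd]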

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Yard1 (Member) commented Jan 24, 2023

@xwjiang2010 the rebase got messed up, it seems.

xwjiang2010 (Contributor, Author)

Sorry, I messed up the commit history; closing this out.

@xwjiang2010 xwjiang2010 deleted the fix_wandb branch July 26, 2023 19:51
Successfully merging this pull request may close these issues:

[AIR] Train loop config isn't uploaded to W&B