
[wandb] Have train_loop_config logged as a config. #31790

Closed · wants to merge 28 commits

Conversation

xwjiang2010 (Contributor) commented Jan 19, 2023

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Why are these changes needed?

Have train_loop_config logged as a config in wandb UI.

This is done by surfacing Trainer._param_dict into Tuner.param_space. Currently, this is only done for train_loop_config.
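For illustration, a minimal sketch of the extraction described above, assuming a helper on the Trainer named _extract_fields_for_tuner_param_space (a paraphrase of the idea, not the exact diff, which appears later in the review thread):

import copy

# Sketch: copy only train_loop_config (when present) out of the Trainer's
# _param_dict so the Tuner can surface it in param_space and, in turn,
# the wandb callback can log it as a config.
def _extract_fields_for_tuner_param_space(self) -> dict:
    result = {}
    for key in ["train_loop_config"]:
        if key in self._param_dict:
            result[key] = copy.deepcopy(self._param_dict[key])
    return result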

Related issue number

Closes #31221

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
for key in self._param_dict.keys():
    if key in ["train_loop_config"]:
        result[key] = copy.deepcopy(self._param_dict[key])
return result
amogkam (Contributor) commented Jan 19, 2023


can this just be self._param_dict.get("train_loop_config", {})?

xwjiang2010 (Contributor, Author)

Ah, I see. I would prefer that if train_loop_config is not in the picture, we don't upload something like "train_loop_config=None or {}" to wandb.

I inverted the inner and outer loops to make it more efficient.

Also added a test to assert that train_loop_config isn't always uploaded.

Member

> we should not upload something like "train_loop_config=None or {}" to wandb.

Not sure if I'm misunderstanding this comment, but doesn't this happen with the current implementation?

Like, if your trainer looks like

TorchTrainer(
    ...
    train_loop_config=None  # Or `train_loop_config` isn't specified
)

Then "train_loop_config" in self._param_space and _extract_fields_for_tuner_param_space returns {"train_loop_config": None}?

xwjiang2010 (Contributor, Author)

Hmm, I am confused - maybe I am missing something here.

For

TorchTrainer(
    ...
    train_loop_config=None 
)

it will be uploaded as None. But I don't think that's a big deal, as the user explicitly specified it (for no good reason, IMO).

However, if it's

TorchTrainer(
    ...
)

Then it won't show up. It's not in self._param_space.

I printed self._param_space for reference:

{'train_loop_per_worker': <function test_trainer_wandb_integration.<locals>.train_func at 0x7fdb3b108a70>, 'scaling_config': ScalingConfig(num_workers=2), 'run_config': RunConfig(callbacks=[<ray.air.tests.mocked_wandb_integration.WandbTestExperimentLogger object at 0x7fdb3b103fd0>])}

Also see more in the added unit test.

Member

You're right, I misread the __new__ implementation; train_loop_config isn't included in self._param_space if it isn't explicitly specified.
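(For readers following along, a minimal sketch of the idea being discussed, using a simplified constructor; this is not Ray's actual BaseTrainer.__new__:)

class SketchTrainer:
    # Only keyword arguments the caller actually passed end up in _param_dict,
    # so an omitted train_loop_config never reaches param_space.
    def __new__(cls, **kwargs):
        obj = super().__new__(cls)
        obj._param_dict = dict(kwargs)
        return obj

assert "train_loop_config" not in SketchTrainer(scaling_config="...")._param_dict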

xwjiang2010 and others added 2 commits January 19, 2023 16:56
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
bveeramani (Member) left a comment

LGTM!

xwjiang2010 and others added 13 commits January 23, 2023 08:29
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
…tches_benchmark_single_node test (ray-project#31864)

The release test iter_tensor_batches_benchmark_single_node failed its most recent run due to the same issue discussed/addressed in ray-project#31752 and ray-project#31493 (the actual error message is: botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This PR updates one remaining test to match that convention.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
…ject#31865)

Signed-off-by: rickyyx <rickyx@anyscale.com>

We added the job id to the log prefix in ray-project#31772, breaking the test.

This fixes the test to reflect that change.
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com>

Added some documentation about using existing Grafana instances to the monitoring docs, as suggested in the Slack channel.

Fixed version of PR ray-project#31633
Some Nevergrad search algorithms have required inputs, such as `budget` for the `NgOpt` search algorithm, but the NevergradSearch class currently provides no way to pass these parameters down to the search algorithm. I would propose adding something like optimizer_kwargs to NevergradSearch that gets passed to the optimizer when instantiating it.
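A hedged sketch of the proposed usage (optimizer_kwargs is the hypothetical new parameter described above; the rest follows the existing NevergradSearch API as I understand it):

import nevergrad as ng
from ray.tune.search.nevergrad import NevergradSearch

# Hypothetical: optimizer_kwargs would be forwarded to NgOpt(...) when
# NevergradSearch instantiates the optimizer, supplying its required budget.
searcher = NevergradSearch(
    optimizer=ng.optimizers.NgOpt,
    optimizer_kwargs={"budget": 100},
    metric="loss",
    mode="min",
)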

Signed-off-by: yhna <yhna@si-analytics.ai>
Signed-off-by: YH <100389977+yhna940@users.noreply.github.com>
Signed-off-by: Younghwan Na <100389977+yhna940@users.noreply.github.com>
Signed-off-by: yhna940 <yhna940@gmail.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
…ject#31858)

This reverts commit 608276b.

Looks like this breaks backward compatibility in RLlib (it is supposed to print a warning first, but it prints an info log instead).
…shed tasks (ray-project#31761)

This PR handles 2 edge cases when marking tasks as failed:

When a job finishes, tasks that are still running should be marked as failed.
Don't override a task's finished or failed timestamp when an ancestor failed.
For 1:

It adds a handler function OnJobFinished as a job-finish listener in the GcsJobManager, so when a job is marked as finished, OnJobFinished is called to mark any non-terminated tasks as failed.
For 2:

It adds an ancestor_failed_ts field to keep track of the ancestor failure time in the task tree.
This extra bit of info is necessary since we should not override the timestamps of any already failed or finished child tasks, but we also need to know whether a task subtree has already been traversed (with all non-terminated children marked as failed) without re-traversing the task tree.
When adding a new task event, if the task fails or its ancestor failed, its failed_ts or ancestor_failed_ts (respectively) is set, and we traverse into the child task tree.
During the tree traversal, when a task already has its failed_ts or ancestor_failed_ts set, its children must already have been traversed at the time that timestamp was set, so the traversal can stop there (see the sketch below).
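An illustrative Python sketch of the traversal rule described above (the real logic lives in the C++ GCS task manager; the names and structure here are assumptions):

def on_task_failed(task, failure_ts):
    # Entry point: a task just failed; record its failure and propagate to
    # descendants that have not already been handled.
    task.failed_ts = failure_ts
    for child in task.children:
        mark_ancestor_failed(child, failure_ts)

def mark_ancestor_failed(task, failure_ts):
    # Never override an already finished or failed task's own timestamps.
    if task.finished_ts is not None or task.failed_ts is not None:
        return
    # If ancestor_failed_ts is already set, this subtree was traversed when
    # that timestamp was set, so stop here.
    if task.ancestor_failed_ts is not None:
        return
    task.ancestor_failed_ts = failure_ts
    for child in task.children:
        mark_ancestor_failed(child, failure_ts)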
@@ -0,0 +1,80 @@
"""
Contributor

Can we move this into air/tests/test_integration_wandb? It's confusing to have two separate files to test the same integration.

xwjiang2010 (Contributor, Author)

Hmm, I actually find putting all tests about the wandb integration in one file kind of confusing.

My understanding of air/tests/test_integration_wandb is that it covers the logger callback as well as the mixin. Those are the fundamental primitives of the integration, thus they live in air/.

This test, on the other hand, tests the e2e story of the Train integration - i.e., whether certain fields pertaining to the Trainer actually show up in the wandb UI. (You can imagine there will be another one that tests the e2e Tune integration as well - maybe with more emphasis on the behavior of multiple trials.)

IMO, dumping all of them into a single file feels crowded.

python/ray/train/base_trainer.py (outdated review thread, resolved)
amogkam and others added 8 commits January 23, 2023 17:04
…project#31824)

Signed-off-by: amogkam <amogkamsetty@yahoo.com>

Closes ray-project#30333.

Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize ethernet. This is to avoid this issue: anyscale/product#8310.

However, this default whitelist is not fully exhaustive, and prevents users from doing distributed GPU training over wireless: ray-project#30333.

Instead, we switch to a blacklist so that NCCL does not use the veth interface, which resolves both issues (thanks @cadedaniel for identifying this!).
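For reference, NCCL's interface selection is controlled with the NCCL_SOCKET_IFNAME environment variable, which supports an exclusion ("blacklist") syntax via a leading ^. A hedged sketch of the kind of default described above (the exact value Ray Train sets may differ):

import os

# Exclude virtual ethernet interfaces instead of whitelisting physical ones.
# "^veth" tells NCCL to skip interfaces whose names start with "veth".
os.environ.setdefault("NCCL_SOCKET_IFNAME", "^veth")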

Signed-off-by: amogkam <amogkamsetty@yahoo.com>
…roject#31873)

We test grpcio prereleases to catch early regressions. We found that the latest prerelease is problematic (grpc/grpc#31885). To get a better signal, let's temporarily stop testing grpcio prereleases until it's fixed.
…oject#31397)

- Ray on Spark creates the Spark job using stage-level scheduling, so that the Ray cluster Spark job can use a different task resources config (spark.task.cpus / spark.task.resource.gpu.amount); otherwise it has to use the Spark application-level config, which is inconvenient on Databricks. 2 new arguments are added: num_cpus_per_node and num_gpus_per_node (a usage sketch follows after this list).

- Improve the Ray worker memory allocation computation.

- Refactor the _init_ray_cluster interface to better fit instrumentation logging patching (make arguments keyword-only, adjust some arguments, and ensure all arguments are validated values).
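A usage sketch under stated assumptions: the function names setup_ray_cluster / shutdown_ray_cluster and the exact argument names are assumptions based on the description above, not confirmed signatures.

from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

# Request per-node resources directly; with stage-level scheduling these no
# longer have to match the Spark application-level task config.
setup_ray_cluster(
    num_worker_nodes=2,
    num_cpus_per_node=4,
    num_gpus_per_node=1,
)
# ... run Ray workloads against the cluster ...
shutdown_ray_cluster()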

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Python adds the current working directory as the first entry of `sys.path`. In some Python versions, this leads stdlib libraries to import modules from the working directory instead of other stdlib modules (python/cpython#101210).

In our case, the conflict is `types.py`. Because `types.py` contains a PublicAPI, we don't want to just rename it without proper deprecation. Instead, we fix `sys.path` in `setup-dev.py`, as this is the only location where the problem predictably comes up.
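A minimal sketch of the kind of guard described (not the exact patch; the filtering condition is an assumption):

import os
import sys

# Drop the implicit current-working-directory entry from sys.path so stdlib
# imports cannot be shadowed by local modules such as types.py.
cwd = os.getcwd()
sys.path = [p for p in sys.path if os.path.abspath(p or cwd) != cwd]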

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Yard1 (Member) commented Jan 24, 2023

@xwjiang2010 the rebase got messed up, it seems.

xwjiang2010 (Contributor, Author)

Sorry, I messed up the commit history; closing this out.

@xwjiang2010 xwjiang2010 deleted the fix_wandb branch July 26, 2023 19:51
Successfully merging this pull request may close these issues:

[AIR] Train loop config isn't uploaded to W&B