[wandb] Have `train_loop_config` logged as a config. #31790
Conversation
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
```python
for key in self._param_dict.keys():
    if key in ["train_loop_config"]:
        result[key] = copy.deepcopy(self._param_dict[key])
return result
```
can this just be `self._param_dict.get("train_loop_config", {})`?
Ah, I see. I would prefer that if `train_loop_config` is not in the picture, we don't upload something like `train_loop_config=None` or `{}` to wandb.

I inverted the inner and outer loops to make it more efficient. Also added a test to assert that `train_loop_config` won't always be uploaded.
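For illustration, a minimal, hypothetical sketch of that behavior (the helper name, whitelist constant, and asserts below are illustrative assumptions, not the exact code in this PR): only whitelisted keys that were explicitly passed get surfaced.

```python
import copy

# Hypothetical whitelist of Trainer kwargs to surface into the Tuner param space.
_FIELDS_FOR_TUNER_PARAM_SPACE = ["train_loop_config"]


def extract_fields_for_tuner_param_space(param_dict: dict) -> dict:
    """Only include whitelisted fields that were explicitly passed.

    If "train_loop_config" was never specified, it is absent from
    ``param_dict`` and nothing is uploaded for it to wandb.
    """
    result = {}
    for key in _FIELDS_FOR_TUNER_PARAM_SPACE:  # iterate over the short whitelist
        if key in param_dict:
            result[key] = copy.deepcopy(param_dict[key])
    return result


# Usage:
assert extract_fields_for_tuner_param_space({}) == {}
assert extract_fields_for_tuner_param_space(
    {"train_loop_config": {"lr": 1e-3}}
) == {"train_loop_config": {"lr": 1e-3}}
```

Looping over the short whitelist first, rather than over every constructor kwarg, is the inversion mentioned above.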
> we should not upload something like "train_loop_config=None or {}" to wandb.

Not sure if I'm misunderstanding this comment, but doesn't this happen with the current implementation?
Like, if your trainer looks like

```python
TorchTrainer(
    ...
    train_loop_config=None  # Or `train_loop_config` isn't specified
)
```
Then "train_loop_config" in self._param_space
and _extract_fields_for_tuner_param_space
returns {"train_loop_config": None}
?
Hmm, I am confused - maybe I am missing something here. For

```python
TorchTrainer(
    ...
    train_loop_config=None
)
```
it will be uploaded as None. But I don't think that's a big deal, as the user explicitly specified it (for no good reason, imo).

However, if it's

```python
TorchTrainer(
    ...
)
```

then it won't show up, since it's not in `self._param_space`.
I printed `self._param_space` for reference:

```python
{'train_loop_per_worker': <function test_trainer_wandb_integration.<locals>.train_func at 0x7fdb3b108a70>, 'scaling_config': ScalingConfig(num_workers=2), 'run_config': RunConfig(callbacks=[<ray.air.tests.mocked_wandb_integration.WandbTestExperimentLogger object at 0x7fdb3b103fd0>])}
```
Also see more in the added unit test.
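As a rough sketch of the kind of assertion such a unit test makes (this builds on the hypothetical `extract_fields_for_tuner_param_space` helper sketched earlier in this thread, not the actual test file):

```python
def test_train_loop_config_not_always_uploaded():
    # Explicitly passed, even as None: the key is surfaced as-is.
    assert extract_fields_for_tuner_param_space(
        {"train_loop_config": None}
    ) == {"train_loop_config": None}

    # Never specified: the key is absent, so nothing is uploaded to wandb for it.
    param_dict = {"scaling_config": {"num_workers": 2}}
    assert "train_loop_config" not in extract_fields_for_tuner_param_space(param_dict)
```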
You're right, I misread the `__new__` implementation; `train_loop_config` isn't included in `self._param_space` if it isn't explicitly specified.
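For context, a minimal toy sketch (not Ray's actual `__new__`) of how recording only explicitly passed kwargs produces exactly this behavior:

```python
class SketchTrainer:
    """Toy stand-in: records only the kwargs the caller explicitly passed."""

    def __new__(cls, **kwargs):
        obj = super().__new__(cls)
        # Only explicitly passed kwargs end up in _param_dict; omitted ones
        # (e.g. train_loop_config) simply never appear.
        obj._param_dict = dict(kwargs)
        return obj


with_config = SketchTrainer(train_loop_config=None)
without_config = SketchTrainer()

assert "train_loop_config" in with_config._param_dict        # explicitly passed None
assert "train_loop_config" not in without_config._param_dict  # never specified
```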
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
LGTM!
…tches_benchmark_single_node test (ray-project#31864) The release test iter_tensor_batches_benchmark_single_node has failed the most recent run, due to the same issue discussed/addressed in ray-project#31752 and ray-project#31493 (the actual error message is: botocore.exceptions.DataNotFoundError: Unable to load data for: ec2/2016-11-15/endpoint-rule-set-1). This PR updates one remaining test to match this convention. Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
…project#31866)" (ray-project#31867) This reverts commit 58386d0.
…ject#31865) Signed-off-by: rickyyx <rickyx@anyscale.com> we added job id in the log prefix in ray-project#31772, breaking the test. This fixes the test to reflect the change.
Signed-off-by: Bukic Tomislav, AVL <tomislav.bukic@avl.com> Added some knowledge about using existing Grafana instances to the monitoring documentation as suggested on Slack channel. Fixed version of PR ray-project#31633
Some Nevergrad search algorithms have required inputs, such as `budget` for the `NgOpt` search algorithm, but it is not possible with the NevergradSearch class to pass these parameters down to the search algorithm. I would propose adding something like optimizer_kwargs to the NevergradSearch that get passed to the optimizer when instantiating it. Signed-off-by: yhna <yhna@si-analytics.ai> Signed-off-by: YH <100389977+yhna940@users.noreply.github.com> Signed-off-by: Younghwan Na <100389977+yhna940@users.noreply.github.com> Signed-off-by: yhna940 <yhna940@gmail.com> Co-authored-by: Justin Yu <justinvyu@anyscale.com>
…ject#31858) This reverts commit 608276b. Looks like this breaks the backward compatibility of rllib (it is supposed to print warn first, but it prints the info log).
…shed tasks (ray-project#31761) This PR handles 2 edge cases when marking tasks as failed: (1) when a job finishes, tasks that are still running should be marked as failed; (2) don't override a task's finished or failed timestamp when an ancestor failed. For 1: it adds a handler function OnJobFinished as a job-finish listener in the GcsJobManager, so when a job is marked as finished, OnJobFinished will be called to mark any non-terminated tasks as failed. For 2: it adds an ancestor_failed_ts to keep track of the ancestor failure time in the task tree. This extra bit of info is necessary because we should not override any already failed or finished child task's timestamps, but we also need to know whether a task subtree has already been traversed (with all non-terminated children marked as failed) without walking the task tree again. When adding a new task event, if the task fails or its ancestor failed, its failed_ts and ancestor_failed_ts will be set, and we will traverse into the child task tree. During the tree traversal, when a task already has its failed_ts or ancestor_failed_ts set, this means its children were already traversed when that timestamp was set.
@@ -0,0 +1,80 @@
"""
Can we move this into `air/tests/test_integration_wandb`? It's confusing to have two separate files to test the same integration.
Hmmm, I actually find putting all tests about the wandb integration in one file kind of confusing. My understanding of `air/tests/test_integration_wandb` is that it covers everything about testing the logger callback as well as the mixin. Those are the fundamental primitives of the integration, thus they live in `air/`.

This test, on the other hand, tests the e2e story of the Train integration with it - e.g. whether certain fields pertaining to the trainer will finally show up in the wandb UI. (You can imagine there will be another one that tests the e2e Tune integration with it as well - maybe with more emphasis on testing multiple trials' behavior.)

IMO, dumping all of them into a single file feels crowded.
…project#31824) Signed-off-by: amogkam amogkamsetty@yahoo.com Closes ray-project#30333. Previously, we would set a default NCCL interface whitelist in Ray Train to prioritize ethernet. This was to avoid this issue: anyscale/product#8310. However, this default whitelist is not fully exhaustive, and prevents users from doing distributed GPU training over wireless: ray-project#30333. Instead, we change to a blacklist so that NCCL does not use the veth interface, which resolves both issues (thanks @cadedaniel for identifying this!) Signed-off-by: amogkam <amogkamsetty@yahoo.com>
…roject#31873) We test grpcio prereleases to catch regressions early. We caught that the latest prerelease is problematic (grpc/grpc#31885). To get a better signal, let's temporarily stop testing grpcio prereleases until it's fixed.
…oject#31397) - Ray on Spark creates the Spark job using stage scheduling, so that the Ray cluster Spark job can use a different task resources config (spark.task.cpus / spark.task.resource.gpu.amount); otherwise it has to use the Spark application-level config, which is inconvenient on Databricks. 2 new arguments are added: num_cpus_per_node and num_gpus_per_node. - Improve Ray worker memory allocation computation. - Refactor the _init_ray_cluster interface to make it fit better for instrumentation logging patching (make arguments keyword-only, adjust some arguments, and make all arguments validated values). Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Python adds the current working directory as the first entry to `sys.path` . In some python versions, this will lead stdlib libraries to import modules in the working directory instead of other stdlib modules (python/cpython#101210). In our case, the conflict is `types.py`. Because `types.py` contains a PublicAPI we don't want to just rename it without proper deprecation. Instead we fix the `sys.path` in `setup-dev.py` as this is the only location where the problem predictably comes up. Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
@xwjiang2010 rebase got messed up, it seems
Sorry, messed up the commit history, closing out.
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Why are these changes needed?

Have `train_loop_config` logged as a config in the wandb UI. This is done by surfacing `Trainer._param_dict` into `Tuner.param_space`. Currently, this is only done for `train_loop_config`.
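A hedged usage sketch of the user-facing effect; the import paths, the `project` name, and the config values are assumptions based on the Ray AIR API around the time of this PR, not code from the PR itself:

```python
from ray.air.config import RunConfig, ScalingConfig
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.train.torch import TorchTrainer


def train_func(config):
    # `config` here is the train_loop_config passed to the trainer.
    print("lr:", config["lr"])


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    # With this change, this dict is surfaced into Tuner.param_space and
    # therefore shows up in the wandb run's config panel.
    train_loop_config={"lr": 1e-3, "batch_size": 32},
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(callbacks=[WandbLoggerCallback(project="my-project")]),
)
trainer.fit()
```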
Related issue number
Closes #31221
Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.