- ✅ You use wandb/Weights & Biases to record your machine learning trials?
- ✅ Your ML experiments run on compute nodes without internet access (for example, using a batch system)?
- ✅ Your compute nodes and your head/login node (with internet) have access to a shared file system?
Then this package might be useful. For alternatives, see below.
You have probably been setting export WANDB_MODE="offline" on the compute nodes and then running something like
cd /.../result_dir/
for d in $(ls -t -d */); do cd $d; wandb sync --sync-all; cd ..; done
from your head node (with internet access) every now and then.
However, obviously this is not very satisfying as it doesn't update live.
Sure, you could throw this in a while True
loop, but if you have a lot of trials in your directory, this will take forever, cause unnecessary network traffic and it's just not very elegant.
- You add a hook that is called every time an epoch concludes (that is, when we want to trigger a sync).
- You start the wandb-osh script on your head node with internet access. This script will now trigger wandb sync upon request from one of the compute nodes.
Very simple: Every time an epoch concludes, the hook gets called and creates a file in the communication directory (~/.wandb_osh_command_dir by default).
The wandb-osh script that is running on the head node (with internet) reads these files and performs the synchronization.
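The file-based handshake described above can be pictured with a minimal sketch. This is not the package's actual implementation: the function names `request_sync` and `process_requests`, the `.command` file suffix, and the file contents are assumptions made for illustration. Only the `wandb sync <directory>` call is the real CLI.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def request_sync(command_dir: Path, run_dir: Path) -> Path:
    """Compute-node side: drop a small file that names the run to sync."""
    command_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one command file per run directory,
    # so repeated requests for the same run overwrite each other.
    command_file = command_dir / f"{run_dir.name}.command"
    command_file.write_text(str(run_dir))
    return command_file


def default_sync(run_dir: Path) -> None:
    """Head-node side: upload one offline run with the wandb CLI."""
    subprocess.run(["wandb", "sync", str(run_dir)], check=False)


def process_requests(
    command_dir: Path, sync: Callable[[Path], None] = default_sync
) -> List[Path]:
    """Head-node side: handle every pending request once, then remove it."""
    synced = []
    for command_file in sorted(command_dir.glob("*.command")):
        run_dir = Path(command_file.read_text().strip())
        # Remove the request before syncing so a new one can be queued meanwhile.
        command_file.unlink()
        sync(run_dir)
        synced.append(run_dir)
    return synced
```

In this sketch the head node would call `process_requests` repeatedly with a short sleep in between, while each compute node only ever writes small files to the shared file system.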
With ray tune, you can use your ray head node as the place to synchronize from (rather than deploying it via the batch system as well, as the current docs suggest). See the note below or my demo repository.
Similar strategies might be possible for wandb
as well (let me know!).
pip3 install wandb-osh
For completeness, the extras lightning and ray are provided, but they only ensure that the corresponding package is installed.
For example,
pip3 install 'wandb-osh[lightning]'
also installs pytorch lightning if it is not already present, but has no other effect.
For development, make sure to also include the testing extra:
pip3 install --editable '.[testing]'
Two steps: Set up the hook, then run the script from your head node.
With pure wandb
Let's adapt the simple pytorch example from the wandb docs (it only takes 3 lines!):
import torch.nn.functional as F
import wandb
from wandb_osh.hooks import TriggerWandbSyncHook  # <-- New!

trigger_sync = TriggerWandbSyncHook()  # <-- New!

wandb.init(config=args, mode="offline")

model = ...  # set up your model

# Magic
wandb.watch(model, log_freq=100)

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()  # reset gradients from the previous step
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    if batch_idx % args.log_interval == 0:
        wandb.log({"loss": loss})
        trigger_sync()  # <-- New!
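If your loop logs very frequently, you may not want to request a sync at every logging step. The sketch below wraps any trigger callable so that it fires at most once per time interval. `ThrottledTrigger`, its `min_interval` parameter and the injectable `clock` are hypothetical helpers for illustration and are not part of wandb-osh.

```python
import time
from typing import Callable, Optional


class ThrottledTrigger:
    """Call ``trigger`` at most once every ``min_interval`` seconds."""

    def __init__(
        self,
        trigger: Callable[[], None],
        min_interval: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._trigger = trigger
        self._min_interval = min_interval
        self._clock = clock
        self._last_call: Optional[float] = None

    def __call__(self) -> bool:
        """Forward the call if enough time has passed; return whether it fired."""
        now = self._clock()
        too_soon = (
            self._last_call is not None
            and now - self._last_call < self._min_interval
        )
        if too_soon:
            return False
        self._last_call = now
        self._trigger()
        return True
```

Under these assumptions, `trigger_sync = ThrottledTrigger(TriggerWandbSyncHook(), min_interval=120)` could replace the plain hook in the loop above without any other change.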
With pytorch lightning
Simply add the TriggerWandbSyncLightningCallback
to your list of callbacks and you're good to go!
from wandb_osh.lightning_hooks import TriggerWandbSyncLightningCallback # <-- New!
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer
logger = WandbLogger(
    project="project",
    group="group",
    offline=True,
)

model = MyLightningModule()
trainer = Trainer(
    logger=logger,
    callbacks=[TriggerWandbSyncLightningCallback()],  # <-- New!
)
trainer.fit(model, train_dataloader, val_dataloader)
With ray tune
Note: With ray tune, you might not need this package! While the approach suggested in the ray tune SLURM docs deploys the ray head on a worker node as well (so it doesn't have internet), this isn't actually needed. Instead, you can run the ray head and the tuning script on the head node and only submit batch jobs for your workers. This way, wandb is called from the head node, where internet access is no problem. For more information on this approach, take a look at my demo repository.
You probably already use the WandbLoggerCallback. We simply add a second callback for wandb-osh (it only takes two new lines!):
import os
from wandb_osh.ray_hooks import TriggerWandbSyncRayHook # <-- New!
os.environ["WANDB_MODE"] = "offline"
callbacks = [
    WandbLoggerCallback(...),  # <-- ray tune documentation tells you about this
    TriggerWandbSyncRayHook(),  # <-- New!
]

tuner = tune.Tuner(
    trainable,
    tune_config=...,
    run_config=RunConfig(
        ...,
        callbacks=callbacks,
    ),
)
With anything else
Simply take the TriggerWandbSyncHook class and use it as a callback in your training loop (as in the wandb example above), passing the directory that wandb is syncing to as an argument.
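As a rough sketch of this pattern, the loop below accepts any callable as an end-of-epoch callback and passes it the log directory. `run_training`, `train_one_epoch` and `log_dir` are placeholder names invented for this example; with wandb-osh installed, a `TriggerWandbSyncHook()` instance could be passed as `on_epoch_end`.

```python
from pathlib import Path
from typing import Callable, List


def run_training(
    num_epochs: int,
    train_one_epoch: Callable[[int], float],
    on_epoch_end: Callable[[Path], None],
    log_dir: Path,
) -> List[float]:
    """Run a generic training loop and notify the callback after every epoch."""
    losses = []
    for epoch in range(num_epochs):
        losses.append(train_one_epoch(epoch))
        # This is the point where a sync request would be triggered.
        on_epoch_end(log_dir)
    return losses
```

Keeping the callback as a plain callable means the training code does not need to know anything about wandb-osh, and the hook can be swapped for a no-op when running with internet access.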
After installation, you should have a wandb-osh script in your $PATH. Simply call it like this:
wandb-osh
The output will look something like this:
INFO: Starting to watch /home/kl5675/.wandb_osh_command_dir
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/b1f60706 ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_b1f60706_4_attr_pt_thld=0.0273,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-42
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/92a3ef1b ... done.
INFO: Finished syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_92a3ef1b_1_attr_pt_thld=0.0225,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-07-49
INFO: Syncing /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17
Find logs at: /home/kl5675/ray_results/tcn-perfect-test-sync/DynamicTCNTrainable_a2caa9c0_2_attr_pt_thld=0.0092,batch_size=1,focal_alpha=0.2500,focal_gamma=2.0000,gnn_tracking_experiments_has_2022-11-03_17-08-17/wandb/debug-cli.kl5675.log
Syncing: https://wandb.ai/gnn_tracking/gnn_tracking/runs/a2caa9c0 ... done.
Take a look at wandb-osh --help or check the documentation for all command line options.
You can add options to the wandb sync call by placing them after --. For example:
wandb-osh -- --sync-all
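The `--` separator follows a common command line convention: everything before it belongs to the wrapper, and everything after it is forwarded unchanged. The sketch below illustrates that convention in general terms; `split_forwarded_args` is a hypothetical helper, and wandb-osh may parse its arguments differently.

```python
from typing import List, Sequence, Tuple


def split_forwarded_args(argv: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split arguments at the first ``--`` into (own options, forwarded options)."""
    args = list(argv)
    if "--" not in args:
        return args, []
    index = args.index("--")
    return args[:index], args[index + 1 :]


own, forwarded = split_forwarded_args(["--", "--sync-all"])
# Forwarded options are appended to the ``wandb sync`` command.
sync_command = ["wandb", "sync", *forwarded]
print(sync_command)  # prints ['wandb', 'sync', '--sync-all']
```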
I get the warning "wandb: NOTE: use wandb sync --sync-all to sync 1 unsynced runs from local directory."
You can start wandb-osh with wandb-osh -- --sync-all to always synchronize all available runs.
How can I suppress logging messages (e.g., warnings about the syncing not being fast enough)?
import wandb_osh
# for wandb_osh.__version__ >= 1.2.0
wandb_osh.set_log_level("ERROR")
pip3 install pre-commit
pre-commit install
Your help is greatly appreciated! Suggestions, bug reports and feature requests are best opened as GitHub issues. You are also very welcome to submit a pull request!
Bug reports and pull requests are credited with the help of the allcontributors bot.
- Barthelemy Meynard-Piganeau 🐛
- MoH-assan 🐛
- Cedric Leonard 💻 🐛