Example with collecting timestamp of the metrics #970
New file: Katib Experiment manifest (@@ -0,0 +1,53 @@)

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: metrics-with-timestamp
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --momentum
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: docker.io/andreyvelichkevich/timestamp-metric
                imagePullPolicy: Always
                command:
                - "python"
                - "/var/mnist.py"
                - "--epochs=3"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
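For clarity: when Katib runs a trial, it renders the Go template above, filling in {{.Trial}}, {{.NameSpace}}, and the suggested hyperparameters, so each suggestion becomes an extra --name=value flag on the container command. A minimal Python sketch of that flag assembly (the lr/momentum values are hypothetical; the real substitution is performed by Katib's template rendering, not by this snippet):

# Hypothetical suggested values for one trial; Katib picks these from the
# feasibleSpace ranges defined in the Experiment above.
hyper_parameters = [("--lr", "0.021"), ("--momentum", "0.52")]

# Mirrors the template's {{- range .HyperParameters}} loop: each suggestion
# is appended to the fixed command as "--name=value".
command = ["python", "/var/mnist.py", "--epochs=3"]
command += ["{}={}".format(name, value) for name, value in hyper_parameters]

print(" ".join(command))
# python /var/mnist.py --epochs=3 --lr=0.021 --momentum=0.52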
New file: Dockerfile (@@ -0,0 +1,6 @@)

FROM pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime

WORKDIR /var

ADD mnist.py /var

ENTRYPOINT ["python", "/var/mnist.py"]

Review thread on WORKDIR /var:
- I don't believe /var is an appropriate location for the script. If you check the FHS (https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard), I would suggest using … WDYT?
- Good catch. What do you think about …?
- Any particular reason you do not want to go with the existing directory structure in …?
- Do you mean in …?
- Is this an answer to @johnugeorge's question? :-)
- To this question :)
- Ah, sorry :) As I suggested in my first comment, I would go for …
- @johnugeorge @hougangliu What do you think about …?
New file: mnist.py (@@ -0,0 +1,154 @@)

from __future__ import print_function

import argparse
import logging
import os

from torchvision import datasets, transforms
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get('WORLD_SIZE', 1))

# Use this format (%Y-%m-%dT%H:%M:%SZ) to record the timestamp of the metrics
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.DEBUG)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            msg = 'Train Epoch: {} [{}/{} ({:.0f}%)]\tloss: {:.4f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item())
            print(msg)
            logging.debug(msg)
            niter = epoch * len(train_loader) + batch_idx


def test(args, model, device, test_loader, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    # These metrics will be saved by the Metrics Collector
    logging.info('Test after Epoch: {}'.format(epoch))
    logging.info('accuracy={}'.format(float(correct) / len(test_loader.dataset)))
    logging.info('loss={}'.format(test_loss))


def should_distribute():
    return dist.is_available() and WORLD_SIZE > 1


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    if dist.is_available():
        parser.add_argument('--backend', type=str, help='Distributed backend',
                            choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
                            default=dist.Backend.GLOO)
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()
    if use_cuda:
        print('Using CUDA')

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    if should_distribute():
        print('Using distributed PyTorch with {} backend'.format(args.backend))
        dist.init_process_group(backend=args.backend)

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])),
        batch_size=args.test_batch_size, shuffle=False, **kwargs)

    model = Net().to(device)

    if is_distributed():
        Distributor = nn.parallel.DistributedDataParallel if use_cuda \
            else nn.parallel.DistributedDataParallelCPU
        model = Distributor(model)

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader, epoch)

    if (args.save_model):
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()
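With the logging configuration above, each metric is written on its own line as <timestamp> <level> <name>=<value>, which is the name=value form the Metrics Collector saves (per the comment in test()), now prefixed with a timestamp. A minimal sketch of the output this produces; the timestamp and metric values below are illustrative:

import logging

# Same format/datefmt as mnist.py: the asctime field carries the metric timestamp.
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.INFO)

# Emits lines such as (values illustrative):
#   2019-08-12T10:15:30Z INFO     accuracy=0.9831
#   2019-08-12T10:15:30Z INFO     loss=0.0512
logging.info('accuracy={}'.format(0.9831))
logging.info('loss={}'.format(0.0512))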
Review comments:

- I have added you to kubeflowkatib; you can push kubeflowkatib/timestamp-metric now. BTW, I think you can rename the image to kubeflowkatib/mxnet-mnist-example so that all examples have the timestamp.

- I suggest renaming the folder and file name from metrics-with-timestamp to mxnet-mnist-example as well. There is no need to highlight the term with-timestamp.

- This is not an mxnet-mnist-example. I took the pytorch-mnist code from here: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/file-metrics-collector/mnist.py. Should I change the example to MXNet in that case? The MXNet example is this one: https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_mnist.py. Or we can name this example kubeflowkatib/pytorch-mnist-example?

- I missed it. IIRC @richardsliu created this image to show MXNet support.

- @johnugeorge So what can we do with this example?

- @richardsliu Can you point to the Dockerfile used?