Example with collecting timestamp of the metrics #970

Merged
Commits (21)
bebf6c0  Increase Suggestion memLimit (andreyvelich, Dec 9, 2019)
f1bd3ca  Create getSuggestionConfigData function (andreyvelich, Dec 10, 2019)
cafd115  Change memLimit for nasrl (andreyvelich, Dec 10, 2019)
2e57536  Merge remote-tracking branch 'upstream/master' into increase-suggesti… (andreyvelich, Dec 10, 2019)
f08fd42  Change resources format for katib-config (andreyvelich, Dec 11, 2019)
c8acc67  Merge remote-tracking branch 'upstream/master' into issue-944-timesta… (andreyvelich, Dec 11, 2019)
30eaad5  Create example with recording metrics timestamp (andreyvelich, Dec 11, 2019)
f840692  Merge remote-tracking branch 'upstream/master' into issue-944-timesta… (andreyvelich, Dec 12, 2019)
1aa3a8e  Add comment line (andreyvelich, Dec 12, 2019)
d52bd9d  Merge remote-tracking branch 'upstream/master' into issue-944-timesta… (andreyvelich, Jan 8, 2020)
923902a  Change example from pytorch to mxnet (andreyvelich, Jan 8, 2020)
9954e6b  Delete find_mxnet file (andreyvelich, Jan 8, 2020)
d3cbdda  Change mxnet-mnist-timestamp to mxnet-mnist (andreyvelich, Jan 10, 2020)
68979df  Merge remote-tracking branch 'upstream/master' into issue-944-timesta… (andreyvelich, Jan 10, 2020)
b3ca4ae  Reduce num epochs in grid (andreyvelich, Jan 10, 2020)
a714749  Enable autoscaling in CI cluster (andreyvelich, Jan 10, 2020)
933fb3c  Add max nodes (andreyvelich, Jan 10, 2020)
6710326  Add num nodes 6 (andreyvelich, Jan 10, 2020)
ccbd475  Increase num nodes (andreyvelich, Jan 10, 2020)
0c651a5  Change num nodes to 6 (andreyvelich, Jan 13, 2020)
f25c13d  Remove autoscaling (andreyvelich, Jan 13, 2020)
53 changes: 53 additions & 0 deletions examples/v1alpha3/metrics-with-timestamp-example.yaml
@@ -0,0 +1,53 @@
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: metrics-with-timestamp
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --momentum
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.7"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: docker.io/andreyvelichkevich/timestamp-metric
Member:
I have added you to kubeflowkatib, so you can now push kubeflowkatib/timestamp-metric.
BTW, I think you can rename the image to kubeflowkatib/mxnet-mnist-example so that all examples get the timestamp.

Member:
I also suggest renaming the folder and file from metrics-with-timestamp to mxnet-mnist-example. There is no need to highlight the term with-timestamp.

Member Author:
This is not an mxnet-mnist-example.
I took the pytorch-mnist code from here: https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/file-metrics-collector/mnist.py.

Should I change the example to MXNet in that case? The MXNet example is this one: https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/train_mnist.py.
Or could we name this example kubeflowkatib/pytorch-mnist-example?

Member:
I missed that. IIRC @richardsliu created this image to show MXNet support.

Member Author:
@johnugeorge So what should we do with this example?

Member:
@richardsliu Can you point us to the Dockerfile that was used?

                imagePullPolicy: Always
                command:
                - "python"
                - "/var/mnist.py"
                - "--epochs=3"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
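
The Experiment above relies on the trial container logging each metric on its own line with a parseable timestamp, so the metrics collector can record when each value was produced. Below is a minimal, self-contained sketch of that logging pattern, based on the mnist.py added in this PR; the report_metrics helper and the sample values are illustrative only, not part of the committed code.

import logging

# Timestamps use the %Y-%m-%dT%H:%M:%SZ format, matching the example script.
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.INFO)


def report_metrics(epoch, accuracy, loss):
    # Metric names match objectiveMetricName / additionalMetricNames in the Experiment spec.
    logging.info("Test after Epoch: {}".format(epoch))
    logging.info("accuracy={}".format(accuracy))
    logging.info("loss={}".format(loss))


if __name__ == "__main__":
    report_metrics(epoch=1, accuracy=0.97, loss=0.08)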
6 changes: 6 additions & 0 deletions examples/v1alpha3/metrics-with-timestamp/Dockerfile
@@ -0,0 +1,6 @@
FROM pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime

WORKDIR /var
Member:
I don't believe /var is an appropriate location for the script. If you check the FHS (https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard), /var should be used for variable (changing) files.

I would suggest using /opt, and ideally a subdirectory in there, e.g. /opt/mnist/.

WDYT?

Member Author:
Good catch. What do you think about /app?

Member:
Any particular reason you do not want to go with the existing directory structure in /? I don't think it is good practice on Linux systems to put code in "random" dirs in /, but feel free to convince me otherwise :)

Member Author:
Do you mean in /examples/v1alpha3/metrics-with-timestamp?

Member:
> Do you mean in /examples/v1alpha3/metrics-with-timestamp?

Is this an answer to @johnugeorge's question? :-)

Member Author:
> Any particular reason you do not want to go with the existing directory structure in /? I don't think it is good practice on Linux systems to put code in "random" dirs in /, but feel free to convince me otherwise :)

To that question :)
Which WORKDIR do you think would be better?

Member:
Ah, sorry :) As I suggested in my first comment, I would go for /opt/mnist/mnist.py, or, to make it extremely obvious what it is (although that is already captured in the image name, but you never know...), /opt/metrics-with-timestamp/mnist.py or something along those lines.

Member Author (@andreyvelich, Dec 18, 2019):
@johnugeorge @hougangliu What do you think about setting the example's WORKDIR so the script lives at /opt/metrics-with-timestamp/mnist.py?

ADD mnist.py /var

ENTRYPOINT ["python", "/var/mnist.py"]
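
For reference, this is a minimal sketch of the layout suggested in the review thread above, using a hypothetical /opt/mnist directory instead of /var; the Dockerfile as committed in this PR still uses /var.

FROM pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime

# Keep the training script under /opt, per the FHS discussion above.
WORKDIR /opt/mnist
ADD mnist.py /opt/mnist/

ENTRYPOINT ["python", "/opt/mnist/mnist.py"]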
154 changes: 154 additions & 0 deletions examples/v1alpha3/metrics-with-timestamp/mnist.py
@@ -0,0 +1,154 @@
from __future__ import print_function

import argparse
import logging
import os

from torchvision import datasets, transforms
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get('WORLD_SIZE', 1))

# Use this format (%Y-%m-%dT%H:%M:%SZ) to record timestamp of the metrics
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
    level=logging.DEBUG)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            msg = 'Train Epoch: {} [{}/{} ({:.0f}%)]\tloss: {:.4f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item())
            print(msg)
            logging.debug(msg)
            niter = epoch * len(train_loader) + batch_idx


def test(args, model, device, test_loader, epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    # These metrics will be saved by the Metrics Collector
    logging.info('Test after Epoch: {}'.format(epoch))
    logging.info('accuracy={}'.format(float(correct) / len(test_loader.dataset)))
    logging.info('loss={}'.format(test_loss))


def should_distribute():
    return dist.is_available() and WORLD_SIZE > 1


def is_distributed():
    return dist.is_available() and dist.is_initialized()


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    if dist.is_available():
        parser.add_argument('--backend', type=str, help='Distributed backend',
                            choices=[dist.Backend.GLOO, dist.Backend.NCCL, dist.Backend.MPI],
                            default=dist.Backend.GLOO)
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()
    if use_cuda:
        print('Using CUDA')

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    if should_distribute():
        print('Using distributed PyTorch with {} backend'.format(args.backend))
        dist.init_process_group(backend=args.backend)

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])),
        batch_size=args.test_batch_size, shuffle=False, **kwargs)

    model = Net().to(device)

    if is_distributed():
        Distributor = nn.parallel.DistributedDataParallel if use_cuda \
            else nn.parallel.DistributedDataParallelCPU
        model = Distributor(model)

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(args, model, device, test_loader, epoch)

    if (args.save_model):
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()