Error running katib on latest master (04/13) #44

Closed
ddysher opened this issue Apr 13, 2018 · 7 comments

@ddysher
Member

ddysher commented Apr 13, 2018

After deploying katib by following the getting started guide, I see the following errors:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY     STATUS             RESTARTS   AGE
katib         dlk-manager-698ccb5fdc-hb7xc              0/1       CrashLoopBackOff   6          13m
katib         modeldb-backend-6855d95fb4-2sxw9          1/1       Running            0          14m
katib         modeldb-db-6cf5bb764-5s65f                1/1       Running            0          14m
katib         modeldb-frontend-5868bffc64-rhrr7         1/1       Running            0          14m
katib         vizier-core-86c5566c88-kvsp9              0/1       CrashLoopBackOff   6          13m
katib         vizier-db-64557596dc-mpgh4                1/1       Running            0          13m
katib         vizier-suggestion-random-6b4d6db6-m8l94   0/1       CrashLoopBackOff   6          13m
kube-system   kube-dns-5c6c5b55b-qmd9l                  3/3       Running            0          16m

I've managed to get it running; it turns out the command is not correct. For example, I have to change this:

    spec:
      serviceAccountName: vizier-core
      containers:
      - name: vizier-core
        image: katib/vizier-core
        args:
          - "-w"
          - "dlk"
        ports:
        - name: api
          containerPort: 6789

to

    spec:
      serviceAccountName: vizier-core
      containers:
      - name: vizier-core
        image: katib/vizier-core
        args:
          - ./vizier-manager    # <-- add this line
          - "-w"
          - "dlk"
        ports:
        - name: api
          containerPort: 6789

However, according to the Dockerfile for vizier-core, vizier-manager is already set as the entrypoint:

FROM golang:alpine AS build-env
# The GOPATH in the image is /go.
ADD . /go/src/github.com/kubeflow/hp-tuning
WORKDIR /go/src/github.com/kubeflow/hp-tuning/manager
RUN go build -o vizier-manager

FROM alpine:3.7
WORKDIR /app
COPY --from=build-env /go/src/github.com/kubeflow/hp-tuning/manager/vizier-manager /app/
COPY --from=build-env /go/src/github.com/kubeflow/hp-tuning/manager/visualise /
ENTRYPOINT ["./vizier-manager"]
CMD ["-w", "dlk"]

Is anything wrong with the setup above 👆?

/cc @gaocegege @YujiOshima
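
For reference, the interaction between the image's ENTRYPOINT/CMD and the pod's command/args fields can be sketched as below. This is only an illustration of the documented Kubernetes override rules, not code from katib. The function name `effective_argv` is made up for this note, and the case of an image without an ENTRYPOINT is a hypothetical, since the thread does not show what the published image contained.

```python
from typing import List, Optional


def effective_argv(
    entrypoint: List[str],
    cmd: List[str],
    command: Optional[List[str]] = None,
    args: Optional[List[str]] = None,
) -> List[str]:
    """Return the argv a container would run.

    entrypoint/cmd are the image defaults (Dockerfile ENTRYPOINT and CMD);
    command/args are the optional overrides from the pod spec.
    """
    if command is None and args is None:
        # No overrides: image defaults are used as-is.
        return list(entrypoint) + list(cmd)
    if command is not None and args is None:
        # command alone replaces ENTRYPOINT and drops the image CMD.
        return list(command)
    if command is None:
        # args alone replaces CMD; the image ENTRYPOINT is kept.
        return list(entrypoint) + list(args)
    # Both given: image defaults are ignored entirely.
    return list(command) + list(args)


# Image defaults as shown in the Dockerfile above.
image_entrypoint = ["./vizier-manager"]
image_cmd = ["-w", "dlk"]

# Original manifest: args only.
print(effective_argv(image_entrypoint, image_cmd, args=["-w", "dlk"]))

# Hypothetical image with no ENTRYPOINT, using the modified manifest.
print(effective_argv([], [], args=["./vizier-manager", "-w", "dlk"]))
```

Under these rules `args` only replaces CMD, so with the Dockerfile above the original manifest would already run `./vizier-manager -w dlk`. The extra `./vizier-manager` entry would only be needed if the image actually pulled had a different (or no) ENTRYPOINT.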

@YujiOshima
Contributor

Hi @ddysher.
Are you using the katib/~ Docker images?
I'm sorry, I hadn't updated the images to the latest version.
The process is not automated.
I have updated the images now. Please retry.
If you still have a problem, please show me the log of vizier-core:
kubectl -n katib logs deploy/vizier-core

@gaocegege
Member

Yeah, we had not pushed the latest image. Thanks for reporting the issue! @ddysher

@ddysher
Member Author

ddysher commented Apr 16, 2018

@YujiOshima thanks

I pulled the latest image but saw the following errors while trying to list/create studies. Looks like a protobuf issue.

$ katib -s 10.0.0.59:6789 Getstudies                        
2018/04/16 09:42:16 connecting 10.0.0.59:6789
2018/04/16 09:42:16 GetStudy failed: rpc error: code = 12 desc = unknown method GetStudys

$ katib -s 10.0.0.59:6789 -f conf/random-cpu.yml Createstudy
2018/04/16 09:42:20 connecting 10.0.0.59:6789
2018/04/16 09:42:20 study conf{cifer10 root MAXIMIZE 0 configs:<name:"--lr" parameter_type:DOUBLE feasible:<max:"0.07" min:"0.03" > > configs:<name:"--lr-factor" parameter_type:DOUBLE feasible:<max:"0.2" min:"0.05" > > configs:<name:"--max-random-h" parameter_type:INT feasible:<max:"46" min:"26" > > configs:<name:"--max-random-l" parameter_type:INT feasible:<max:"75" min:"25" > > configs:<name:"--num-epochs" parameter_type:INT feasible:<max:"3" min:"3" > >  [] random median  [name:"SuggestionNum" value:"2"  name:"MaxParallel" value:"2" ] [] Validation-accuracy [accuracy] mxnet/python [python /mxnet/example/image-classification/train_cifar10.py --batch-size=512] 0 default-scheduler <nil> }
2018/04/16 09:42:20 req Createstudy
2018/04/16 09:42:20 CreateStudy failed: rpc error: code = 13 desc = grpc: error unmarshalling request: proto: can't skip unknown wire type 6 for api.Tag
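
As background for the unmarshalling error: in the protobuf wire format each field starts with a key whose low three bits are the wire type and whose remaining bits are the field number. Only wire types 0 through 5 are defined, so a 6 suggests the decoder hit bytes that do not line up with the schema it expects. A minimal sketch of that split follows; the helper name `decode_key` is illustrative and the sample keys are not taken from the actual request.

```python
from typing import Tuple

# Wire types defined by the protobuf encoding; 6 and 7 are not assigned.
WIRE_TYPES = {
    0: "VARINT",
    1: "I64",
    2: "LEN",
    3: "SGROUP",
    4: "EGROUP",
    5: "I32",
}


def decode_key(key: int) -> Tuple[int, int, str]:
    """Split an already-decoded field key into (field number, wire type, name)."""
    field_number = key >> 3
    wire_type = key & 0x7
    return field_number, wire_type, WIRE_TYPES.get(wire_type, "INVALID")


# 0x0A is field 1 with a length-delimited payload (e.g. a string).
print(decode_key(0x0A))

# 0x0E is field 1 with wire type 6, which no valid encoder produces.
print(decode_key(0x0E))
```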

@YujiOshima
Contributor

YujiOshima commented Apr 16, 2018

@ddysher Thank you for reporting! It looks like the API used by the CLI is old.

@gaocegege Did you upload a CLI built from the release version of the code?
It looks like it was built from an older commit.
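
The numeric codes in the log lines quoted earlier map to standard gRPC status names: 12 is UNIMPLEMENTED (the server does not know the method) and 13 is INTERNAL. Both fit a client built against a different version of the API than the server. The following is a small sketch for pulling the code out of such a log line. The helper name `explain_rpc_error` and the regular expression are assumptions made for this note and are not part of katib or the gRPC library.

```python
import re
from typing import Optional, Tuple

# Standard gRPC status codes and their canonical names.
GRPC_STATUS_NAMES = {
    0: "OK", 1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT",
    4: "DEADLINE_EXCEEDED", 5: "NOT_FOUND", 6: "ALREADY_EXISTS",
    7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED",
    9: "FAILED_PRECONDITION", 10: "ABORTED", 11: "OUT_OF_RANGE",
    12: "UNIMPLEMENTED", 13: "INTERNAL", 14: "UNAVAILABLE",
    15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}

# Matches the "rpc error: code = N desc = ..." text seen in the CLI output.
_RPC_ERROR = re.compile(r"rpc error: code = (\d+) desc = (.*)")


def explain_rpc_error(line: str) -> Optional[Tuple[int, str, str]]:
    """Return (code, status name, description), or None if the line has no rpc error."""
    match = _RPC_ERROR.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, GRPC_STATUS_NAMES.get(code, "UNRECOGNIZED"), match.group(2)


line = ("2018/04/16 09:42:16 GetStudy failed: rpc error: "
        "code = 12 desc = unknown method GetStudys")
print(explain_rpc_error(line))
```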

@ddysher
Member Author

ddysher commented Apr 16, 2018

@YujiOshima Thanks! I built from source and it works now.

@gaocegege
Member

gaocegege commented Apr 16, 2018

@ddysher Did you use the binary downloaded from the release? I think it works with that version but cannot work with master.

Cool, then could we close the issue?

@ddysher
Member Author

ddysher commented Apr 16, 2018

sure

@ddysher ddysher closed this as completed Apr 16, 2018