cmd/worker: run task from file #594

Merged: 7 commits, Jul 10, 2018

buchanae
Contributor

@delagoya requested a CLI entrypoint for running the worker on a task file.

This is the 20-minute hack version. It works, but needs some cleanup. Logging is currently broken, so there's no output.

Run it with: funnel worker run -f examples/hello-world.json

@delagoya

Thanks @buchanae. Will test this weekend.

QQ - by "logging is broken" do you mean that stderr/out are not captured, or that job metrics are not written to a file?

@delagoya

Getting an error. I think you forgot to add the new flag to the available options?

$ funnel worker run -f examples/hello-world.json .
Error: unknown shorthand flag: 'f' in -f

@buchanae
Contributor Author

Hrm, looks like the -f flag is there. Try --taskFile. Are you on the right branch?
https://github.com/ohsu-comp-bio/funnel/pull/594/files#diff-d08f22d6934437e90a900bb723b4b64bR83

> QQ - by "logging is broken" do you mean that stderr/out are not captured, or that job metrics are not written to a file?

I meant that the task logs/state are not written back to the database. Logs are written to stderr, I believe.
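The "unknown shorthand flag" message is the error format of pflag, the flag library used by cobra, on which Funnel's CLI is built. It only means the binary being run has no -f registered, which fits the wrong-branch explanation. As a stdlib-only illustration (not Funnel's cobra-based code), one way to give a long flag a one-letter alias is to bind both names to the same variable; parseArgs and the flag set name are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// parseArgs registers one string variable under two names, so both
// "-f x" and "--taskFile x" set it. The standard flag package accepts
// one or two leading dashes for any flag name.
func parseArgs(args []string) (string, error) {
	fs := flag.NewFlagSet("worker run", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress usage text; callers get the error value
	var taskFile string
	fs.StringVar(&taskFile, "taskFile", "", "path to a task JSON file")
	fs.StringVar(&taskFile, "f", "", "shorthand for -taskFile")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return taskFile, nil
}

func main() {
	taskFile, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("task file: %q\n", taskFile)
}
```

With pflag itself the usual form is StringVarP, which takes the shorthand letter as an extra argument; either way, the short form exists only in binaries built from code that registers it.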

@delagoya

delagoya commented Jul 1, 2018

Whoops, that was the issue. Working now, thanks!

@delagoya

delagoya commented Jul 2, 2018

Can confirm that this is working as expected for my needs. It's also helpful for local dev/test of task files. How likely is it that this gets merged into the main repo?

@buchanae
Contributor Author

buchanae commented Jul 2, 2018

Nice! We'll get this in sometime this week.

@delagoya

delagoya commented Jul 2, 2018

Actually, I just got an error running examples/s3.json:

{
    "attempt": 0,
    "cmd": "docker run -i --read-only --rm --name bct5fmd3fkig03t5idc0-0 -v /funnel_root/funnel-work-dir/bct5fmd3fkig03t5idc0/tmp:/tmp:rw -v /funnel_root/funnel-work-dir/bct5fmd3fkig03t5idc0/tmp/file.xml:/tmp/file.xml:ro ubuntu md5sum /tmp/file.xml",
    "index": 0,
    "level": "info",
    "msg": "Running command",
    "ns": "worker",
    "taskID": "bct5fmd3fkig03t5idc0",
    "time": "2018-07-02T16:50:38Z",
    "timestamp": "2018-07-02T16:50:38.494883827Z"
}
{
    "attempt": 0,
    "index": 0,
    "level": "info",
    "msg": "EXECUTOR_STDERR",
    "ns": "worker",
    "stderr": "md5sum: /tmp/file.xml: Is a directory\n",
    "taskID": "bct5fmd3fkig03t5idc0",
    "time": "2018-07-02T16:50:39Z",
    "timestamp": "2018-07-02T16:50:39.2357146Z"
}

@buchanae
Contributor Author

buchanae commented Jul 2, 2018

I'm getting a download error from that example:

...snip...

worker               download started
attempt              0
index                0
taskID               bct5kjqrl6qg6n2keosg
time                 2018-07-02T10:01:05-07:00
timestamp            2018-07-02T10:01:05.858365532-07:00
url                  s3://irs-form-990/200931393493000150_public.xml

worker               download failed
attempt              0
error                amazonS3: calling stat on URL s3://irs-form-990/200931393493000150_public.xml: Forbidden: Forbidden
                     	status code: 403, request id: FE7C7C2E0FE20D66, host id: ysnjjxSxU6aLvb5BhC/Y4Z2DW0fut9X70Zl5eChpLGZQeqUaCVITCq7d9nxDFpUbubiFKJNaHLo=
index                0
taskID               bct5kjqrl6qg6n2keosg
time                 2018-07-02T10:01:06-07:00
timestamp            2018-07-02T10:01:06.637281559-07:00
url                  s3://irs-form-990/200931393493000150_public.xml

worker               TASK_END_TIME
attempt              0
end_time             2018-07-02T10:01:06.637389348-07:00
index                0
taskID               bct5kjqrl6qg6n2keosg
time                 2018-07-02T10:01:06-07:00
timestamp            2018-07-02T10:01:06.637389964-07:00

worker               System error
attempt              0
error                amazonS3: calling stat on URL s3://irs-form-990/200931393493000150_public.xml: Forbidden: Forbidden
                     	status code: 403, request id: FE7C7C2E0FE20D66, host id: ysnjjxSxU6aLvb5BhC/Y4Z2DW0fut9X70Zl5eChpLGZQeqUaCVITCq7d9nxDFpUbubiFKJNaHLo=
index                0
taskID               bct5kjqrl6qg6n2keosg
time                 2018-07-02T10:01:06-07:00
timestamp            2018-07-02T10:01:06.637445871-07:00

worker               TASK_STATE
attempt              0
index                0
state                SYSTEM_ERROR
taskID               bct5kjqrl6qg6n2keosg
time                 2018-07-02T10:01:06-07:00
timestamp            2018-07-02T10:01:06.63748373-07:00

Error: amazonS3: calling stat on URL s3://irs-form-990/200931393493000150_public.xml: Forbidden: Forbidden
	status code: 403, request id: FE7C7C2E0FE20D66, host id: ysnjjxSxU6aLvb5BhC/Y4Z2DW0fut9X70Zl5eChpLGZQeqUaCVITCq7d9nxDFpUbubiFKJNaHLo=

@buchanae
Contributor Author

buchanae commented Jul 2, 2018

In your case, I'm guessing that the file is missing, so for the volume -v /funnel_root/funnel-work-dir/bct5fmd3fkig03t5idc0/tmp/file.xml:/tmp/file.xml:ro Docker creates the missing path as a directory and mounts that instead.

@delagoya

delagoya commented Jul 2, 2018

I've tried a different public S3 file and get the same error as above. The file downloads successfully; the error comes from md5sum, which can't handle the directory mount.

NOTE: This is using the Amazon ECS-Optimized AMI, which defaults to Docker version 17.12.1-ce.

It works fine on my local dev box (Docker version 18.03).

@delagoya

delagoya commented Jul 2, 2018

Oh wait, I was using a private S3 file. It's likely the same error: the file was not actually downloaded, but the worker reports a successful download... Is there any way to tell Funnel to leave the files on disk?

@buchanae
Contributor Author

buchanae commented Jul 2, 2018

I'm not sure I follow. How is the worker getting to the execute step if the download failed? It should be stopping if downloads fail.

We have a flag:

      --Worker.LeaveWorkDir                Leave working directory after execution

is that what you're looking for?

@delagoya

delagoya commented Jul 2, 2018

Can confirm that the file actually does download. md5sum still sees it as a directory, not a file. Is there a conflict in the ordering of volume mounts when you mount both $FUNNEL_TASK_ROOT/tmp and $FUNNEL_TASK_ROOT/tmp/file.xml? In other words, is volume mount ordering not being respected?

$ docker run -it --rm  -v `pwd`:/funnel_root -v /var/run/docker.sock:/var/run/docker.sock delagoya/funnel-wff funnel-x86-linux  worker run -f s3.json --Worker.LeaveWorkDir
worker               Version
BuildDate
GitBranch
GitCommit
GitUpstream
Version              unknown
attempt              0
index                0
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:21Z
timestamp            2018-07-02T17:34:21.700250368Z

worker               TASK_STATE
attempt              0
index                0
state                INITIALIZING
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:21Z
timestamp            2018-07-02T17:34:21.700325476Z

worker               TASK_START_TIME
attempt              0
index                0
start_time           2018-07-02T17:34:21.70041638Z
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:21Z
timestamp            2018-07-02T17:34:21.700417799Z

worker               TASK_METADATA
attempt              0
index                0
metadata             map[string]string{"hostname":"6dd9f593cf1a"}
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:21Z
timestamp            2018-07-02T17:34:21.700504237Z

worker               download started
attempt              0
index                0
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:21Z
timestamp            2018-07-02T17:34:21.992174638Z
url                  s3://angelfiles/test.bam

worker               download finished
attempt              0
etag                 "7dadb86f612119fd146cad199a4372f3"
index                0
size                 23667
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:22Z
timestamp            2018-07-02T17:34:22.089436266Z
url                  s3://angelfiles/test.bam

worker               TASK_STATE
attempt              0
index                0
state                RUNNING
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:22Z
timestamp            2018-07-02T17:34:22.089592006Z

worker               EXECUTOR_START_TIME
attempt              0
index                0
start_time           2018-07-02T17:34:22.089658157Z
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:22Z
timestamp            2018-07-02T17:34:22.089660077Z

worker               Running command
attempt              0
cmd                  docker run -i --read-only --rm --name bct647dpaflg008pl19g-0 -v /funnel_root/funnel-work-dir/bct647dpaflg008pl19g/tmp:/tmp:rw -v /funnel_root/funnel-work-dir/bct647dpaflg008pl19g/tmp/file.xml:/tmp/file.xml:ro ubuntu md5sum /tmp/file.xml
index                0
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:22Z
timestamp            2018-07-02T17:34:22.435270034Z

worker               EXECUTOR_STDERR
attempt              0
index                0
stderr               md5sum: /tmp/file.xml: Is a directory

taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:23Z
timestamp            2018-07-02T17:34:23.209915942Z

worker               EXECUTOR_END_TIME
attempt              0
end_time             2018-07-02T17:34:23.578559283Z
index                0
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:23Z
timestamp            2018-07-02T17:34:23.578561444Z

worker               EXECUTOR_EXIT_CODE
attempt              0
exit_code            1
index                0
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:23Z
timestamp            2018-07-02T17:34:23.57867295Z

worker               TASK_END_TIME
attempt              0
end_time             2018-07-02T17:34:23.578730025Z
index                0
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:23Z
timestamp            2018-07-02T17:34:23.578731651Z

worker               Exec error
attempt              0
error                exit status 1
index                0
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:23Z
timestamp            2018-07-02T17:34:23.578781723Z

worker               TASK_STATE
attempt              0
index                0
state                EXECUTOR_ERROR
taskID               bct647dpaflg008pl19g
time                 2018-07-02T17:34:23Z
timestamp            2018-07-02T17:34:23.578821673Z

Error: exit status 1
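On the mount-ordering question: our understanding is that the Docker daemon sorts a container's mounts by the depth of their destination path before setting them up, so /tmp is mounted before /tmp/file.xml regardless of the order of the -v flags. That is consistent with the thread's eventual diagnosis that ordering was not the problem. The sketch illustrates the rule in Go; mount, depth and sortMounts are hypothetical names, not Docker's code.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// mount is a hypothetical bind mount: host Src appears at container Dst.
type mount struct {
	Src, Dst string
	Mode     string // "rw" or "ro"
}

// depth counts the separators in the cleaned destination path, so
// "/tmp" has depth 1 and "/tmp/file.xml" has depth 2.
func depth(p string) int {
	return strings.Count(path.Clean(p), "/")
}

// sortMounts orders mounts parent-first. Mounting a parent after a nested
// mount would hide the nested one, so shallower destinations go first;
// SliceStable keeps the given order among mounts of equal depth.
func sortMounts(ms []mount) {
	sort.SliceStable(ms, func(i, j int) bool {
		return depth(ms[i].Dst) < depth(ms[j].Dst)
	})
}

func main() {
	work := "/funnel_root/funnel-work-dir/bct647dpaflg008pl19g"
	ms := []mount{
		{Src: work + "/tmp/file.xml", Dst: "/tmp/file.xml", Mode: "ro"},
		{Src: work + "/tmp", Dst: "/tmp", Mode: "rw"},
	}
	sortMounts(ms)
	for _, m := range ms {
		fmt.Printf("-v %s:%s:%s\n", m.Src, m.Dst, m.Mode)
	}
}
```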

@buchanae
Contributor Author

buchanae commented Jul 2, 2018

Ah, yes, that's an interesting point. Try downloading the file to a location in the container outside /tmp, e.g. /tmp/file.xml -> /inputs/file.xml.

Looks like we always add a /tmp volume, presumably so we can make the filesystem read-only.
https://github.com/ohsu-comp-bio/funnel/blob/master/worker/file_mapper.go#L75

@delagoya

delagoya commented Jul 2, 2018

Same error. I think I know what is going on. I was running Funnel as a Docker container whose WORKDIR differed from the host mount path, so the sibling container launched to run the task did not see the same filesystem as the Funnel container, causing the mismatch. This is discussed a bit in this blog article on sibling containers.

The temporary solution is to ensure that the WORKDIR and bind-mount paths given to a Docker container that launches sibling containers for the executors match the corresponding paths on the host. Longer term, the worker config may need a flag that supplies additional mounts to task containers?
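The mismatch arises because the worker talks to the host's Docker daemon through the mounted /var/run/docker.sock, so every -v source it passes is resolved on the host, not inside the worker's container. Short of mounting host directories at identical paths (for example -v /data:/data), a worker could translate its own paths back to host paths using the mounts it was started with. A sketch with a hypothetical bindMount type and toHostPath helper; the host directory /home/ec2-user/work is an invented example standing in for the `pwd` in the docker run command above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// bindMount is a hypothetical record of one -v mount given to the worker's
// own container: HostPath on the Docker host appears at ContainerPath.
type bindMount struct {
	HostPath, ContainerPath string
}

// toHostPath maps a path as seen inside the worker container to the path
// the host's Docker daemon will resolve. When several mounts match, the
// one with the longest (most specific) container path wins.
func toHostPath(p string, mounts []bindMount) (string, error) {
	p = path.Clean(p)
	best, bestLen := -1, -1
	for i, m := range mounts {
		cp := path.Clean(m.ContainerPath)
		if (cp == "/" || p == cp || strings.HasPrefix(p, cp+"/")) && len(cp) > bestLen {
			best, bestLen = i, len(cp)
		}
	}
	if best < 0 {
		return "", fmt.Errorf("%s is not under any bind mount, so the host daemon cannot see it", p)
	}
	rel := strings.TrimPrefix(p, path.Clean(mounts[best].ContainerPath))
	return path.Join(mounts[best].HostPath, rel), nil
}

func main() {
	// Hypothetical: the worker container was started with -v /home/ec2-user/work:/funnel_root.
	mounts := []bindMount{{HostPath: "/home/ec2-user/work", ContainerPath: "/funnel_root"}}
	hostTmp, err := toHostPath("/funnel_root/funnel-work-dir/bct647dpaflg008pl19g/tmp", mounts)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("-v " + hostTmp + ":/tmp:rw")
}
```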

@buchanae
Contributor Author

buchanae commented Jul 2, 2018

I've run into this before too. It takes some mental gymnastics to visualize which process puts files where. Docker-in-Docker would be simpler to reason about in this respect; I'm not sure exactly what the tradeoffs are.

> Longer term, may have to have a flag that supplies additional mounts to task containers as part of the worker config?

Yes, that sounds good and should be simple. We've talked about templating the entire executor (docker) command anyway.
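For a sense of what templating the executor command could look like, here is a sketch using Go's text/template. Funnel had no such template at the time of this thread; the template text, the execParams fields, and ExtraMounts (a stand-in for the worker-config mounts discussed above) are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// execTemplate is a hypothetical executor command template. ExtraMounts is
// where mounts from worker config could be injected for every task.
const execTemplate = `docker run -i --read-only --rm --name {{.Name}}` +
	`{{range .Volumes}} -v {{.}}{{end}}` +
	`{{range .ExtraMounts}} -v {{.}}{{end}}` +
	` {{.Image}}{{range .Command}} {{.}}{{end}}`

type execParams struct {
	Name        string
	Image       string
	Volumes     []string // per-task mounts, e.g. "/host/tmp:/tmp:rw"
	ExtraMounts []string // hypothetical worker-config mounts
	Command     []string
}

var execTmpl = template.Must(template.New("exec").Parse(execTemplate))

// renderCommand fills in the template for one executor.
func renderCommand(p execParams) (string, error) {
	var b strings.Builder
	if err := execTmpl.Execute(&b, p); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	cmd, err := renderCommand(execParams{
		Name:        "task-0",
		Image:       "ubuntu",
		Volumes:     []string{"/work/task/tmp:/tmp:rw"},
		ExtraMounts: []string{"/data:/data:ro"},
		Command:     []string{"md5sum", "/data/file.xml"},
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(cmd)
}
```

Rendering to one string is only for readability here: text/template does no shell quoting, so a real implementation would more safely produce each argument separately and pass an argv list to exec.Command rather than going through a shell.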

@buchanae
Contributor Author

buchanae commented Jul 6, 2018

@delagoya we were chatting about the sibling containers issue, and another idea was to bake funnel into the AMI.

@delagoya

delagoya commented Jul 9, 2018 via email

@buchanae
Contributor Author

buchanae commented Jul 9, 2018

> You mean the Funnel server in the AMI, and have a container task contact it for work?

Something like that. Or, maybe give the worker an internal REST endpoint(s) so the server can be cut out.

buchanae force-pushed the worker-from-file branch from 76d0da1 to bf72c62 on July 9, 2018 22:08
buchanae requested a review from adamstruck on July 9, 2018 22:37
adamstruck (Contributor) left a comment

Tests please.

buchanae requested a review from adamstruck on July 10, 2018 17:13
adamstruck merged commit aa85fe8 into ohsu-comp-bio:master on Jul 10, 2018
@buchanae
Contributor Author

@delagoya Just curious, did this work out for you in the end?

@delagoya

Worked great! I especially like the addition of the base64-encoded task string. It removes the need to write out a file before invoking the worker, which means we can use the standard Docker container you publish.
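Passing the task inline works because a base64 string contains only characters that pass through command-line arguments and environment variables unchanged, so no task file has to exist inside the container. The thread doesn't name the flag that accepts the string, so none is assumed here; the sketch only shows the encode/decode round trip, with a hypothetical minimal task struct and helper names.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// task is a hypothetical, minimal task shape used only for this example.
type task struct {
	Name      string `json:"name"`
	Executors []struct {
		Image   string   `json:"image"`
		Command []string `json:"command"`
	} `json:"executors"`
}

// encodeTask turns a JSON task document into a single base64 token.
func encodeTask(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// decodeTask reverses encodeTask and parses the resulting JSON.
func decodeTask(s string) (*task, error) {
	raw, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return nil, fmt.Errorf("decoding base64 task: %w", err)
	}
	var t task
	if err := json.Unmarshal(raw, &t); err != nil {
		return nil, fmt.Errorf("parsing task JSON: %w", err)
	}
	return &t, nil
}

func main() {
	doc := []byte(`{"name":"hello","executors":[{"image":"alpine","command":["echo","hello world"]}]}`)
	token := encodeTask(doc)
	fmt.Println(token)
	t, err := decodeTask(token)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(t.Name, t.Executors[0].Command)
}
```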
