Skip to content

Conversation

hildebrandecon
Copy link
Contributor

@hildebrandecon hildebrandecon commented Jan 14, 2022

Closes #1 as far as I can tell: I adjusted the tests in

  • test_parallel.py and
  • test_parametrize.py.

Now, the run through.

I also adjusted the ReadMe.

@hildebrandecon
Copy link
Contributor Author

I think this should be fine, however, I couldn't figure out (so far) how exactly julia treats wrong cmd line arguments.

@codecov
Copy link

codecov bot commented Jan 14, 2022

Codecov Report

Merging #2 (c99b27c) into main (2b15b07) will increase coverage by 29.47%.
The diff coverage is 100.00%.

❗ Current head c99b27c differs from pull request most recent head 4ff5c48. Consider uploading reports for the commit 4ff5c48 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##             main       #2       +/-   ##
===========================================
+ Coverage   69.06%   98.54%   +29.47%     
===========================================
  Files          13       13               
  Lines         320      343       +23     
===========================================
+ Hits          221      338      +117     
+ Misses         99        5       -94     
Flag Coverage Δ
end_to_end 85.42% <91.07%> (?)
unit 67.93% <44.64%> (-1.14%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/pytask_julia/execute.py 100.00% <ø> (ø)
src/pytask_julia/collect.py 100.00% <100.00%> (+3.77%) ⬆️
tests/test_collect.py 98.61% <100.00%> (+0.01%) ⬆️
tests/test_execute.py 100.00% <100.00%> (+48.07%) ⬆️
tests/test_parallel.py 96.15% <100.00%> (+65.38%) ⬆️
tests/test_parametrize.py 100.00% <100.00%> (+63.33%) ⬆️
tests/conftest.py 100.00% <0.00%> (+14.28%) ⬆️
tests/test_config.py 100.00% <0.00%> (+33.33%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b15b07...4ff5c48. Read the comment docs.

Copy link
Contributor

@hmgaudecker hmgaudecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Just added some pausing for the parallel tasks, those failed on my machine.

I was wondering, however, whether we should use the great momentum right away to implement dict-style passing of arguments as well?

I could see either using argparse or some standardized (hidden) json file. The latter might be more flexible and actually make it easier to run from editors / ides?

Something like

.pytask.[task_file].[task_name].[id].json

with contents:

{
    "depends_on": ["[dict]", "[list]", "[single file]", null],
    "produces": ["[dict]", "[list]", "[single file]", null],
    "other_keys": "[whatever]"
}

@hmgaudecker
Copy link
Contributor

image

I understand this is a free service, but are those i386 machines? 😆

@tobiasraabe
Copy link
Member

@hmgaudecker If one test occasionally fails I have no problem with that. Same in the other repos but it is really not that likely.

@tobiasraabe
Copy link
Member

tobiasraabe commented Jan 14, 2022

@hildebrandecon Please, follow the link of the failing the pre-commit ci step and see whether you can fix the errors to make the pipeline pass.

@hmgaudecker
Copy link
Contributor

@tobiasraabe, what do you think about implementing passing of options via json, see above.

Also, we may want to check that command line options recognized by Julia are parsed correctly. Something like pass number of threads and check within Julia that the number is correct? Once with 1, once with 4?

@tobiasraabe
Copy link
Member

tobiasraabe commented Jan 14, 2022

@hmgaudecker Just for consistency and finishing stuff, I would advocate for merging this PR first to align with the other plugins. We will release it as v0.1.0.

After that, and with a potential v0.2.0 release the interface can be reworked. I am nor sure I understand how you want to use the json. Do you want to be able to define tasks in json files?

@hmgaudecker
Copy link
Contributor

Sounds great.

No, I would just want to use it to pass the dependencies and targets (and other options) in a way that resembles Python tasks. Right now, you can only pass arguments by order (=2 lists and on Julia/R side impossible to discriminate between deps/targets without knowing the mapping of list elements to files). Very error prone.

@hildebrandecon
Copy link
Contributor Author

hildebrandecon commented Jan 14, 2022

@hildebrandecon Please, follow the link of the failing the pre-commit ci step and see whether you can fix the errors to make the pipeline pass.

@tobiasraabe I managed to resolve some issues. I'm not sure how to fix/treat the remaining ones, unfortunately.

@hmgaudecker
Copy link
Contributor

Just fix by ignoring (although I left that explicitly out from the course 😄)

Also, we may want to check that command line options recognized by Julia are parsed correctly. Something like pass number of threads and check within Julia that the number is correct? Once with 1, once with 4?

@hildebrandecon @mbrambeer Could you draft a test for this? That is, pass explicitly to Julia that it should use only 1 thread at most, then check inside Julia that this is caught correctly. Repeat with 4. Test should include depends_on/produces because the order mattered in our experiments yesterday.

I think this is more important than checking what happens with wrong options. Not really pytask's Job to check whether the user typed stuff correctly (it would be nice, of course, but could also become a maintenance burden).

@tobiasraabe
Copy link
Member

tobiasraabe commented Jan 15, 2022

Sounds great.

No, I would just want to use it to pass the dependencies and targets (and other options) in a way that resembles Python tasks. Right now, you can only pass arguments by order (=2 lists and on Julia/R side impossible to discriminate between deps/targets without knowing the mapping of list elements to files). Very error prone.

Yes, makes sense. To be more bold, you could pass the json data via the command line as an argument and parsing the string inside the program to get rid of the .pytask.....json file.

@hmgaudecker
Copy link
Contributor

In principle yes, but the nice thing about the file (also relative to command line options) is that it makes it easy to run Julia/R from editors/IDEs/any way other than pytask.

@hildebrandecon
Copy link
Contributor Author

Just fix by ignoring (although I left that explicitly out from the course 😄)

Also, we may want to check that command line options recognized by Julia are parsed correctly. Something like pass number of threads and check within Julia that the number is correct? Once with 1, once with 4?

@hildebrandecon @mbrambeer Could you draft a test for this? That is, pass explicitly to Julia that it should use only 1 thread at most, then check inside Julia that this is caught correctly. Repeat with 4. Test should include depends_on/produces because the order mattered in our experiments yesterday.

I think this is more important than checking what happens with wrong options. Not really pytask's Job to check whether the user typed stuff correctly (it would be nice, of course, but could also become a maintenance burden).

@hmgaudecker I must say that I don't understand what exactly you have in mind, probably because I'm not as an IT expert. However, here would be my rough idea what you would want to do:

  • we can tell julia how many threads to use via julia -t nthreads, where nthreads is the number of threads.
  • then, we can check within julia that all went well with Threads.nthreads() which should evaluate to nthreads.
  • apart from that, I'm not sure how to continue, I think because I don't have a clear idea how I can check the 2nd step within pytask.

@tobiasraabe
Copy link
Member

Just add something to the script which makes it fail if nthreads does not have the desired number. In Python you would use the assert statement or raising an exception. Both make the script fail with nonzero exit codes which pytask will detect and mark the task as failed.

@hmgaudecker
Copy link
Contributor

Yip, that's precisely what I had in mind!

@hildebrandecon
Copy link
Contributor Author

Ok, I wrote a draft but in fact, it doesn't work. The test should fail since in line 141 we pass 4 threads. If I run the same stuff in the cmd line, it in fact works.

Replication:

  1. create a file script.jl that contains write("out.txt", "So, so you think you can tell heaven from hell?") \n @assert Threads.nthreads() == 1
  2. in cmd line, run julia -t 1 script.jl: works
  3. in cmd line, run julia -t 4 script.jl: gives error
  4. note: if we run julia script.jl -t 4, we don't get an error (this is the order-sensitivity that we expected...)

@hildebrandecon
Copy link
Contributor Author

It seems that one has to pass arguments regarding Julia startup before the script but arguments used in the function itself after the script, replication:

script.jl contains:

for arg in ARGS
    println(arg)
end
@assert Threads.nthreads() == 1

In the shell, run: julia --threads 1 script.jl "hello1" "hello2" "hello3" without error but julia --threads 4 script.jl "hello1" "hello2" "hello3" with error. Putting --threads 1 at the end yields that it is printed too.

@hmgaudecker
Copy link
Contributor

Thanks! I would hope writing the test is enough, @tobiasraabe has too much of an advantage of fixing the call from pytask for us to worry much about it 😉

However, can you parametrize the test with both 1 and 4 threads? I think it defaults to 4 on a machine with 4 cores (I might misremember) so that would be just to avoid random passes based on the machine type.

Comment on lines 156 to 157
os.chdir(tmp_path)
session = main({"paths": tmp_path})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You dont need to switch paths and better use the runner patter from here

result = runner.invoke(cli, [tmp_path.as_posix()])

@tobiasraabe
Copy link
Member

I don't really have a good idea. I feel like this forces us to replicate the command line interface and understand whether options are flags, accept and input, accept multiple inputs and what are arguments to the script, split them and place them to the left or right of the script.

One approach to circumvent the replication would be to ask users to add -- when they are adding flags/options and arguments to the script in the decorator as a separator. For example, @pytask.mark.julia(("--threads", "1", "--", "0")). -- is also a valid separator on the command line to do that. Then we could split arguments at the double-dash, add the script just before the separator and recreate the command line string again.

@hmgaudecker
Copy link
Contributor

Ah, sorry, I thought you had some sense or order in there. My understanding is that all you'd need to do is to swap the contents of mark.julia(...) with whatever is in depends_on / produces? "Options first" seems a good enough understanding for how Julia parses the command line, it seems to me...

@tobiasraabe
Copy link
Member

Unfortunately, it is not enough to match the content of the decorator with depends_on and produces since you could also pass something like seeds to the script. And those other values could be related to a preceding command line option or not. I don't want to resolve something like this

--threads 1  # belongs together.
--verbose 1  # does not belong together and 1 should be passed to the script.

Ok, plan for the implementation.

  1. A decorator with no content will default to subprocess.run(("julia", "--", "script.jl")).
  2. If content is provided, I would suggest that the double dash is required and we raise an error with a helpful error message if not. I think it will avoid confusion which justifies entering -- from time to time.

Everyone agrees?

@hmgaudecker
Copy link
Contributor

Okay, talking past each other here.

As per our JSON discussion, I agree that pytask should not worry about resolving stuff.

I would say the pytask.mark("...") content is exclusively reserved for options passed to Julia as opposed to the script. pytask just needs to ensure this ends up before script.jl in subprocess.run(("julia", "--", "script.jl")), contrary to what is currently happening.

Anything else will be solved by different means in a different PR.

@hmgaudecker
Copy link
Contributor

hmgaudecker commented Jan 17, 2022

contrary to what is currently happening.

That is, what is happening based on what @hildebrandecon posted above, I did not check in detail and it might not catch the option for other reasons.

@tobiasraabe
Copy link
Member

Okay, talking past each other here.

As per our JSON discussion, I agree that pytask should not worry about resolving stuff.

I would say the pytask.mark("...") content is exclusively reserved for options passed to Julia as opposed to the script. pytask just needs to ensure this ends up before script.jl in subprocess.run(("julia", "--", "script.jl")), contrary to what is currently happening.

Anything else will be solved by different means in a different PR.

This would mean that v0.1.0 does not support all functionality. It does not allow users to pass any arguments to script as it is possible with the other plugins. We could pass depends on and produces by default to the script and the user can optionally parse them or not. But, something like passing a seed to a script would not work.

I see the json approach as an alternative interface and a way to move forward, but I would prefer something fully working before that.

@hmgaudecker
Copy link
Contributor

Fair enough, I cannot delve deeply enough into it. So your view is to not pass any options to Julia, just to scripts?

@hildebrandecon
Copy link
Contributor Author

I think I cannot follow the discussion entirely, given that I don't understand most of the pytask details. Hence, I'm also not sure whether I can give any educated opinion about what best/not to do. Of course, let me know if I can be of help!

@tobiasraabe
Copy link
Member

tobiasraabe commented Jan 18, 2022

I am sorry. I feel like I have not expressed myself clearly enough and caused some unnecessary confusion. Let me try it once more (although rather lengthy).

The decorator @pytask.mark.julia or @pytask.mark.r are used to pass options (like --threads) to the executable and arguments (like a seed or paths from dependencies or products) to the script.

@pytask.mark.r(("input.md", "1", "--verbose"))
@pytask.mark.depends_on(["script.R", "input.md"])
@pytask.mark.produces("out.md")
def task_write_number_with_r_to_file():
    pass

This way, although highly ugly, the R scripts receives the path to the markdown file and a number which can be written to the output file.

We also pass the --verbose flag which is just representative of any option which can be passed to Rscript.

The main point here is that pytask-r can just take the tuple in the decorator and pass it on the command line to Rscript like

$ Rscript script.R input.md 1 --verbose

With julia this is not possible since a flag like --verbose or --threads need to be put on the command line after the executable julia and before the script.

This means pytask-julia needs to know whether something is an option for the executable or input to the script.

As a super lightweight workaround and ugly hack, I propose to require users, if they use the @pytask.mark.julia decorator, to separate options to the executable and inputs to the script via --. Taking the example from above, this would look like this:

@pytask.mark.julia(("--verbose", "--", "input.md", "1"))
@pytask.mark.depends_on(["script.jl", "input.md"])
@pytask.mark.produces("out.md")
def task_write_number_with_julia_to_file():
    pass

This way pytask-julia knows how to build the input to subprocess.run. Put everything before -- right after the executable and before adding the path to the script taking from the first dependency. Then, add everything to the right side of --.

$ julia --verbose -- script.jl input.md 1

I now also want to make it mandatory to always specify -- if the user passes any values to @pytask.mark.julia so that it is always clear whether input refers to the executable or script. Two examples

@pytask.mark.julia(("--", "input.md"))  # to pass a path to the script

@pytask.mark.julia(("--threads", "4", "--"))  # to enable 4 threads while running the program.

The reason why I believe -- should be mandatory is that it is not clear looking at the two cases above whether one case could omit the -- because it would feel more natural to pass something to the executable or to the script.

I know this is an extremely dirty hack and I am deeply unsatisfied with the situation. It is not an excuse to say waf was equally bad. We need to circumvent the positional indexing and get something label-based soon. But, in this PR, I just want to make pytask-julia as powerful as the other plugins so we have a common base to start from.

@hildebrandecon
Copy link
Contributor Author

thanks @tobiasraabe for the extensive explanation, I think your arguments make perfect sense!

@hmgaudecker
Copy link
Contributor

Sounds great, thanks for the detailed explanation.

I think my view is to just not spend too much time on this right now. There will be few users and even less power users for the moment.

Looking back at Waf and the current state of pytask-{R, Stata, Julia), I do think that requiring command-line options is a major inhibitor for take-up. For the typical (student) data cruncher, the most important feature is that starting from an IDE is seamless and the same as starting from pytask. But that's for another discussion!

@tobiasraabe
Copy link
Member

@hildebrandecon @hmgaudecker Hi, I implemented the described approach. The tests checking the number of threads inside the script works as well.

Please review the readme and test whether the package works for you.

@tobiasraabe tobiasraabe added this to the v0.1.0 milestone Jan 18, 2022
@hildebrandecon
Copy link
Contributor Author

hildebrandecon commented Jan 18, 2022

@tobiasraabe tests run smoothly on my machine & I think the readme is in line with all changes. Thanks a lot!

Copy link
Contributor

@hmgaudecker hmgaudecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a couple of comments on documenting things and a suggestion for another trivial test.

def test_check_passing_cmd_line_options(tmp_path):
task_source = """
import pytask
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be doubled up with a second number, different from 4. Just to avoid that whatever machine it is run on, it passes for reasons of the default happening to be 4.

@hildebrandecon
Copy link
Contributor Author

I extended the cmd line option test as @hmgaudecker has proposed. I tried to do it with parametrize at first but I'm not sure whether one can check the session code separately then.

@tobiasraabe tobiasraabe merged commit e47e8db into main Jan 19, 2022
@tobiasraabe tobiasraabe deleted the dev_seb branch January 19, 2022 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Several tests are not working.

3 participants