NEW: Adding first cut of output handling support. #12

prabhuramachandran · 2015-06-22T06:02:40Z

This is mainly for review and is a first implementation. In particular, I am not too convinced about name of the classes.

The PR adds an Output and a CollectOutputs class that help gathering
outputs generated by a Launcher. With this, the current example in the documentation becomes:

import lancet
# [...]
lancet.Launcher(example_name, integers, factor_cmd, output_directory='output')()
# Run the launcher several times if needed.

# Here is the new code to analyze the data.
from lancet import output
collect = output.CollectOutputs('output')
latest = collect.get_latest()

factors = [open(path).read().replace(':', '').split() for path in latest.stdout]
primes = sorted(f[0] for f in data if f[0] == f[1])

CollectOutputs is very simple. The Output class has a convenient do_expansion method that is convenient when using LongFilenames. For example if one were dumping output files using it,
one could simply do:

 lf = lancet.ShellCommand.LongFilename('.png')
 output_files = latest.do_expansion(lf)

Then the list of output files (in the same sequence as the tasks and specs) can be processed with any Python code.

This adds an `Output` and a `CollectOutputs` class that help gathering outputs generated by a Launcher.

jlstevens · 2015-06-22T15:24:25Z

I have finally got around to having a look!

I've run the new example and my initial impressions are favorable: having things like the info, the log and the tids available as attributes is certainly very nice.

Some comments:

I am not sure we need two classes here. Instead, I would be tempted to have a single class Output that has an update method you call when you want to update the state.
I would make Output a parameterized class with a single parameter, probably called output_dir (i.e inherit from param.Parameterized and not object).
Calling get_latest returns the last output in self.outputs but this attribute is initialized in the constructor of CollectOutputs. It will therefore fail to find new directories unless you also create a new CollectOutputs instance.
If this were implemented as a single class, you would still want the various information (tids, stderr, stdout etc) per path found in the output directory. For this, I would consider using a namedtuple instead of defining a new class for holding the extracted information.
As only the output directory is needed, this could indeed be the return type after launching which could make it even more convenient.
We will want some nice, sensible repr for the class(es).

This is a great starting point and I can see the utility of such a system. Here, I'll show how I imagine improving it:

>>> output = lancet.Launcher(example_name, integers, factor_cmd, 
                             output_directory='output')()
>>> output = lancet.Output('output')  # Equivalent to the return value above
>>> output.update()                   # Capture the current state of the filesystem
>>> len(output)                       # Length two because previous run exists.
2
>>> output.paths
['2015-06-22_1356-prime_quintuplet', '2015-06-22_1538-prime_quintuplet']
>>> '2015-06-22_1356-prime_quintuplet' in output
True
# Accessing tid field of named tuple for this run...
>>> output['2015-06-22_1356-prime_quintuplet'].tids 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
# The output directory of the most recent run (call it last_run maybe?)
>>> output.last.output_dir   
'output/2015-06-22_1356-prime_quintuplet'
# No need to show much else as contents is mutable (filesystem) state
>>> print repr(output)
Output(output_directory='output')

I think this design would cover everything you've suggested except do_expansion. I think, this could be the one optional parameter of Output, making it more declarative as follows:

>>> output = Output('output', expansions={'full_filename':ShellCommand.LongFilename})
# Additional field added to the named tuple using the expansion.
>>> output.last.long_filename
'/home/user/.../output/2015-06-22_1356-prime_quintuplet'

What do you think?

I am happy to immediately implement my suggestion if you like this updated proposal. I could commit such a class directly to master but then your contribution won't be properly recorded. Alternatively, maybe I could commit to your branch to update this PR - or submit a PR on your PR?!

Of course, you could have a go implementing this yourself, but if you are happy with my suggestions above, I think I could get it implemented and merged with master very quickly.

prabhuramachandran · 2015-06-22T20:49:42Z

Thanks for the quick review. I am happy you like the general idea. Here are my comments:

I have not used param for the first cut as I am not familiar with it and thought that can be easily done at the end if the basic design is OK.
I have also been tempted to make it all one class but I found that to be a bit ugly and not very extensible.
The idea of using a namedtuple seems excellent but my gut tells me it isn't too extensible. Thinking more about it, it makes a lot of sense as our output attributes are really immutable. Let me think more about that. In any case, what would the name of this namedtuple be OutputData or LaunchData or RunData or LaunchInfo??
The one thing I do not like about the proposed (and current interface) is the reference to the actual paths of the output which is very error prone and ugly and quite a pain to type. I find this ugly. As a user I don't really care about the gory details of the name. A sequence of runs for which I can get information is perfect. For example in your example you have:

>>> output.paths
['2015-06-22_1356-prime_quintuplet', '2015-06-22_1538-prime_quintuplet']
>>> '2015-06-22_1356-prime_quintuplet' in output
True
# Accessing tid field of named tuple for this run...
>>> output['2015-06-22_1356-prime_quintuplet'].tids

I don't like this at all. I'd rather have the paths be an attribute available like everything else, perhaps runs would be a better name as it really describes what it really is? In any case, what if it looked like this:

>>> output = lancet.Output('output')  # Equivalent to the return value above
>>> output.update()                   # Capture the current state of the filesystem
>>> len(output)
2
>>> output[0]
OutputData(tids=[...], specs=[...], ...)
>>> for tid, spec in zip(output[0].tids, output[0].specs):
...    print tid, spec
>>> output.paths
[...]
>>> output.last # same as output[-1] and can simply be a property.

Updating the state is not critical to me as the class is very lightweight but yes, it is useful to have that method (especially if this is used on a notebook) so I can easily add that.
CollectOutputs is definitely not very clean at all and should really have an update method if at all it will exist
Yes, I agree it can be returned by the Launcher -- that was my original idea but I like that the Output is reusable outside of the Launcher, which is perfect for post-processing any runs.

Long discussions like this on github are painful as I cannot quote your points. Can we do this on email? Or should I just reply to the gh email I receive?

jlstevens · 2015-06-22T21:17:58Z

Long discussions like this on github are painful as I cannot quote your points. Can we do this on email? Or should I just reply to the gh email I receive?

You can just reply to the e-mail if that is more convenient for you. I'll keep the conversation on GitHub for now and simply make use the markdown quote syntax above.

I have not used param for the first cut as I am not familiar with it and thought that can be easily done at the end if the basic design is OK.

Yes, adding param is no problem at all.

I have also been tempted to make it all one class but I found that to be a bit ugly and not very extensible.

I think we would need some very clear ideas as to how we want to extend these ideas in future before the introduction of two new classes is really justified...

In any case, what would the name of this namedtuple be OutputData or LaunchData or RunData or LaunchInfo??

I think I would consider OutputInfo ('data' suggests it is the meat of the output which won't always be the case).

The one thing I do not like about the proposed (and current interface) is the reference to the actual paths of the output which is very error prone and ugly and quite a pain to type.

I mostly agree and that is why I proposed the .last property to make it easy to access the most recent run.

I don't like this at all. I'd rather have the paths be an attribute available like everything else, perhaps runs would be a better name as it really describes what it really is?

Again I agree with you here ('runs' may also be a better name). The integer indexing approach is definitely a better idea than the strings and would work nicely in conjunction with len. Working with integer run numbers would allow you to use range(len(output)) to index any available information over all the runs.

The only thing to be careful about is the ordering of the runs - in the simplest case this could be based on a simple alphanumerical sort over the run names (this would also be a chronological sort using the default timestamp format). However, I think there is better way: the launch timestamp is explicitly recorded in the info file so it would probably make a lot of sense to use that.

This way, you would be certain that output[0] is the oldest run, output[-1] is the most recent run and everything in between is also in chronological order.

Lastly, I will note that the integer scheme is not exclusive with indexing via explicit path name - __getitem__ could support both integer (list-like) and string-like (dictionary like) indexing.

prabhuramachandran · 2015-06-23T05:59:32Z

You can just reply to the e-mail if that is more convenient for you. I'll keep the conversation on GitHub for now and simply make use the markdown quote syntax above.

I have also been tempted to make it all one class but I found that to be a bit ugly and not very extensible.

I think we would need some very clear ideas as to how we want to extend these ideas in future before the introduction of two new classes is really justified...

In any case, what would the name of this namedtuple be OutputData or LaunchData or RunData or LaunchInfo??

I think I would consider OutputInfo ('data' suggests it is the meat of the output which won't always be the case).

Alright, so technically there are two objects, Output and OutputInfo but the latter is lightweight. I like this so will modify it that way when I get the chance either today or tomorrow.

The one thing I do not like about the proposed (and current interface) is the reference to the actual paths of the output which is very error prone and ugly and quite a pain to type.

I mostly agree and that is why I proposed the .last property to make it easy to access the most recent run.

I don't like this at all. I'd rather have the paths be an attribute available like everything else, perhaps runs would be a better name as it really describes what it really is?

Again I agree with you here ('runs' may also be a better name). The integer indexing approach is definitely a better idea than the strings and would work nicely in conjunction with len. Working with integer run numbers would allow you to use range(len(output)) to index any available information over all the runs.

Right, so I will try the names and see how this works.

The only thing to be careful about is the ordering of the runs - in the simplest case this could be based on a simple alphanumerical sort over the run names (this would also be a chronological sort using the default timestamp format). However, I think there is better way: the launch timestamp is explicitly recorded in the info file so it would probably make a lot of sense to use that.

Well but the timestamps of the directory are already sorted according to the timestamp and I think it is safe to assume that the output of one launch corresponds to one problem (or is that a dangerous assumption?) I certainly think it is reasonable and will use that for the time being.

This way, you would be certain that output[0] is the oldest run, output[-1] is the most recent run and everything in between is also in chronological order.

Yes, we should get this as it currently stands and can fix bugs if we see them.

Lastly, I will note that the integer scheme is not exclusive with indexing via explicit path name - getitem could support both integer (list-like) and string-like (dictionary like) indexing.

Indeed and we could this also a bit later and just go with integers for now.

As regards the expansions, I think there may be a misunderstanding, in my use case LongFilenames are different for each invocation of the command. As a result the OutputInfo will have additional attributes corresponding to the expansions.

In summary I'll make the following changes when I am able:

One Output class which contains OutputInfo namedtuples for each particular "run".
OutputInfo has additional attributes for any expansions by default, tids, specs, output_dir, stdout, stderr, log, info and then any extra expansion keys.
Access to each run info via integers along with the last one.
Access to the paths for each "run" of the Launcher
Try to parametrize the class.

Have I missed anything?

jlstevens · 2015-06-23T12:09:51Z

Your summary sounds good and I look forward to seeing the updated code!

A couple of quick comments:

I agree that an alphanumeric sort is sufficient, at least for now. That said, I think there is another neat and easy trick we can use! If timestamp is in the first entry in the named tuples (fairly important information in my opinion!), then simply sorting the list of named tuples would always be guaranteed to get you a chronological order. I suspect this might even be easier that trying to get them in order based on the alphanumeric ordering of the launch directory names...
I wouldn't bother implementing the last property if the integer indexing work (output[-1] would be the same as .last).
In the end, I think we should probably call each index a launch instead of a run as I think this would be more consistent with the existing terminology we have used. I would rather not have something that sounds like a new concept when there really isn't one!
Related to the above, the named tuple should probably be called LaunchInfo because the tuple is containing part of the output captured by Output as opposed to all the output.

Other than these minor points, I think this will be a useful class to introduce without requiring much more effort to implement. Great!

- One `Output` class which contains `LaunchInfo` namedtuples for each particular "launch". - `OutputInfo` has additional attributes for any expansions by default, `timestamp, path tids, specs, stdout, stderr, log, info` and then any extra expansion keys. - Access to each launch info via integers via `__getitem__`. - Ability to iterate over the output object. - Parametrize the class. - Remove `CollectOutputs` class.

prabhuramachandran · 2015-06-23T20:30:35Z

I've implemented everything above but want to write some tests. I'll do that tomorrow, feel free to take a look.

jlstevens · 2015-06-23T20:41:03Z

At a glance it looks great!

Two immediate comments:

I would remove lines like this one as we don't have anything else like it in the code:

##### object Protocol ####################################################

I think I would be happy to merge this into launch.py because 1) Launcher could return such an object 2) I don't think this class is too long or really justifies a new file.

I'll have a closer look shortly, but I think it is pretty much what I hoped/expected.

jlstevens · 2015-06-23T20:42:45Z

Travis is failing due to some whitespace issue in the doctest (I think).

jlstevens · 2015-06-23T20:44:29Z

lancet/output.py

+from glob import glob
+import json
+import os
+from os.path import isdir, join, splitext


I think I would generally prefer just to import os and fully qualify as appropriate...

Umm, then we should technically import os.path many many years ago this used to be a problem but maybe I am just getting old. This SO article is not conclusive:

http://stackoverflow.com/questions/2724348/should-i-use-import-os-path-or-import-os

I think it comes down to personal taste - personally I've gotten so used to typing os.path that it has become automatic. My recommendation here is not because it is necessarily the best approach overall but because this is how it was done everywhere else in Lancet. Changing these imports throughout is something I would consider for a separate PR...

prabhuramachandran · 2015-06-23T20:45:47Z

I'll look at this tomorrow. I like the protocol lines as it makes code a lot easier to follow for me (and I am used to it) but it isn't consistent with your code so can remove it. I'm happy to merge it into launch.py although I think that file is quite big now at close to 1000 lines! I will let you modify the launchers. I'll fix travis tomorrow.

jlstevens · 2015-06-23T20:46:33Z

lancet/output.py

+            if isdir(full_path):
+                launches.append(self._get_launch_info(full_path))
+        launches.sort()
+        self.launches = launches


I would probably do this instead:

self.launches = sorted(launches)

Makes very little difference but it does save one line...

jlstevens · 2015-06-23T20:51:23Z

Ok, I've only made two comments in the code and otherwise I think it is ready to merge (into launch.py) as soon as Travis is happy!

I think I would put this just before the Launchers:

#===============#
# Output Helper #
#===============#

# <Your code here>

#===========#
# Launchers #
#===========#

Thanks for doing this! I'll return a suitable Output instance from the launchers after the merge...

prabhuramachandran · 2015-06-23T21:30:25Z

OK, I have made the requested changes. I haven't added any tests and am not sure I will have time to add them soon but am happy to do that if needed.

prabhuramachandran · 2015-06-24T08:24:08Z

I've also pushed a couple of simple tests now, if Travis is happy this is now RTM.

jlstevens · 2015-06-24T14:25:36Z

Looks good to me! Marco says he might have a few comments that he will add here later...

prabhuramachandran · 2015-06-24T17:27:24Z

OK, can't those be addressed in a separate PR?

jlstevens · 2015-06-24T19:12:36Z

Ok, I've quickly gone through it with Marco and he is also satisfied.

I'll go ahead and merge it now. Thanks again for your contribution!

Introduced Output class to conveniently collect information across launches

NEW: Adding first cut of output handling support.

76d8d69

This adds an `Output` and a `CollectOutputs` class that help gathering outputs generated by a Launcher.

prabhuramachandran mentioned this pull request Jun 22, 2015

Launcher.__call__ should return information on the output #9

Open

jlstevens reviewed Jun 23, 2015
View reviewed changes

prabhuramachandran added 2 commits June 24, 2015 02:53

More PR comment fixes and fix Travis.

f16883d

Merge outputs.py into launch.py.

7a895ae

Adding tests for Output.

b733a49

jlstevens added a commit that referenced this pull request Jun 24, 2015

Merge pull request #12 from prabhuramachandran/collect-outputs

09c8fd9

Introduced Output class to conveniently collect information across launches

jlstevens merged commit 09c8fd9 into ioam:master Jun 24, 2015

prabhuramachandran deleted the collect-outputs branch June 25, 2015 17:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NEW: Adding first cut of output handling support. #12

NEW: Adding first cut of output handling support. #12

prabhuramachandran commented Jun 22, 2015

jlstevens commented Jun 22, 2015

prabhuramachandran commented Jun 22, 2015

jlstevens commented Jun 22, 2015

prabhuramachandran commented Jun 23, 2015

jlstevens commented Jun 23, 2015

prabhuramachandran commented Jun 23, 2015

jlstevens commented Jun 23, 2015

jlstevens commented Jun 23, 2015

jlstevens Jun 23, 2015

prabhuramachandran Jun 23, 2015

jlstevens Jun 23, 2015

prabhuramachandran commented Jun 23, 2015

jlstevens Jun 23, 2015

jlstevens commented Jun 23, 2015

prabhuramachandran commented Jun 23, 2015

prabhuramachandran commented Jun 24, 2015

jlstevens commented Jun 24, 2015

prabhuramachandran commented Jun 24, 2015

jlstevens commented Jun 24, 2015

NEW: Adding first cut of output handling support. #12

NEW: Adding first cut of output handling support. #12

Conversation

prabhuramachandran commented Jun 22, 2015

jlstevens commented Jun 22, 2015

prabhuramachandran commented Jun 22, 2015

jlstevens commented Jun 22, 2015

prabhuramachandran commented Jun 23, 2015

jlstevens commented Jun 23, 2015

prabhuramachandran commented Jun 23, 2015

jlstevens commented Jun 23, 2015

jlstevens commented Jun 23, 2015

jlstevens Jun 23, 2015

Choose a reason for hiding this comment

prabhuramachandran Jun 23, 2015

Choose a reason for hiding this comment

jlstevens Jun 23, 2015

Choose a reason for hiding this comment

prabhuramachandran commented Jun 23, 2015

jlstevens Jun 23, 2015

Choose a reason for hiding this comment

jlstevens commented Jun 23, 2015

prabhuramachandran commented Jun 23, 2015

prabhuramachandran commented Jun 24, 2015

jlstevens commented Jun 24, 2015

prabhuramachandran commented Jun 24, 2015

jlstevens commented Jun 24, 2015