Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEW: Adding first cut of output handling support. #12

Merged
merged 5 commits into from
Jun 24, 2015

Conversation

prabhuramachandran
Copy link
Contributor

This is mainly for review and is a first implementation. In particular, I am not too convinced about name of the classes.

The PR adds an Output and a CollectOutputs class that help gathering
outputs generated by a Launcher. With this, the current example in the documentation becomes:

import lancet
# [...]
lancet.Launcher(example_name, integers, factor_cmd, output_directory='output')()
# Run the launcher several times if needed.

# Here is the new code to analyze the data.
from lancet import output
collect = output.CollectOutputs('output')
latest = collect.get_latest()

factors = [open(path).read().replace(':', '').split() for path in latest.stdout]
primes = sorted(f[0] for f in data if f[0] == f[1])

CollectOutputs is very simple. The Output class has a convenient do_expansion method that is convenient when using LongFilenames. For example if one were dumping output files using it,
one could simply do:

 lf = lancet.ShellCommand.LongFilename('.png')
 output_files = latest.do_expansion(lf)

Then the list of output files (in the same sequence as the tasks and specs) can be processed with any Python code.

This adds an `Output` and a `CollectOutputs` class that help gathering
outputs generated by a Launcher.
@jlstevens
Copy link
Member

I have finally got around to having a look!

I've run the new example and my initial impressions are favorable: having things like the info, the log and the tids available as attributes is certainly very nice.

Some comments:

  • I am not sure we need two classes here. Instead, I would be tempted to have a single class Output that has an update method you call when you want to update the state.
  • I would make Output a parameterized class with a single parameter, probably called output_dir (i.e inherit from param.Parameterized and not object).
  • Calling get_latest returns the last output in self.outputs but this attribute is initialized in the constructor of CollectOutputs. It will therefore fail to find new directories unless you also create a new CollectOutputs instance.
  • If this were implemented as a single class, you would still want the various information (tids, stderr, stdout etc) per path found in the output directory. For this, I would consider using a namedtuple instead of defining a new class for holding the extracted information.
  • As only the output directory is needed, this could indeed be the return type after launching which could make it even more convenient.
  • We will want some nice, sensible repr for the class(es).

This is a great starting point and I can see the utility of such a system. Here, I'll show how I imagine improving it:

>>> output = lancet.Launcher(example_name, integers, factor_cmd, 
                             output_directory='output')()
>>> output = lancet.Output('output')  # Equivalent to the return value above
>>> output.update()                   # Capture the current state of the filesystem
>>> len(output)                       # Length two because previous run exists.
2
>>> output.paths
['2015-06-22_1356-prime_quintuplet', '2015-06-22_1538-prime_quintuplet']
>>> '2015-06-22_1356-prime_quintuplet' in output
True
# Accessing tid field of named tuple for this run...
>>> output['2015-06-22_1356-prime_quintuplet'].tids 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
# The output directory of the most recent run (call it last_run maybe?)
>>> output.last.output_dir   
'output/2015-06-22_1356-prime_quintuplet'
# No need to show much else as contents is mutable (filesystem) state
>>> print repr(output)
Output(output_directory='output')

I think this design would cover everything you've suggested except do_expansion. I think, this could be the one optional parameter of Output, making it more declarative as follows:

>>> output = Output('output', expansions={'full_filename':ShellCommand.LongFilename})
# Additional field added to the named tuple using the expansion.
>>> output.last.long_filename
'/home/user/.../output/2015-06-22_1356-prime_quintuplet'

What do you think?

I am happy to immediately implement my suggestion if you like this updated proposal. I could commit such a class directly to master but then your contribution won't be properly recorded. Alternatively, maybe I could commit to your branch to update this PR - or submit a PR on your PR?!

Of course, you could have a go implementing this yourself, but if you are happy with my suggestions above, I think I could get it implemented and merged with master very quickly.

@prabhuramachandran
Copy link
Contributor Author

Thanks for the quick review. I am happy you like the general idea. Here are my comments:

  • I have not used param for the first cut as I am not familiar with it and thought that can be easily done at the end if the basic design is OK.
  • I have also been tempted to make it all one class but I found that to be a bit ugly and not very extensible.
  • The idea of using a namedtuple seems excellent but my gut tells me it isn't too extensible. Thinking more about it, it makes a lot of sense as our output attributes are really immutable. Let me think more about that. In any case, what would the name of this namedtuple be OutputData or LaunchData or RunData or LaunchInfo??
  • The one thing I do not like about the proposed (and current interface) is the reference to the actual paths of the output which is very error prone and ugly and quite a pain to type. I find this ugly. As a user I don't really care about the gory details of the name. A sequence of runs for which I can get information is perfect. For example in your example you have:
>>> output.paths
['2015-06-22_1356-prime_quintuplet', '2015-06-22_1538-prime_quintuplet']
>>> '2015-06-22_1356-prime_quintuplet' in output
True
# Accessing tid field of named tuple for this run...
>>> output['2015-06-22_1356-prime_quintuplet'].tids 

I don't like this at all. I'd rather have the paths be an attribute available like everything else, perhaps runs would be a better name as it really describes what it really is? In any case, what if it looked like this:

>>> output = lancet.Output('output')  # Equivalent to the return value above
>>> output.update()                   # Capture the current state of the filesystem
>>> len(output)
2
>>> output[0]
OutputData(tids=[...], specs=[...], ...)
>>> for tid, spec in zip(output[0].tids, output[0].specs):
...    print tid, spec
>>> output.paths
[...]
>>> output.last # same as output[-1] and can simply be a property.
  • Updating the state is not critical to me as the class is very lightweight but yes, it is useful to have that method (especially if this is used on a notebook) so I can easily add that.
  • CollectOutputs is definitely not very clean at all and should really have an update method if at all it will exist
  • Yes, I agree it can be returned by the Launcher -- that was my original idea but I like that the Output is reusable outside of the Launcher, which is perfect for post-processing any runs.

Long discussions like this on github are painful as I cannot quote your points. Can we do this on email? Or should I just reply to the gh email I receive?

@jlstevens
Copy link
Member

Long discussions like this on github are painful as I cannot quote your points. Can we do this on email? Or should I just reply to the gh email I receive?

You can just reply to the e-mail if that is more convenient for you. I'll keep the conversation on GitHub for now and simply make use the markdown quote syntax above.

I have not used param for the first cut as I am not familiar with it and thought that can be easily done at the end if the basic design is OK.

Yes, adding param is no problem at all.

I have also been tempted to make it all one class but I found that to be a bit ugly and not very extensible.

I think we would need some very clear ideas as to how we want to extend these ideas in future before the introduction of two new classes is really justified...

In any case, what would the name of this namedtuple be OutputData or LaunchData or RunData or LaunchInfo??

I think I would consider OutputInfo ('data' suggests it is the meat of the output which won't always be the case).

The one thing I do not like about the proposed (and current interface) is the reference to the actual paths of the output which is very error prone and ugly and quite a pain to type.

I mostly agree and that is why I proposed the .last property to make it easy to access the most recent run.

I don't like this at all. I'd rather have the paths be an attribute available like everything else, perhaps runs would be a better name as it really describes what it really is?

Again I agree with you here ('runs' may also be a better name). The integer indexing approach is definitely a better idea than the strings and would work nicely in conjunction with len. Working with integer run numbers would allow you to use range(len(output)) to index any available information over all the runs.

The only thing to be careful about is the ordering of the runs - in the simplest case this could be based on a simple alphanumerical sort over the run names (this would also be a chronological sort using the default timestamp format). However, I think there is better way: the launch timestamp is explicitly recorded in the info file so it would probably make a lot of sense to use that.

This way, you would be certain that output[0] is the oldest run, output[-1] is the most recent run and everything in between is also in chronological order.

Lastly, I will note that the integer scheme is not exclusive with indexing via explicit path name - __getitem__ could support both integer (list-like) and string-like (dictionary like) indexing.

@prabhuramachandran
Copy link
Contributor Author

You can just reply to the e-mail if that is more convenient for you. I'll keep the conversation on GitHub for now and simply make use the markdown quote syntax above.

I have also been tempted to make it all one class but I found that to be a bit ugly and not very extensible.

I think we would need some very clear ideas as to how we want to extend these ideas in future before the introduction of two new classes is really justified...

In any case, what would the name of this namedtuple be OutputData or LaunchData or RunData or LaunchInfo??

I think I would consider OutputInfo ('data' suggests it is the meat of the output which won't always be the case).

Alright, so technically there are two objects, Output and OutputInfo but the latter is lightweight. I like this so will modify it that way when I get the chance either today or tomorrow.

The one thing I do not like about the proposed (and current interface) is the reference to the actual paths of the output which is very error prone and ugly and quite a pain to type.

I mostly agree and that is why I proposed the .last property to make it easy to access the most recent run.

I don't like this at all. I'd rather have the paths be an attribute available like everything else, perhaps runs would be a better name as it really describes what it really is?

Again I agree with you here ('runs' may also be a better name). The integer indexing approach is definitely a better idea than the strings and would work nicely in conjunction with len. Working with integer run numbers would allow you to use range(len(output)) to index any available information over all the runs.

Right, so I will try the names and see how this works.

The only thing to be careful about is the ordering of the runs - in the simplest case this could be based on a simple alphanumerical sort over the run names (this would also be a chronological sort using the default timestamp format). However, I think there is better way: the launch timestamp is explicitly recorded in the info file so it would probably make a lot of sense to use that.

Well but the timestamps of the directory are already sorted according to the timestamp and I think it is safe to assume that the output of one launch corresponds to one problem (or is that a dangerous assumption?) I certainly think it is reasonable and will use that for the time being.

This way, you would be certain that output[0] is the oldest run, output[-1] is the most recent run and everything in between is also in chronological order.

Yes, we should get this as it currently stands and can fix bugs if we see them.

Lastly, I will note that the integer scheme is not exclusive with indexing via explicit path name - getitem could support both integer (list-like) and string-like (dictionary like) indexing.

Indeed and we could this also a bit later and just go with integers for now.

As regards the expansions, I think there may be a misunderstanding, in my use case LongFilenames are different for each invocation of the command. As a result the OutputInfo will have additional attributes corresponding to the expansions.

In summary I'll make the following changes when I am able:

  • One Output class which contains OutputInfo namedtuples for each particular "run".
  • OutputInfo has additional attributes for any expansions by default, tids, specs, output_dir, stdout, stderr, log, info and then any extra expansion keys.
  • Access to each run info via integers along with the last one.
  • Access to the paths for each "run" of the Launcher
  • Try to parametrize the class.

Have I missed anything?

@jlstevens
Copy link
Member

Your summary sounds good and I look forward to seeing the updated code!

A couple of quick comments:

  • I agree that an alphanumeric sort is sufficient, at least for now. That said, I think there is another neat and easy trick we can use! If timestamp is in the first entry in the named tuples (fairly important information in my opinion!), then simply sorting the list of named tuples would always be guaranteed to get you a chronological order. I suspect this might even be easier that trying to get them in order based on the alphanumeric ordering of the launch directory names...
  • I wouldn't bother implementing the last property if the integer indexing work (output[-1] would be the same as .last).
  • In the end, I think we should probably call each index a launch instead of a run as I think this would be more consistent with the existing terminology we have used. I would rather not have something that sounds like a new concept when there really isn't one!
  • Related to the above, the named tuple should probably be called LaunchInfo because the tuple is containing part of the output captured by Output as opposed to all the output.

Other than these minor points, I think this will be a useful class to introduce without requiring much more effort to implement. Great!

- One `Output` class which contains `LaunchInfo` namedtuples for each
  particular "launch".
- `OutputInfo` has additional attributes for any expansions by default,
  `timestamp, path tids, specs, stdout, stderr, log, info` and
  then any extra expansion keys.
- Access to each launch info via integers via `__getitem__`.
- Ability to iterate over the output object.
- Parametrize the class.
- Remove `CollectOutputs` class.
@prabhuramachandran
Copy link
Contributor Author

I've implemented everything above but want to write some tests. I'll do that tomorrow, feel free to take a look.

@jlstevens
Copy link
Member

At a glance it looks great!

Two immediate comments:

  • I would remove lines like this one as we don't have anything else like it in the code:
##### object Protocol ####################################################
  • I think I would be happy to merge this into launch.py because 1) Launcher could return such an object 2) I don't think this class is too long or really justifies a new file.

I'll have a closer look shortly, but I think it is pretty much what I hoped/expected.

@jlstevens
Copy link
Member

Travis is failing due to some whitespace issue in the doctest (I think).

from glob import glob
import json
import os
from os.path import isdir, join, splitext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would generally prefer just to import os and fully qualify as appropriate...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, then we should technically import os.path many many years ago this used to be a problem but maybe I am just getting old. This SO article is not conclusive:

http://stackoverflow.com/questions/2724348/should-i-use-import-os-path-or-import-os

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it comes down to personal taste - personally I've gotten so used to typing os.path that it has become automatic. My recommendation here is not because it is necessarily the best approach overall but because this is how it was done everywhere else in Lancet. Changing these imports throughout is something I would consider for a separate PR...

@prabhuramachandran
Copy link
Contributor Author

I'll look at this tomorrow. I like the protocol lines as it makes code a lot easier to follow for me (and I am used to it) but it isn't consistent with your code so can remove it. I'm happy to merge it into launch.py although I think that file is quite big now at close to 1000 lines! I will let you modify the launchers. I'll fix travis tomorrow.

if isdir(full_path):
launches.append(self._get_launch_info(full_path))
launches.sort()
self.launches = launches
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably do this instead:

self.launches = sorted(launches)

Makes very little difference but it does save one line...

@jlstevens
Copy link
Member

Ok, I've only made two comments in the code and otherwise I think it is ready to merge (into launch.py) as soon as Travis is happy!

I think I would put this just before the Launchers:

#===============#
# Output Helper #
#===============#

# <Your code here>

#===========#
# Launchers #
#===========#

Thanks for doing this! I'll return a suitable Output instance from the launchers after the merge...

@prabhuramachandran
Copy link
Contributor Author

OK, I have made the requested changes. I haven't added any tests and am not sure I will have time to add them soon but am happy to do that if needed.

@prabhuramachandran
Copy link
Contributor Author

I've also pushed a couple of simple tests now, if Travis is happy this is now RTM.

@jlstevens
Copy link
Member

Looks good to me! Marco says he might have a few comments that he will add here later...

@prabhuramachandran
Copy link
Contributor Author

OK, can't those be addressed in a separate PR?

@jlstevens
Copy link
Member

Ok, I've quickly gone through it with Marco and he is also satisfied.

I'll go ahead and merge it now. Thanks again for your contribution!

jlstevens added a commit that referenced this pull request Jun 24, 2015
Introduced Output class to conveniently collect information across launches
@jlstevens jlstevens merged commit 09c8fd9 into ioam:master Jun 24, 2015
@prabhuramachandran prabhuramachandran deleted the collect-outputs branch June 25, 2015 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants