Investigate/implement streaming approaches, more generally #393

Closed
ctb opened this issue Apr 19, 2014 · 8 comments

@ctb
Member

ctb commented Apr 19, 2014

(Updated in #1206)

In partial response to ivory.idyll.org/blog/2014-pycon.html, can we put our protocols on a streaming basis by systematically introducing streaming functionality into khmer?

There are ways to do this even for tools such as Trimmomatic, using clever Unix socket tricks.

Also see #149.

@mr-c
Contributor

mr-c commented Sep 2, 2014

can we put our protocols on a streaming basis by systematically introducing streaming functionality into khmer?

Yes.

  • Test the scripts to see which support stdout / stdin (whether by omitting the filename, passing a dash "-", or passing /dev/std{in,out})
  • Group non-compliant scripts by reason: can't specify the output name (due to the use of basenames); no support for writing to a single file when given multiple inputs; unnecessary file seeking; other
  • Make improvements as needed
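
As a rough illustration of the dash convention in the first bullet, here is a minimal sketch of a helper that maps "-" (or an omitted name) to the standard streams. The names `open_stream` and `copy_reads` are hypothetical and not part of khmer; the point is only that a script written this way never needs to seek or derive output names from input basenames.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_stream(name, mode="r"):
    """Open `name`, treating "-" (or None) as stdin/stdout.

    Hypothetical helper for illustration; not part of khmer.
    """
    if name in (None, "-"):
        # Hand back the process's standard stream and leave it open on exit.
        yield sys.stdout if "w" in mode else sys.stdin
    else:
        with open(name, mode) as handle:
            yield handle


def copy_reads(in_name, out_name):
    """Copy records line by line without seeking, so pipes work too."""
    with open_stream(in_name) as src, open_stream(out_name, "w") as dst:
        for line in src:
            dst.write(line)
```

With this shape, `copy_reads("-", "-")` would behave as a filter in a shell pipeline, while explicit filenames keep working as before.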

mr-c added this to the 1.2+ milestone Sep 2, 2014
@ctb
Member Author

ctb commented Sep 2, 2014

How to handle streaming at the command level is one component of the issue, but there are several others:

First, some of our algorithms don’t handle streaming properly yet. (I’m looking at you, filter-abund.) Here I think we will need judicious refactoring of the internal code to support iterator-style consumption and production of reads.
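
To make "iterator-style consumption and production of reads" concrete, here is a minimal sketch using Python generators. The function names (`read_fasta`, `drop_short`, `write_fasta`) and the length filter are assumptions chosen for illustration; they are not khmer's internal API.

```python
def read_fasta(handle):
    """Yield (name, sequence) pairs lazily from a FASTA-formatted handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if name is not None:
        yield name, "".join(chunks)


def drop_short(reads, min_length):
    """Consume an iterator of reads and produce only the long-enough ones."""
    for name, seq in reads:
        if len(seq) >= min_length:
            yield name, seq


def write_fasta(reads, handle):
    """Drain an iterator of reads into an output handle."""
    for name, seq in reads:
        handle.write(">{}\n{}\n".format(name, seq))
```

Because each stage holds only one read at a time, a call such as `write_fasta(drop_short(read_fasta(sys.stdin), 20), sys.stdout)` would run in constant memory regardless of input size.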

Second, some of our approaches are not single-pass and will require “holding cells” for some data (filter-abund, again). Thinking about how to handle this cleanly has been a bit challenging and will require some experimentation.
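
One way to picture a holding cell is a temporary file that parks reads during the first pass so they can be replayed in the second. The sketch below is an assumption-laden toy: exact counting in a `Counter` stands in for khmer's counting structures, the keep/drop rule is far simpler than what filter-abund does, and the name `filter_low_abundance` is hypothetical.

```python
import collections
import tempfile


def kmers(seq, k):
    """Yield every k-mer of `seq`."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_low_abundance(reads, k=4, min_count=2, spool_dir=None):
    """Two passes over a one-shot stream, using a temp file as a holding cell.

    Illustrative only: reads are kept when every k-mer reaches `min_count`;
    reads shorter than k have no k-mers and are therefore kept.
    """
    counts = collections.Counter()
    with tempfile.TemporaryFile(mode="w+", dir=spool_dir) as cell:
        # Pass 1: count k-mers while parking each read in the holding cell.
        for name, seq in reads:
            counts.update(kmers(seq, k))
            cell.write("{}\t{}\n".format(name, seq))
        # Pass 2: replay the parked reads now that the counts are complete.
        cell.seek(0)
        for line in cell:
            name, seq = line.rstrip("\n").rsplit("\t", 1)
            if all(counts[km] >= min_count for km in kmers(seq, k)):
                yield name, seq
```

The input iterator is consumed exactly once, so the source can be a pipe; only the holding cell needs to be seekable.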

And then, even if we figure all of this out, it’s not clear to me that supporting streaming via stdin/stdout is going to be terribly efficient. I’d like to support multi-threaded and multi-file reading (which is impossible via stdin). It should also be possible to support ad hoc composition of functions so that we can be flexible in distributed situations (e.g. machine A does diginorm, machine B does filter-abund, machine C does assembly).

All of this will require some design work, and the C++ read-handling code could also use some refactoring...

It’d be good to have someone prototype this, but it seems beyond the scope of any current in-lab effort, at least for now. But it’s a neat CS-y research project.

—titus

@ctb
Member Author

ctb commented Sep 13, 2014

Re the second issue (holding cells), see #601 for a proposed approach.

@mr-c
Contributor

mr-c commented Sep 13, 2014

+1 to holding cells. We will need to document for users how to redirect them to a solid-state drive, RAM drive, et cetera.
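
If the holding cells end up being ordinary temporary files, redirection may come almost for free from Python's `tempfile` module. The sketch below assumes exactly that; `open_holding_cell` is a hypothetical name, and nothing here reflects a decision made in this thread.

```python
import tempfile


def open_holding_cell(directory=None):
    """Open an anonymous scratch file in `directory`.

    With directory=None, tempfile falls back to its default search order,
    which starts with the TMPDIR, TEMP, and TMP environment variables.
    """
    return tempfile.TemporaryFile(mode="w+", dir=directory)
```

Under this assumption, a user could point the scratch space at a faster device either by passing a directory explicitly or by setting `TMPDIR` before running the script.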

@mr-c
Contributor

mr-c commented Oct 31, 2014

To address https://github.com/ged-lab/khmer/pull/644/files#r19680360
khmer.file.check_space() needs to skip devices; see https://docs.python.org/2/library/stat.html#stat.S_ISBLK and S_ISCHR
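
A minimal sketch of the device test those `stat` helpers allow, assuming a POSIX system. `is_device` and `total_input_size` are hypothetical names; this is not khmer's actual `check_space()` implementation, only the kind of guard it could use.

```python
import os
import stat


def is_device(path):
    """Return True if `path` is a block or character special file."""
    mode = os.stat(path).st_mode
    return stat.S_ISBLK(mode) or stat.S_ISCHR(mode)


def total_input_size(paths):
    """Sum the sizes of regular inputs, skipping devices such as /dev/null.

    Hypothetical stand-in for the size estimate a free-space check needs.
    """
    return sum(os.stat(p).st_size for p in paths if not is_device(p))
```

Note that when stdin is a pipe, /dev/stdin typically resolves to a FIFO, not a character device, so `stat.S_ISFIFO` may also be worth checking; whether to skip those is a separate decision.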

@ctb
Member Author

ctb commented Mar 12, 2015

@ctb
Member Author

ctb commented Mar 15, 2015

@ctb
Member Author

ctb commented Jul 22, 2015

Streaming fixes & tests in #1186; this will finish off most of the straightforward practical issues.
