Dataflow Programming Concepts
This page introduces some of the essential concepts on which RestFlow is based. RestFlow uses notions from dataflow programming to tie together software modules (standalone programs, software libraries, scripts, services, etc.) implemented using more traditional programming languages, i.e. those based on the concept of variables. The ideas described below are elaborated in the book Flow-Based Programming (J. Paul Morrison, 1994).
Many programs and scripts written for scientific applications are based on the fundamental ideas of variables for holding data, and function calls that operate on that data. Variables are locations in computer memory with names that serve as shorthands for their addresses. Function calls provide ways for different parts of a program to share access to data using variables.
Programs written this way can execute efficiently because this approach matches the native, low-level computational model of most computers. Unfortunately, these named memory addresses used to share data between different parts of a program are called variables for a reason: there often is little ensuring that the values stored at these addresses remain constant. This inherent mutability of variable values introduces significant challenges when different parts of a program are meant to operate concurrently (using multiple threads, for example).
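The hazard is easy to demonstrate in a few lines of ordinary Python (a toy sketch with made-up names, unrelated to RestFlow itself): two threads share one mutable counter, and whenever a thread switch falls between one thread's read and its subsequent write, an update is lost.

```python
import threading, time

counter = 0  # a shared, mutable variable

def increment_many(n):
    """Increment the shared counter n times, non-atomically."""
    global counter
    for _ in range(n):
        value = counter   # read the shared variable
        time.sleep(0)     # yield to the other thread, widening a window
                          # that exists anyway between the read and the write
        counter = value + 1  # write back, possibly clobbering the other
                             # thread's update

threads = [threading.Thread(target=increment_many, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # usually well below 2000: updates were lost
```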
Concurrent applications are much easier to develop when mutable variables are used only within chunks of code that operate as a single process or thread, and when only immutable data items are passed between units of software meant to run simultaneously. Dataflow programming takes this approach of moving only immutable data, and adds to it the restriction that data move in a single direction only. Dataflows are one-way.
One consequence of programming with dataflows rather than variables is that the overall structure of a program is very different. When variables are used to tie together parts of a program, the resulting software tends to be structured hierarchically, i.e. top-level functions call other functions, which may call yet other functions. For example, if one modularizes program A, i.e., breaks program A into more meaningful, more maintainable, more reusable units of code, the resulting program might then comprise three functions, A', B, and C, where A' calls B and C; or alternatively, A' calls B, which in turn calls C. When calling B, A' passes data to B and receives the results of B's computations back from B. Thus data passes both up and down the chain of function calls. Moreover, A' typically waits for B to finish executing before continuing.
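For concreteness, here is the A'-calls-B-calls-C variant as a few lines of Python (hypothetical functions, not RestFlow code). Note how data passes down the call chain as arguments and back up as return values, and how each caller blocks until its callee returns.

```python
def c(text):
    """Bottom of the call chain: perform the innermost transformation."""
    return text.upper()

def b(text):
    """Middle layer: calls c, waits for its result, then builds on it."""
    return c(text) + "!"

def a_prime(text):
    """Top level: passes data down to b and blocks until b returns."""
    result = b(text)              # data flows down; the result flows back up
    return "processed: " + result

print(a_prime("hello"))           # prints: processed: HELLO!
```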
Note that these observations typically hold even for variable-based programs written using object-oriented programming languages. Variables and functions may be bundled together in classes, but variables are still employed, functions (methods) are still called, and program execution is still based on passing data to and receiving results back from invoked functions.
In contrast, programs based on the dataflow metaphor typically send data from one unit of code to another in one direction only. A program P decomposed using the dataflow approach might yield three software components Q, R, and S with data flowing from Q to R, and from R to S. Data need not flow back from S to R, nor from R to Q. The modules Q, R, and S thus are not organized hierarchically but as peers. Furthermore, there typically is no need for Q to wait for R to complete (as A' typically must wait for B above). Q does not expect anything back from R, so Q is free to do something else. In other words, unlike A', B, and C in a variable-based program, Q, R, and S can easily operate concurrently. If Q operates sequentially on a series of inputs supplied to it, and sends its results to R incrementally as well, then there is nothing to prevent Q from working on one set of data while R is working on the partial results Q produced earlier. The result is pipeline parallelism, and this sort of concurrency often is achieved simply by modularizing a program according to the dataflow paradigm.
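A minimal sketch of the Q → R → S pipeline in Python, using threads and queues (the stage bodies are invented for illustration, and this is not how RestFlow itself is implemented or configured): each stage runs concurrently, consumes items from its input queue, and forwards immutable results downstream only.

```python
import threading, queue

DONE = object()  # sentinel marking the end of the stream

def stage(transform, inbox, outbox):
    """Run one pipeline stage: consume items from inbox, send results to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate end-of-stream downstream
            return
        outbox.put(transform(item))  # results flow in one direction only

source, q_r, r_s, sink = (queue.Queue() for _ in range(4))

threading.Thread(target=stage, args=(lambda x: x * 2, source, q_r)).start()  # Q
threading.Thread(target=stage, args=(lambda x: x + 1, q_r, r_s)).start()     # R
threading.Thread(target=stage, args=(lambda x: x ** 2, r_s, sink)).start()   # S

for item in [1, 2, 3]:
    source.put(item)  # Q can begin on item 2 while R still works on item 1
source.put(DONE)

while (result := sink.get()) is not DONE:
    print(result)  # prints 9, 25, 49
```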
Concurrency is not only easier to achieve via dataflows, it is also typically safer than concurrency achieved via variable-based programming. Again, this is because the data passed between units of code in a dataflow program is held immutable. Once Q has sent data to R, there is no further need for Q to change that data. Nor does R need to change the values of the data it receives. All Q, R, and S need to communicate are the values of the data items they produce and consume. When data passed between modules is held constant, much of the risk of data corruption and deadlock is eliminated, along with many potential sources of race conditions.
Of course, the dataflow paradigm described above is completely familiar to scientists comfortable with running programs in a Unix or Windows command-line environment. Command-line users often string together a number of commands or programs with the pipe (|) symbol between them, and the command shell conveniently routes the outputs of each program in this pipeline to the inputs of the program that follows it. Furthermore (and for the reasons summarized above), these shells can easily and safely run all of the programs in the pipeline at the same time, with data flowing between the running programs. So how is RestFlow different from programming with pipes?
The most important difference is that command-line pipes are only convenient for linear pipelines, where each program in the pipeline takes in one stream of inputs, produces one stream of outputs, and executes exactly once (although often incrementally) during the execution of the pipeline as a whole. When more complex assemblies of programs are needed, the user generally must fall back on the full feature set of the shell languages (csh, bash, etc.) or employ a general-purpose scripting language such as Perl or Python. These scripting languages usually are variable-based and thus share the limitations described above. The dataflow metaphor is lost as soon as the program assembly one wishes to create fails to fit into a single command to the shell.
RestFlow, by contrast, provides a language for specifying how data is to flow between a set of programs organized in any topology whatsoever. Scripts or programs in these 'hyper-pipelines' may accept more than one flow of input data, each coming from a different upstream program, and similarly may produce multiple flows of outputs. Any particular flow of outputs may be routed to any number of downstream programs, and dataflows may even loop back from downstream to upstream programs, to achieve iterative refinement of results, for example. Finally, the programs wired together via these flows may be invoked multiple times during a single execution of the overall assembly.
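To make "any topology" concrete, here is a rough Python sketch in the same style as the pipeline above (an illustration of the concept only, not RestFlow's actual workflow language): one node fans its single output flow out to two downstream consumers, and another node merges two input flows into one.

```python
import threading, queue

DONE = object()  # sentinel marking the end of a flow

def fan_out(inbox, outboxes):
    """Route every item arriving on one flow to several downstream flows."""
    while True:
        item = inbox.get()
        for out in outboxes:
            out.put(item)
        if item is DONE:
            return

def merge(inbox_a, inbox_b, outbox):
    """Combine two input flows pairwise into a single output flow."""
    while True:
        a, b = inbox_a.get(), inbox_b.get()
        if a is DONE or b is DONE:
            outbox.put(DONE)
            return
        outbox.put((a, b))

src, left, right, merged = (queue.Queue() for _ in range(4))

threading.Thread(target=fan_out, args=(src, [left, right])).start()
threading.Thread(target=merge, args=(left, right, merged)).start()

for item in ["a", "b", DONE]:
    src.put(item)

while (pair := merged.get()) is not DONE:
    print(pair)  # ('a', 'a') then ('b', 'b')
```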
Another important difference between RestFlow and command pipelines is that the data passing between programs (the intermediate data) in a RestFlow pipeline can be named and (optionally) saved for later review in automatically organized directory structures.