A powerful and flexible Python pipeline framework for parallel processing with progress tracking capabilities.
- Parallel Processing: Support for both multi-threading and multi-processing
- Progress Tracking: Built-in progress bars using tqdm with total or per-stage tracking
- Flexible Worker System: Easy-to-implement worker classes for custom processing
- Error Handling: Robust error handling with detailed error propagation
- Ordered/Unordered Results: Option to maintain input order or get results as they complete
- Stage-based Pipeline: Chain multiple processing stages together
- Async Support: Built-in support for asynchronous processing
- Mixed Mode Processing: Combine thread and process modes in the same pipeline
```bash
pip install mpipeline
```

Here's a simple example of using MPipeline with thread mode:
```python
from mpipeline import Pipeline, Worker, Stage


# Define a worker
class NumberProcessor(Worker[int, int]):
    def doTask(self, inp: int) -> int:
        return inp * 2


# Create and run a pipeline with thread mode
pipeline = Pipeline(
    Stage(NumberProcessor, worker_count=4, mode='thread')
)

# Process numbers with progress tracking
results = list(pipeline.run(
    range(100),
    ordered_result=True,
    progress='total'  # Show overall progress
))
```
```python
# Define multiple worker stages
class NumberGenerator(Worker[int, int]):
    def doTask(self, inp: int) -> int:
        return inp * 2


class DataProcessor(Worker[int, float]):
    def doTask(self, inp: int) -> float:
        return inp * 1.5


# Create multi-stage pipeline with mixed thread/process modes
pipeline = Pipeline(
    Stage(NumberGenerator,
          worker_count=4,
          mode='thread',  # Use threading for I/O-bound tasks
          worker_kwargs={'name': 'Generator'})
).then(
    Stage(DataProcessor,
          worker_count=2,
          mode='process',  # Use processes for CPU-bound tasks
          multiprocess_mode='spawn',  # Use spawn for better compatibility
          worker_kwargs={'name': 'Processor'})
)

# Run pipeline with stage-level progress tracking
results = pipeline.run(
    range(50),
    ordered_result=True,
    progress='stage'  # Show per-stage progress
)
```

```python
class ErrorProneWorker(Worker[float, str]):
    def doTask(self, inp: float) -> str:
        if inp > 20:
            raise ValueError(f"Input too large: {inp}")
        return f"Processed: {inp:.1f}"


try:
    pipeline = Pipeline(
        Stage(ErrorProneWorker, worker_count=2, mode='thread')
    )
    results = list(pipeline.run(range(25)))
except Exception as e:
    print(f"Error caught: {e}")
```

- `Pipeline(stage: Stage[T, Q])`: Create a new pipeline with an initial stage
- `then(stage: Stage[Q, Z])`: Add a new stage to the pipeline
- `run(inputs, ordered_result=True, progress=None)`: Run the pipeline
  - `progress`: Progress tracking mode
    - `'total'`: Show overall progress
    - `'stage'`: Show per-stage progress
    - `None`: No progress tracking
Configure a pipeline stage with the following parameters:
- `worker_class`: The Worker class to use for this stage
- `worker_count`: Number of parallel workers (default: 1)
- `mode`: Processing mode - `'thread'` or `'process'` (default: `'thread'`)
- `multiprocess_mode`: Multiprocessing mode when using `'process'` - `'spawn'` or `'fork'` (default: `'spawn'`)
- `worker_args`: Positional arguments for worker initialization
- `worker_kwargs`: Keyword arguments for worker initialization
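A rough configuration sketch tying these parameters together. The `Labeler` worker and its `prefix`/`repeat` arguments are hypothetical (not part of mpipeline) and only illustrate how `worker_args` and `worker_kwargs` might reach a worker's `__init__`, assuming a Worker subclass can define its own constructor:

```python
from mpipeline import Pipeline, Stage, Worker


# Hypothetical worker, used only to illustrate Stage configuration;
# the prefix/repeat parameters are not part of mpipeline.
class Labeler(Worker[int, str]):
    def __init__(self, prefix, repeat=1):
        self.prefix = prefix
        self.repeat = repeat

    def doTask(self, inp: int) -> str:
        return f"{self.prefix * self.repeat}{inp}"


stage = Stage(
    Labeler,                      # worker_class
    worker_count=2,               # parallel workers for this stage
    mode='thread',                # 'thread' or 'process'
    worker_args=('#',),           # positional args for Labeler.__init__
    worker_kwargs={'repeat': 3},  # keyword args for Labeler.__init__
)

results = list(Pipeline(stage).run(range(5)))
```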
Base class for implementing custom workers:
- `doTask(inp: T) -> Q`: Process a single input item
- `doDispose()`: Called once after processing ends
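A sketch of `doDispose()` in practice, assuming a worker may hold per-worker state; the SQLite database and table used here are purely illustrative:

```python
import sqlite3

from mpipeline import Worker


class RecordLookup(Worker[int, str]):
    def doTask(self, inp: int) -> str:
        # Open one connection per worker on first use; 'records.db' and its
        # schema are illustrative only.
        if not hasattr(self, '_conn'):
            self._conn = sqlite3.connect('records.db')
        row = self._conn.execute(
            'SELECT name FROM records WHERE id = ?', (inp,)
        ).fetchone()
        return row[0] if row else 'unknown'

    def doDispose(self):
        # Called once after this worker has finished processing.
        if hasattr(self, '_conn'):
            self._conn.close()
```

Creating the connection lazily inside `doTask` ensures it is created in the thread or process that actually uses it, and `doDispose` gives a single place to release it when the stage shuts down.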
- Use `mode='thread'` for I/O-bound tasks (network, disk operations)
- Use `mode='process'` for CPU-bound tasks (heavy computation)
- Set `ordered_result=False` for better performance when order doesn't matter
- Use `multiprocess_mode='spawn'` for better cross-platform compatibility
- Adjust `worker_count` based on your system's resources and task type
- Choose appropriate progress tracking:
  - Use `progress='total'` for simple overall progress
  - Use `progress='stage'` when monitoring individual stage performance
  - Use `progress=None` for maximum performance
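For example, combining the unordered-results and no-progress tips above in one throughput-oriented run (a sketch reusing `NumberProcessor` from the quick-start example):

```python
from mpipeline import Pipeline, Stage

# NumberProcessor is the worker from the quick-start example above.
pipeline = Pipeline(
    Stage(NumberProcessor, worker_count=8, mode='thread')
)

# Unordered results plus no progress bar: lowest-overhead configuration.
for value in pipeline.run(range(1000), ordered_result=False, progress=None):
    ...  # each result arrives as soon as its item completes, not in input order
```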
Contributions are welcome! Please feel free to submit a Pull Request.
