-
Notifications
You must be signed in to change notification settings - Fork 129
Add SupportsParallel
property to CmdletBinding
#206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,310 @@ | ||
--- | ||
RFC: RFCnnnn | ||
Author: Kirk Munro | ||
Status: Draft | ||
SupercededBy: | ||
Version: 0.1 | ||
Area: Engine | ||
Comments Due: July 26, 2019 | ||
Plan to implement: I'm willing to help | ||
--- | ||
|
||
# Generalized Parallelism | ||
|
||
The inspiration for this RFC came from [RFC 194](https://github.com/PowerShell/PowerShell-RFC/pull/194). | ||
|
||
Recently PowerShell has received some enhancements targeted toward more | ||
parallel invocation of PowerShell, such as the `&` background operator and the | ||
`PowerShell.InvokeAsync` and `PowerShell.StopAsync` async methods. There have | ||
also been an RFC for [parallelism in `foreach`](https://github.com/PowerShell/PowerShell-RFC/pull/174) and another RFC for [parallelism | ||
in `ForEach-Object`](https://github.com/PowerShell/PowerShell-RFC/pull/194) (RFC 194). | ||
|
||
[RFC 194](https://github.com/PowerShell/PowerShell-RFC/pull/194) focuses on adding parallelism support to one cmdlet: ForEach-Object. | ||
While that is useful, with a little more effort and planning there may be a | ||
better way to approach that problem such that any advanced function or cmdlet | ||
could have parallel invocation support. This RFC is being put on the table for | ||
consideration of the broader need for parallelism in PowerShell. | ||
|
||
While it is somewhat in competition with [RFC 194](https://github.com/PowerShell/PowerShell-RFC/pull/194), the end result would be very | ||
similar. `ForEach-Object` would have parallel invocation support, as would any | ||
advanced function or cmdlet that was designed to have such support. By laying a | ||
foundation that can be applied to any cmdlet, and then applying the new | ||
capabilities that foundation brings to PowerShell to the `ForEach-Object` | ||
cmdlet, we end up with a common framework for faster automation where | ||
asynchronous invocation would be advantageous that can be applied to any | ||
advanced function or cmdlet. | ||
|
||
Note that [the RFC in PR 204](https://github.com/PowerShell/PowerShell-RFC/pull/204) is very closely related to this RFC and both should | ||
be implemented at the same time, assuming that both are approved. If you | ||
haven't already, have a quick look at that RFC and share feedback on it. It's | ||
very straightforward. | ||
|
||
[The RFC in PR 205](https://github.com/PowerShell/PowerShell-RFC/pull/205) is also related to this | ||
RFC and should be another quick read. | ||
|
||
## Motivation | ||
|
||
As an author of PowerShell commands, | ||
I can turn on parallelism support in my commands and implement the command logic accordingly, | ||
so that scripters using my command can build faster automated solutions that scale. | ||
|
||
## User Experience | ||
|
||
To demonstrate the user experience for command authors, let's take a closer | ||
look at the `Invoke-Command` cmdlet. `Invoke-Command` has parallelism built-in | ||
already, but when you inspect the metadata, you'll see that there are some gaps | ||
in the implementation across its 17 parameter sets. These gaps result in | ||
inconsistencies that could have been avoided completely by having parallelism | ||
support for commands implemented as a feature that can be turned on in | ||
parameter sets where it is needed. | ||
|
||
> Please note that `Invoke-Command` creates remoting jobs which are background | ||
jobs run in child processes, not thread jobs that are run in separate runspaces | ||
on separate threads in the same process; however, that implementation predates | ||
the PSThreadJob module, so don't let that distract you. The goal here is easier | ||
parallelism in any command where it is appropriate, and in our cloud-centric | ||
work environment, there are many commands where parallelism would be useful. | ||
The `Invoke-Command` example is still a good one to use because it is already | ||
familiar. | ||
|
||
To understand what I'm talking about with `Invoke-Command`, invoke the | ||
following in PowerShell: | ||
|
||
```powershell | ||
function HasParameter { | ||
param( | ||
[System.Management.Automation.CommandParameterSetInfo]$ParameterSet, | ||
[string]$ParameterName | ||
) | ||
@($ParameterSet.Parameters.where{$_.Name -eq $ParameterName}).Count -gt 0 | ||
} | ||
$cmd = Get-Command Invoke-Command | ||
$cmd.ParameterSets.foreach{ | ||
[pscustomobject]@{ | ||
ParameterSet = $_.Name | ||
AsJob = HasParameter $_ 'AsJob' | ||
JobName = HasParameter $_ 'JobName' | ||
ThrottleLimit = HasParameter $_ 'ThrottleLimit' | ||
TimeoutSecs = HasParameter $_ 'TimeoutSecs' | ||
} | ||
} | ft | ||
``` | ||
|
||
This outputs the following: | ||
|
||
```none | ||
ParameterSet AsJob JobName ThrottleLimit TimeoutSecs | ||
------------ ----- ------- ------------- ----------- | ||
InProcess False False False False | ||
FilePathRunspace True True True False | ||
Session True True True False | ||
ComputerName True True True False | ||
FilePathComputerName True True True False | ||
Uri True True True False | ||
FilePathUri True True True False | ||
VMId True False True False | ||
VMName True False True False | ||
FilePathVMId True False True False | ||
FilePathVMName True False True False | ||
SSHHost True False False False | ||
ContainerId True True True False | ||
FilePathContainerId True True True False | ||
SSHHostHashParam True False False False | ||
FilePathSSHHost True False False False | ||
FilePathSSHHostHash True False False False | ||
``` | ||
|
||
That output highlights several limitations that come from the way parallelism | ||
is implemented in `Invoke-Command`: | ||
|
||
1. Only half of the parameter sets that support `-AsJob` allow you to name the | ||
job that is created with a `-JobName` parameter, but they all should support | ||
that. | ||
1. Some of the parameter sets that support parallelism don't allow you to set a | ||
throttle limit, but they all should. | ||
1. `Invoke-Command` uses multi-process parallelism with the same parameter | ||
names that are proposed for multi-threaded parallelism in [RFC 194](https://github.com/PowerShell/PowerShell-RFC/pull/194). This is | ||
inconsistent and bound to cause some confusion. | ||
1. None of the parameter sets that support parallelism allow you to set a | ||
timeout, because these are implemented to run in background jobs, which don't | ||
timeout. If they were implemented to run as threadjobs instead, timeouts become | ||
possible and so they should consider timeouts in a multi-threaded | ||
implementation. | ||
|
||
If `Invoke-Command` was implemented instead using a setting in a `CmdletBinding` | ||
attribute that automatically took care of the parameters used for parallel | ||
invocation of certain parameter sets, then the parameters for parallelized | ||
invocation would always be properly defined, and they would always work the | ||
same way for this cmdlet and for every other advanced function or command that | ||
supports parallelization. | ||
|
||
With all of that explained, below you'll see a simplified example showing how | ||
the `Invoke-Command` command might be defined to support parallelization with | ||
common parameters. For that example, I only included a few parameter sets to | ||
keep it simple (there is no need to include all 17 parameter sets here). | ||
|
||
```powershell | ||
function Invoke-Command { | ||
[CmdletBinding( | ||
DefaultParameterSetName='InProcess', | ||
# NEW! SupportsParallel defines the parameter sets that support | ||
# parallel execution of the process block in thread jobs that are | ||
# automatically created. All other parameter sets would result in | ||
# synchronous execution of the process block in the current runspace. | ||
# Each parameter set is named, separated by commas. If the command | ||
# should support parallel execution in only one of the parameter sets, | ||
# the array enclosures can be omitted. If the command should support | ||
# parallel execution in all parameter sets, the command author can | ||
# simply drop the equals operator and the values after it. | ||
SupportsParallel=@('FilePathRunspace','Session','VMId') | ||
)] | ||
param( | ||
# Define parameters here | ||
|
||
# NEW! Any parameter sets that support parallel invocation would | ||
# automatically receive additional common parameters. Like other common | ||
# parameters, these could be used in Intellisense and tab completion, | ||
# but they wouldn't be defined in a param block. They would be assigned | ||
# automatically to the appropriate parameter sets by PowerShell. | ||
# [-Parallel] : A switch used to run a command in parallel with the | ||
# default configuration for parallel command execution. | ||
# This parameter is not required if any of the following | ||
# parameters are used in the invocation. | ||
# [-ThreadLimit <int>] : An int used to define the maximum number of | ||
# threads to use during the parallel command | ||
# execution (default value depends on system | ||
# configuration) | ||
# [-TimeoutSecs <int>] : An int used to define the maximum time allowed | ||
# for all threads to complete. This parameter | ||
# cannot be used in conjunction with the next | ||
# two parameters. | ||
) | ||
begin { | ||
# No changes here, this would be invoked synchronously in the current | ||
# session. | ||
} | ||
process { | ||
# NEW! The way this is invoked depends on whether or not the parameter | ||
# set that was used supports parallelization or not. | ||
# | ||
# For a parameter set that does not support parallelization, this would | ||
# be invoked synchronously. | ||
# | ||
# For a parameter set that supports parallelization, this would be | ||
# invoked in a thread job, and the command would wait for all thread | ||
# jobs to complete before continuing with end (and dispose) blocks. | ||
|
||
# NEW! Unlike Start-Job commands where you must use the "using:" prefix | ||
# on variables to access variables that are defined outside of the | ||
# scope of the script block, all variables that are defined in the | ||
# current scope would be accessible when the command is invoked with | ||
# parallel processing without having to use the "using:" prefix. This | ||
# is a smarter design for working with variables from other runspaces, | ||
# and allows for easier transition of a command that does not support | ||
# parallel processing into a command that does support parallel | ||
# processing. The "using:" qualifier should only be used when you want | ||
# to reference a variable that is outside of the scope of the local | ||
# function. | ||
|
||
# NEW! "using:" does not allow for transport of script blocks. This was | ||
# done as a security measure (see [PowerShell Issue 9703](https://github.com/PowerShell/PowerShell/issues/9703) for details); | ||
# however, there are valid cases where script blocks must be | ||
# transportable into jobs, and this feature is one such case. Any | ||
# variables that contain script blocks (i.e. parameters, variables | ||
# defined in a begin block, etc.) must be made available to the thread | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you elaborate on how you envision this working? Currently scriptblocks are bound to either the "global" or a module's SessionState. Each scriptblock can read and alter the SessionState to which it is bound when it is invoked. This works fine in a single-threaded scenario because only one scriptblock at a time can alter a given SessionState. With a second thread, however, two threads could be accessing the same SessionState and currently results in a crash. That is the topic of PowerShell/PowerShell#7626 which does not seem to have an obvious resolution. |
||
# jobs where process is invoked. Additionally, local functions defined | ||
# in a begin block must be made available to the thread jobs where | ||
# process is invoked. In general, local variables (from parameters or | ||
# the begin block) and any resources that are created in the begin | ||
# block such as functions, etc. must be made available in the runspaces | ||
# where the threadjobs will run. | ||
} | ||
end { | ||
# No changes here, this would be invoked synchronously in the current | ||
# session. | ||
} | ||
} | ||
|
||
$twoSessions = @($session1, $session2) | ||
Invoke-Command -Parallel -Session $twoSessions -ScriptBlock { | ||
'This will be run in a thread job for each session.' | ||
} | ||
|
||
Invoke-Command -Session $twoSessions -ScriptBlock { | ||
'This will be run once for each session, sequentially.' | ||
} | ||
``` | ||
|
||
```output | ||
This will be run in a thread job for each session. | ||
This will be run in a thread job for each session. | ||
This will be run once for each session, sequentially. | ||
This will be run once for each session, sequentially. | ||
``` | ||
|
||
## Specification | ||
|
||
For this to work, the following rough list of changes would have to be implemented: | ||
|
||
* update `CmdletBinding` to store the names of parameter sets that support | ||
parallel execution in the `PSCmdlet` object. | ||
* update the common parameter assignment to add `-Parallel`, `-ThreadLimit`, | ||
and `-TimeoutSecs` parameters to parameter sets that support `SupportsParallel`. | ||
* add logic to the command processor that identifies the local variables and | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Can you explain what you mean by "local variables". PowerShell doesn't really have the concept of "local variables". Rather, it employs (among other things) a multitude of nested scopes hosted in SessionStates that may or may not themselves be related. AFAICT each variable encountered during scriptblock invocation is resolved by following a search algorithm. In general, I don't think something like "identifying the local variables" is computable. For example, in $g = Get-SomeScriptBlockGenerator
1..10 | Invoke-Command { . $g.GetNextScriptBlock() } The SessionState (and therefore variables in scope) of the scriptblock returned by .GetNextScriptBlock() cannot be discovered prior to call the .GetNextScriptBlock(). There is another way though, I think. AFAICT the search "rules" for a runspace could be expanded to include the SessionState of the call site of Invoke-Command of the parent runspace. |
||
any resources such as functions defined in the `begin` block that need to be | ||
made available to a `ThreadJob`, and make them available there as if the `using:` | ||
qualifier was used on them. | ||
* update the command processor to support invocation of the `ProcessRecord` | ||
method inside of a `ThreadJob` when the command is invoked with one of the new | ||
common parameters. | ||
* update the command processor such that it waits for `ThreadJob` instances it | ||
creates to complete, and then writes their results to the pipeline as they | ||
finish execution. | ||
* update `ForEach-Object` to be the first command that support `SupportsParallel`, | ||
so that users have an easy way to create multi-threaded, multi-runspace | ||
execution paths in PowerShell. | ||
|
||
## Alternate Proposals and Considerations | ||
|
||
### Running a parallelized command asynchronously | ||
|
||
Parallelization is about being able to perform multi-threaded processing of | ||
objects in a pipeline. This is what is natively supported by the enhancements | ||
outlined in this RFC. The RFC that originally inspired this approach included | ||
an `-AsJob` parameter that would allow for asynchronous invocation of the process | ||
block of `ForEach-Object`. There are two problem with `-AsJob` in this scenario: | ||
|
||
1. (Minor) Other commands use `-AsJob` as a parameter today to return a | ||
background job that is run in a separate child process, which in turn may | ||
contain multiple child background jobs. Having this parameter also used to | ||
return one or more thread jobs is bound to confuse some people. The solution | ||
here is straightforward: use a `-AsThreadJob` parameter name instead of `-AsJob`. | ||
1. (Major) Pipelines in PowerShell have a very specific processing sequence. In | ||
general, first the `begin` blocks in the pipeline are processed, then the `process` | ||
blocks, and last the `end` blocks. In the future we will also see a `dispose` block | ||
that gets processed after the end block, or on terminating error. Introducing a | ||
parameter that makes a command optionally return a job from a process block means | ||
that command cannot properly handle logic in the `end` or `dispose` blocks, | ||
because those should only be invoked after the `process` block work has | ||
finished execution, and returning a job defers that execution until later. | ||
|
||
The second issue is so significant that it highlights a very important detail | ||
when considering adding parallel support to a command: | ||
|
||
_Commands that support parallel processing must be run to completion so that | ||
`end` and `dispose` blocks do their work in the proper sequence, and any | ||
decision to make the pipeline in which those commands are used run as a | ||
background job or as a thread job needs to be made outside of the pipeline._ | ||
|
||
With that in mind, parameters related to creating a job have been excluded from | ||
this RFC, and a related RFC has been created to facilitate invocation of a | ||
pipeline in a thread job by using an operator that is external to the pipeline. | ||
|
||
### Runspace pooling (and other possible future parallel improvements) | ||
|
||
It may be desired to allow users to create a pool of runspaces, and then cycle | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think this is an understatement. From the testing I have performed any parallelization without sophisticated runspace initialization and reuse is basically unusable save for a few contrived scenarios. Without such runspace management, for most scenarios so much time is spent waiting for modules to load that that wait far surpasses the gains of parallelization. I think this repro from PowerShell/PowerShell#7524 is the most direct demonstration of the problem. |
||
through those runspaces as thread jobs are executed in the pipeline. Since this | ||
proposal uses the common parameter approach, it would be easy to add more | ||
common parameters that extend the support to include specific runspace pooling | ||
in the future, and all commands that support parallel processing will | ||
automatically gain the benefit of that work if it is added. |
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what you mean by "the current scope" here? Do you mean the SessionState at the Invoke-Command call site? If so, how do you envision variables defined at the call site will be resolved from the perspective of
process{}
? Would the SessionState to which theprocess{}
scriptblock is bound be a "child" of the SessionState from which Invoke-Command was invoked? (This would be similar to the way module SessionStates often have the "global" SessionState as a parent.)(For readers unfamiliar with PowerShell's SessionState scoped variable search "rules", @lzybkr provides an explanation in #6139. The important thing to understand is that those "rules" result in variable resolution behavior that is substantially different from C#.)