async scan #3724
Conversation
```python
def _scandir(future, path, request):
    """Callback helper for scandir()."""
    future.set_result([
        Path(path, directory.name)
        for directory in request.result
    ])


async def scandir(path):
    """Asynchronous directory listing using pyuv."""
    ret = asyncio.Future()

    loop = pyuv.Loop.default_loop()
    pyuv.fs.scandir(loop, str(path), callback=partial(_scandir, ret, path))
    loop.run()

    return await ret
```
This is the wrapper which allows us to use `pyuv` with `asyncio`.
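The callback-to-future bridge used in this wrapper is a general pattern. Here is a minimal self-contained sketch of it, where `callback_api` is a hypothetical stand-in for a callback-based API such as `pyuv.fs.scandir` (the pyuv specifics are omitted):

```python
import asyncio

def callback_api(path, callback):
    # Hypothetical stand-in for a callback-based API such as
    # pyuv.fs.scandir: it reports its result via a callback.
    callback(['flow1', 'flow2'])

def _on_result(future, result):
    # Bridge the callback world to asyncio by resolving a Future.
    future.set_result(result)

async def scandir(path):
    future = asyncio.get_running_loop().create_future()
    callback_api(path, lambda result: _on_result(future, result))
    return await future

print(asyncio.run(scandir('~/cylc-run')))  # ['flow1', 'flow2']
```

The coroutine awaits the Future, and whoever drives the callback API resolves it; this is how a foreign event loop's results can be consumed from `asyncio`.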
Force-pushed from f67a399 to 639259a.
Does it need to be compatible?

That is the question, people may well be scraping

Is there some machine-readable output format, say JSON? Might be better to provide something like that rather than trying to maintain backward compatibility for human-intended output.
Yes, both the new and old scan clients provide a JSON format, however, I've broken that too:

```shell
$ cylc scan -t json
[
    [
        "generic",
        "olivers-macbook-pro.local",
        "43098",
        "43094",
        "5",
        {
            "name": "generic",
            "owner": "oliver",
            "version": "8.0a2"
        }
    ]
]
```
```shell
$ cylc scan-nt -t json
[
    {
        "name": "generic",
        "path": "/Users/oliver/cylc-run/generic",
        "contact": "/Users/oliver/cylc-run/generic/.service/contact",
        "CYLC_API": "5",
        "CYLC_SSH_USE_LOGIN_SHELL": "True",
        "CYLC_SUITE_HOST": "olivers-macbook-pro.local",
        "CYLC_SUITE_NAME": "generic",
        "CYLC_SUITE_OWNER": "oliver",
        "CYLC_SUITE_PORT": "43098",
        "CYLC_SUITE_PROCESS": "54860 /Users/oliver/miniconda3/envs/cylc8/bin/python3.7 /Users/oliver/miniconda3/envs/cylc8/bin/cylc-run generic",
        "CYLC_SUITE_PUBLISH_PORT": "43094",
        "CYLC_SUITE_RUN_DIR_ON_SUITE_HOST": "/Users/oliver/cylc-run/generic",
        "CYLC_SUITE_UUID": "33000c07-565e-4d93-84f4-39d6f315bf31",
        "CYLC_VERSION": "8.0a2"
    }
]
```

It would be possible to re-instate the old JSON format. Note that the addition of the publisher port to this output has already broken JSON scan cylc7/8 compatibility.
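For what it's worth, the fragility argument can be made concrete: with positional records a scraper must index by position, so inserting a field (as the publisher port did) silently shifts the meaning of every later index. A sketch with illustrative data only, not the real client output:

```python
import json

# Old-style positional record: meaning depends on field order, so
# inserting the publisher port shifted every later index.
old = json.loads('["generic", "host.local", "43098", "43094", "5"]')
port = old[2]          # positional: breaks if a field is inserted before it

# New-style keyed record: self-describing and order-independent.
new = json.loads('{"CYLC_SUITE_NAME": "generic", "CYLC_SUITE_PORT": "43098"}')
port = new["CYLC_SUITE_PORT"]   # robust to added or reordered fields

print(port)  # 43098
```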
The new format is much nicer though - much easier to

My 2 cents - you're allowed to break compatibility at major versions. That kind of opportunity doesn't happen very often, so make the most of it. (PS. you could use
Sorted the keys, good suggestion. I'd rather not go back to the old JSON format as it's less explicit and more fragile, but added it to the questions above. The other format likely to be used by scrapers is "raw":

```shell
$ cylc scan -t raw
<suite>|<user>|<host>|port|<port>
```

I'd rather ditch it but it's trivial to re-implement.
Force-pushed from fd7ed4d to 8e33040.
I don't mind breaking the old
Force-pushed from b2483e8 to 5b1c6ba.
Force-pushed from 24e3e1e to 48be6ca.
```shell
$(cylc scan --color=never --publisher -n "${SUITE_NAME}" \
    | awk -F' ' -v RS='\n([ \t]*\n)+' '{print $1 " " $3}')
```
Didn't need to provide host:port via `cylc scan`, the internal logic does this for us.
Force-pushed from 48be6ca to a007c78.
```python
@pipe
async def scan(run_dir=None, scan_dir=None, max_depth=4):
```
The `max_depth` is a new safety feature to avoid going too deep down the rabbit warren in the event that:

- Someone puts something which isn't a workflow into their cylc-run dir (a not entirely uncommon occurrence).
- The way we recognise flows changes (e.g. the suite.rc file gets renamed or converted to a Python file).

What is a sensible default value for this? Set as 4 for now as that's what's required by the Cylc functional test battery.
CLI option? or too rare to care?
I think probably too rare to care, or at least the default should be sufficient that we wouldn't expect users to hit it.
I've managed to do something to anger

The details of the new
Force-pushed from bd5b9b7 to 855161c.
Force-pushed from 855161c to 1ad05e6.
In your example the generator (

You have pointed out a perfectly legit issue with the pipe logic, which is essentially depth-first, which means we won't get any real concurrency.
Ok, moved to a queue-based system which allows the pipe to follow paths which aren't depth-first. Here's an example which inserts random sleeps into each stage of the pipe:

```python
import asyncio
from random import random

from cylc.flow.async_util import pipe


@pipe
async def a_range(x):
    for y in range(x):
        await asyncio.sleep(random())
        print(f'# a_range -> {y}')
        yield y


@pipe
async def a_double(x):
    print(f'# a_double({x})')
    await asyncio.sleep(random())
    return x ** 2


@pipe
async def a_str(x):
    print(f'# a_str({x})')
    await asyncio.sleep(random())
    return str(x)


async def consumer():
    pipe = a_range(5) | a_double | a_str
    async for item in pipe:
        print(f'=> {repr(item)}')


asyncio.run(consumer())
```

If it's concurrent you should get a different order of events every time, with results not necessarily yielded in order.

run 1:

run 2:

Setting all of the sleeps to
Well that certainly sped things up; my "speak" example now runs in 4 seconds instead of 20. However, your new implementation is what I referred to in Element chat yesterday as the "extreme concurrency" model. This is more concurrency than a pipeline is supposed to have, which I think is fine (and actually desirable) for the scan application, but it kind of calls into question the pipe name and special pipeline syntax, doesn't it? Because it doesn't preserve order coming out the end of the pipe. With random sleeps in my example, I get (e.g.):

To belabor the point, in a shell pipeline I'd expect the file

```shell
cat foo | tr b B | tr B b > bar
```

but your pipe would reorder the lines if some of them took much longer to "translate" than others. (Bad example if there are other B's in the file, but you get my point).
In Cylc graph notation, here's a pipeline

so C.3, B.2, and A.1 (e.g.) can run concurrently (they're at different points along the one pipe) but A.1, A.2, A.3 can't (that would be 3 concurrent pipes). But you've done this, which is 3 pipes:
Maybe we could support both modes, to justify the cool syntax? Or both separately, with a slightly different name and syntax for the multiple parallel pipes case?

I think that's fine if your printing just collects whichever is done (reached the end of the pipe) from a queue, and prints one at a time (right?). Then if we want to order the result, we wait for all concurrency to finish and order the results (which would still be faster) (?)

But how do you know what the order should be at the other end @dwsutherland?

That's optional, user determined (or default). (?)

(Note I'm not saying we don't want this maximum concurrency for

I guess I need to understand it more; I thought the pipe bit was sequential, and the concurrent bit was the multiple pushes through the pipe
See my description of the Cylc graph pipeline implementation above. A pipeline has a special kind of concurrency, like (literally) a single pipe with multiple processors spaced along its length. That's why it's called a "pipe". There's only one of each processor, and the data passes through the pipe sequentially, but once the pipe is full all of the processors are running concurrently (on different data elements). That's still more concurrency than passing one data element through the whole pipe before pushing the next element in the front, which is what was happening on this branch yesterday. But it's less concurrency than creating a whole new pipe for each data element.

(Sorry @oliver-sanders - we should have this conversation on Element rather than cluttering up your PR!)
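That classic single-pipe concurrency can be sketched with one asyncio task per stage, connected by queues. Order is preserved because each stage is a single worker draining a FIFO, yet the stages run concurrently on different items. This is a hypothetical sketch of the model being described, not how `cylc.flow.async_util` is implemented:

```python
import asyncio

async def stage(func, inbox, outbox):
    # A single worker per stage: items leave in the order they arrive,
    # but different stages work on different items concurrently.
    while True:
        item = await inbox.get()
        if item is None:               # sentinel: shut down, pass it along
            await outbox.put(None)
            return
        await outbox.put(await func(item))

async def double(x):
    await asyncio.sleep(0.01)
    return x * 2

async def stringify(x):
    await asyncio.sleep(0.01)
    return str(x)

async def main():
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.ensure_future(stage(double, q1, q2)),
        asyncio.ensure_future(stage(stringify, q2, q3)),
    ]
    for x in (1, 2, 3):
        await q1.put(x)
    await q1.put(None)

    results = []
    while True:
        item = await q3.get()
        if item is None:
            break
        results.append(item)
    await asyncio.gather(*workers)
    return results

print(asyncio.run(main()))  # ['2', '4', '6']
```

Once the pipe is "full", both stages are busy at the same time, but the output order always matches the input order.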
Latest push looks good, appears functionally sound 👍 And concurrent "pipes" working:
A generator to begin our pipe with:

```python
>>> @pipe
... async def arange():
...     for i in range(10):
...         yield i
```
In your example the generator (`speak`) is not asynchronous as there is no `await` in it anywhere.

I was following this example of yours in the code (although I admit to being confused about whether an async generator needs an `await` or not - I thought maybe `yield` passes control back to the event loop).
(OK I think I understand how that works now).
`yield` yields control to a generator in synchronous code, `await` yields control to a coroutine in asynchronous code. We have an async-generator here; both paradigms still apply, but only in their own context.
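A small experiment illustrating the distinction (a sketch; `asyncio.sleep(0)` is the conventional way to explicitly hand control back to the event loop): iterating an async generator that only `yield`s never lets other tasks run, while one that `await`s does:

```python
import asyncio

async def agen_no_await():
    for i in range(3):
        yield i                  # bare yield: never suspends to the event loop

async def agen_with_await():
    for i in range(3):
        await asyncio.sleep(0)   # explicitly hand control to the event loop
        yield i

async def ran_during_iteration(agen):
    flag = asyncio.Event()

    async def other():
        flag.set()

    task = asyncio.ensure_future(other())
    async for _ in agen:
        pass
    ran = flag.is_set()          # did `other` get a look-in while we iterated?
    await task
    return ran

async def main():
    return (await ran_during_iteration(agen_no_await()),
            await ran_during_iteration(agen_with_await()))

print(asyncio.run(main()))  # (False, True)
```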
That's only with equal sleeps in my coroutines. If you change to

Agreed. We don't need all the scanned workflows to collectively form something coherent (e.g.
Yes, and purposefully so.

It never proclaimed to, it's an async pipe. It's possible to add a "preserve order" mode at some point, however, this isn't desirable for the

The main motives were:
Ok, three more commits which:

@dwsutherland, @datamel you might want to cast your eyes over these commits as they change the async logic.
It is async, for sure; I'm just arguing that it's not a pipe, it's more like a bunch of pipes. My classic pipe still has concurrency (albeit less of it), so I presume it could still have an async implementation and so still be called an "async pipe". Of course if "async pipe" is actually a known thing and doesn't have classic pipe behavior then I guess I'm just wrong about the meaning of the term. But either way there certainly are potential applications where preserved ordering would be expected, like line-by-line text file processing.

Great, even better 👍
I forgot to say up above, it's a great idea (especially with the pipe syntax) and I reckon you should consider making it available as a separate self-contained Python package. Then, as a generic library package, I'm sure people would want both kinds of behaviour for different applications.
Happy now 👍 (Thanks for assuaging my pipe-OCD 😬 )
Closes: #3615
Addresses: cylc/cylc-uiserver#96
Sibling to: cylc/cylc-uiserver#150, cylc/cylc-doc#138, conda-forge/cylc-flow-feedstock#10.
Re-write `cylc scan` and its underlying logic for simplicity, performance and for better incorporation with Cylc UI Server and TUI.

This is a precursor to:

Highlights:

- `cylc scan` and `cylc print`.

Async Pipes:

I've had a pet-project for a little while now of constructing asynchronous pipes in Python so this was a nice chance to use it.

Construct a pipe using `|` syntax like on the command line:

The pipe is an async generator, iterate through it like so:

Once items start coming out of `scan` they get pushed into the next stage of the pipe; in theory each stage of the pipe could be running over different items simultaneously.

Sounds complicated but it actually only takes a few lines of Python to make this happen; most of `cylc.flow.async_util` is just documentation.

Here's your minimal scan client under the new system:
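Roughly, the consumption pattern looks like this; the `scan` stub and the field names here are hypothetical stand-ins for the real pipe, included purely to show the `async for` usage:

```python
import asyncio

async def scan():
    # Hypothetical stub standing in for the real scan pipe, which
    # yields information about each flow as it is discovered.
    for flow in ({'name': 'generic'}, {'name': 'other'}):
        yield flow

async def main():
    names = []
    async for flow in scan():    # the pipe is just an async generator
        names.append(flow['name'])
    return names

print(asyncio.run(main()))  # ['generic', 'other']
```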
File System Efficiency:

In order to perform a scan we need to perform a potentially very large number of directory listings:

- The `cylc-run` directory.
- Each `cylc-run/*` directory (this is recursive!).
- The `.service` dir of each registered flow.

Before, all of these directory listings happened one after the other; they were all blocking, not only to each other but also to the other operations involved in the scan (e.g. contacting the flows themselves).
Now it is possible for listings to happen simultaneously thanks to a bit of `libuv` magic. This is the library behind Node.js; there are a few lines of code required to marry up the `libuv` event loop to `asyncio`, after that it's plain sailing.

Scan Client:
The scan client is now pretty simple, I've knocked together a basic interface though this can be easily changed.

- `cylc scan` with no opts or args is still as compatible with Cylc 7 as it can be.
- `name` and `tree` formats replace `cylc print`.
- `rich` format.
- `--colour-blind` option which still uses colour (`--color=never` to disable completely) but ensures that colour isn't required to interpret the display.

TODO:
Questions:

- `cylc scan` interface? None
- `cylc scan -t raw`? No
- `cylc scan -t json` interface? No
- `cylc print`, the only thing print does which scan doesn't is to list the reg dir (which scan could do if desired)?

Requirements check-list
- `CONTRIBUTING.md` and added my name as a Code Contributor.