Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

global universal identifier #3931

Merged
merged 92 commits into from
Jan 17, 2022
Merged

global universal identifier #3931

merged 92 commits into from
Jan 17, 2022

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Nov 10, 2020

closes #3592

Draft Done!

This has been a much harder conversion than anticipated, many apologies :(
Will try and thrash out the following shortly:

  • Need to fix tests/integration/test_data_store_mgr.py.
  • Awaiting more tests.
  • Need to fix / re-write tests/f/graphql/02-root-queries.t.
  • Need to sort out task and edge IDs in the data store.
  • Needs UserInputError messages in cylc.flow.id_cli.
  • Needs a little tidying.
  • Need to fix Tui.
  • Should try to make tokens a little safer
    • Perhaps subclass Dict to screen the tokens.
    • Or make the dictionary keys use the Tokens enum.
  • UIS branch - universal id: adopt the new universal ID format cylc-uiserver#297
  • UI branch - Universal ID cylc-ui#876

Convert cylc-flow to use the global universal identifier

Overview of UID:

  • See the cylc-admin proposal
    • Full format: ~user/workflow//cycle/task/job.
    • Relative references (CLI only): //cycle/task/job.
  • Task IDs are now nicely lexicographically sortable (providing all cycle points are of the same length).
    • Consequently there have been some sort order changes in cylc graph --reference and cylc play --reference-log output.

The universal ID replaces/unifies:

  • The data store format:
    • user|workflow|cycle|task|job
    • user|workflow|task-definition
  • The task proxy format:
    • task.cycle
    • task
  • The alternative task proxy format:
    • cycle/task
    • cycle
  • The job dir format:
    • cycle/task/job

There should now only be one interface for parsing to and from string IDs as well as a standard
tokens dictionary format for passing around parsed ID information.

Here are some of the parsing interfaces removed in this PR:

It also unifies the pattern matching in the:

  • Data store.
  • Task pool.
  • Task proxy.
  • CLI.

There is plenty of scope for future work better adopting the new tokens internal
interchange format to reduce the amount of needless parsing to / from string IDs.

CLI Overview:

# CLI help
cylc help id

# deprecated (but still supported)
cylc trigger workflow task.cycle

# new
cylc trigger workflow//cycle/task
cylc trigger workflow //cycle/task  # alt

# multi tasks
cylc trigger workflow//cycle1/task1 workflow//cycle2/task2
cylc trigger workflow //cycle1/task1 //cycle2/task2  # alt

# multi workflows
cylc trigger workflow1//*:failed workflow2//:failed workflow2//:submit-failed
cylc trigger workflow1//*:failed workflow2 //:failed //:submit-failed  # alt

Behaviour Changes:

  • CLI:
    • Src workflow paths must now start with "./" or be absolute.
      • We used to allow relative paths and check for ambiguity with run dirs.
      • E.G. cylc validate foo could mean ./foo or ~/cylc-run/foo.
      • I have not re-integrated this functionality.
      • So now if you want ./foo you must type ./foo.
      • Nicer?
    • Using the absolute path to a run directory is no longer supported (use the workflow ID).
    • Task patterns are now explicit rather than implicit.
      • Previously the absence of a task ID meant all tasks (*).
      • Now if you want to act on all tasks you must explicitly say so.
    • Cylc clean can now clean multiple workflows and prompts rather than fails in non-force mode.
  • Selected CLI commands have been upgraded to support multi-workflow functionality
    via cylc scan.
    • E.G. The missing: cylc stop '*'!
  • Task meta-queries with cylc show must now be provided with the --task-def argument.
  • Trigger now globs against the task pool (and matches future tasks) same as cylc hold.
    • It used to match more broadly.
    • Now if you want to trigger non-active tasks you must list them explicitly.
    • Removed tests/functional/cylc-trigger/05-filter-cycles.t
  • Namespace objects in the data store (i.e. task and family definitions) now have the ID //*/<name>:
    • Definitions don't fit into the UID scheme (hence using * for the cycle point).
    • Consider using a special character to cater to them more nicely E.G. //!namespace

Terminology & Variable Names:

UID gives us a good opportunity to standardise naming conventions, the UID scheme suggests the following:

Variable names:

  • User (not owner).
    • Owner was previously used to distinguish from the --user (Cylc used to have multi-user functionality).
  • Workflow (not flow).
    • Keep "flow" for "reflows".
  • Cycle (not point).
  • Task (not name).
  • Job (not submission_no, submission_num, etc).

CLI:

  • WORKFLOW_ID (not REG, FLOW).
  • ID (not TASK, CYCLE, TASK_ID, TASK_GLOB, etc).

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes - PLEASE REBASE SQUASH
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Changelog entry.
  • Docs - universal id cylc-doc#377
  • Downstream PRs.
  • No dependency changes.

@oliver-sanders oliver-sanders added this to the cylc-8.0a3 milestone Nov 10, 2020
@oliver-sanders oliver-sanders self-assigned this Nov 10, 2020
@oliver-sanders
Copy link
Member Author

oliver-sanders commented Nov 10, 2020

@dwsutherland I think this will marry well with the pre-existing universal identifier. Tagged you as the original author of the universal identifier to make sure I haven't thrown in any curveballs.

The only thing that I think will trip it up at the moment is the ability to specify selectors on cycle points (as opposed to tasks) though I'm not entirely sure what that would mean in different contexts and we can block that functionality for now.

cylc/flow/id.py Outdated
Returns:
list - List of tokens dictionaries.

Examples:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples show the possible formats and upgrading of legacy ids.

@dwsutherland
Copy link
Member

dwsutherland commented Nov 11, 2020

@oliver-sanders - Really amazing work, I wish I was as sophisticated as you!

The only thing that I think will trip it up at the moment is the ability to specify selectors on cycle points (as opposed to tasks) though I'm not entirely sure what that would mean in different contexts and we can block that functionality for now.

Selectors on cycle points would probably target the root family of that cycle-point (from an API perspective)

Also have you thought about the APIs use of the following ID format?
'~user/flow:flow_sel//task:task_sel'
(for task definitions.. perhaps we could put a None or the like in place for the sake of consistency)

Also, in case you/others need a reminder, ~user may seem a little redundant at the moment.. However, we have/will-have cross-workflows commands and graph representations (i.e. inter-workflow triggers represented by the nodes in a graph that belong to different workflows)

@hjoliver
Copy link
Member

Really amazing work, I wish I was as sophisticated as you!

Join the club - I think it's the "barnyard strength" cider he runs on 🍺

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Nov 11, 2020

Aw shucks, didn't know there was so much love for regexes s/.*/swag/.

Selectors on cycle points would probably target the root family of that cycle-point (from an API perspective)

Could work, seemed like a good idea at the time but it gets a little bit weird:

# sets outputs on every task in cycles that have at least one succeeded task in the pool
cylc set-outputs flow/2020:succeeded

# hold all cycles that have at least one failed task in the pool
cylc hold flow//2020:failed

Also have you thought about the APIs use of the following ID format?

Can use a wildcard

myflow//*/taskname

I think in the cases where you might want to specify a task without a cycle we use the -n argument (e.g. cylc broadcast) which enables us to sidestep this.

inter-workflow triggers represented by the nodes in a graph that belong to different workflows

Nice.

The other thing I was thinking about was multi-user support.

Users would not need to log into group accounts to control workflows e.g:

cylc hold ~group/flow//2020*

And also admin privileges e.g:

cylc hold ~alice/*

The command would be sent to the UIS for "alice" where it would authenticate as me and check authorisation as it would for a command from the UI.

Especially nice as Cylc Flow would not need to provide any multi-user support, it just piggy-backs on the UIS which already has to do this stuff for the UI.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Nov 11, 2020

Untested cycle/task matching for the task_pool (replaces the filter_task_proxies method):

    def filter_ids(self, ids, out=Tokens.Task):
        match = fnmatchcase
        # TODO: consider using wcmatch which would add support for
        #       extglobs, namely brace syntax e.g. {foo,bar}

        _cycles = []
        _tasks = []
        _jobs = []
        _not_matched = []

        for tokens in ids:
            for lowest_token in reversed(Tokens):
                if tokens.get[lowest_token.value]:
                    break

            cycles = []
            tasks = []
            jobs = []

            # filter by cycle
            if lowest_token == Tokens.Cycle:
                cycle = tokens[Tokens.Cycle.value]
                cycle_sel = tokens.get(Tokens.Cycle.value + '_sel', '*')
                for icycle, itasks in self.pool.items():
                    str_cycle = str(icycle)
                    if not match(str_cycle, cycle):
                        continue
                    if cycle_sel == '*':
                        cycles.append(str_cycle)
                        continue
                    for itask in itasks:
                        if match(itask.state.status, cycle_sel):
                            cycles.append(icycle)
                            break

            # filter by task
            elif lowest_token == Tokens.Task:
                cycle = tokens[Tokens.Cycle.value]
                cycle_sel = tokens.get(Tokens.Cycle.value + '_sel', '*')
                task = tokens[Tokens.Task.value]
                task_sel = tokens.get(Tokens.Task.value + '_sel', '*')
                for icycle, itasks in self.pool.items():
                    str_cycle = str(icycle)
                    if not match(str_cycle, cycle):
                        continue
                    if cycle_sel == '*':
                        cycles.append(str_cycle)
                        continue
                    for itask in itasks:
                        if (
                            match(itask.state.status, cycle_sel)
                            and match(itask.tdef.name, task)
                            and (
                                match(itask.state.status, task_sel)
                                or any(
                                    match(ns, task)
                                    for ns in itask.tdef.namespace_hierarchy
                                )
                            )
                        ):
                            tasks.append(itask)

            else:
                raise NotImplementedError

            if not cycles or tasks or jobs:
                _not_matched.append(tokens)
            else:
                _cycles.extend(cycles)
                _tasks.extend(tasks)
                _jobs.extend(jobs)

        ret = None
        if out == Tokens.Cycle:
            _cycles.extend([
                itask.point
                for itask in _tasks
            ])
            ret = _cycles
        elif out == Tokens.Task:
            _tasks.extend([
                self.pool[icycle]
                for icycle in _cycles
            ])
            ret = _tasks

        return ret, _not_matched

@hjoliver
Copy link
Member

Especially nice as Cylc Flow would not need to provide any multi-user support,

👍

Copy link
Member

@dwsutherland dwsutherland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty happy with these changes. Only a couple questions:

  • I guess the use of // & / is the best we can do? (slashes seem to be used in many places.. and even the double works in paths.. Can it be overused? or is it the URL safe choice)
  • I suppose there's no way to centralise // & / it can be changed easily? (i.e. f-string substitution in id.py)

Apologies if the discussion has already taken place (i.e. Proposal) on the above!

PS - If only there was a better way to pattern match selection, but I guess there's no avoiding the looping:

                for icycle, itasks in self.pool.items():
                    str_cycle = str(icycle)
                    if not match(str_cycle, cycle):
                        continue
                    if cycle_sel == '*':
                        cycles.append(str_cycle)
                        continue
                    for itask in itasks:

at least its only going through each task once I suppose O(n)... (same with non ID lookup resolvers)

@oliver-sanders
Copy link
Member Author

Good point about URL safety, will come back to after the beta release.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Jun 8, 2021

I guess the use of // & / is the best we can do?

I think the use of / is great (plays into dir structure and CLI auto-completion nicely), the // is not so good.

Here's the rationale behind using two characters:

# chars considered:
/ ! @ % ^ & * ( ) - _ = + [ ] { } / | ; : , . < > ~ #

# conflicts with hierarchical workflow registrations
/
# conflicts with glob chars in the shell and the :sel pattern
[ ] { } * 
# conflicts with the :sel prefix
:
# conficts with shell
; ( ) < > | & #
# conflicts with ~user prefix
~
# conflict with workflow name
/ _ + - . @
# conflict with task name
- + % @
# conflicts with extglob (shell)
!
# partial conflict with shell (substitution?)
^

# chars left:
=,

So that just leaves us with the following two char combinations:

//
/=
=/
/,
,/

https://github.com/cylc/cylc-admin/blob/master/docs/proposal-universal-id.md

@dwsutherland
Copy link
Member

dwsutherland commented Jun 9, 2021

Not , or = by itself?

@oliver-sanders
Copy link
Member Author

Possible, but a bit strange:

~user/flow=cycle/task/job
~user/flow,cycle/task/job

@oliver-sanders oliver-sanders marked this pull request as draft June 22, 2021 13:18
@oliver-sanders
Copy link
Member Author

(on draft waiting for #4230 which will provide the place for this argument parsing to be inserted)

@oliver-sanders oliver-sanders removed this from the cylc-8.0b2 milestone Jul 15, 2021
@wxtim

This comment has been minimized.

@oliver-sanders
Copy link
Member Author

If this is replicable it's undesirable

That's a broken import in cylc-uiserver which is fixed in cylc/cylc-uiserver#297

@hjoliver
Copy link
Member

Yikes, I merged @wxtim's log-message tweak branch, and now we have 300 conflicting files here 😱

@dwsutherland
Copy link
Member

Yikes, I merged @wxtim's log-message tweak branch, and now we have 300 conflicting files here 😱

Another reason this should go in ASAP 😅

@hjoliver
Copy link
Member

hjoliver commented Jan 14, 2022

Yikes, I merged @wxtim's log-message tweak branch, and now we have 300 conflicting files here scream

Another reason this should go in ASAP sweat_smile

Yeah, currently attempting to fix the conflicts myself... vim macros to the rescue

@hjoliver
Copy link
Member

@oliver-sanders - I merged from master and deconflicted, and pushed a couple of commits to your branch. Looks like all the tests will pass, except for test-tutorial-workflow.

Are you particularly attached to your ~90-commit history on this branch, or happy to have it squashed?

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did lots of manual testing. No problems found during quick code sanity check (thanks @MetRonnie and @dwsutherland for doing the real work 😁 )

@hjoliver
Copy link
Member

Feel free to merge this...

@dwsutherland
Copy link
Member

Feel free to merge this...

Will see if Oliver is happy with the squash.. I guess

@oliver-sanders
Copy link
Member Author

Did lots of manual testing. No problems found

Thanks, managed to get the diff coverage up to ~90% which is good but glad to have some heavy manual testing!

Feel free to merge this...

I still have one comment from Ronnie to address about metavars in the CLI.

There appear to be three forms remaining, my best attempt ATM:

  • WORKFLOW_ID - workflow, ~user/workflow
  • TASK_ID - cycle, cycle/task, cycle/task/job
  • ID - workflow, workflow//cycle, workflow//cycle/task, ...

Are you particularly attached to your ~90-commit history on this branch

No (checkbox unticked in desc) and we should not merge PRs with messy commits like this (boats the repo and makes it very hard to track down bugs).

I'm going to squash at the end, hadn't done it yet to make review easier.

@oliver-sanders
Copy link
Member Author

I'll address the remaining comments, rebase squash and push shortly.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Jan 14, 2022

Dammit, I don't think I can rebase-squash because of the merge commit, may have to squash merge.

@oliver-sanders
Copy link
Member Author

Ok, last feedback comments addressed, tests willing that should do it.

@MetRonnie can you once over the docs changes in this commit - 76770ea - which standardises the CLI metavars and adds more docs.

@hjoliver
Copy link
Member

No (checkbox unticked in desc) and we should not merge PRs with messy commits like this (boats the repo and makes it very hard to track down bugs).

Agreed, that's why I did not merge, and asked - I expected you still wanted to do some history rationalization.

@hjoliver
Copy link
Member

hjoliver commented Jan 17, 2022

Mysterious functional test failures disappeared when I reran the tests. 🎉

Right, this bad boy is going in...

@hjoliver
Copy link
Member

@oliver-sanders - I'll squash and merge, to a single commit. It doesn't sound as if you wanted to do more complex history surgery to end up with n > 1 logical commits, but if you did, we can always back up the truck.

@hjoliver hjoliver merged commit 8c81b09 into cylc:master Jan 17, 2022
@oliver-sanders oliver-sanders deleted the universal-id branch January 17, 2022 10:25
@oliver-sanders
Copy link
Member Author

Thanks all for reviewing, that was a horrible one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Universal suite/task/cycle delimiter (universal identifier)
5 participants