-
Notifications
You must be signed in to change notification settings - Fork 910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC [KED-3023] - Enable plugins to extend starter-aliases default list #1265
Conversation
Signed-off-by: Yolan Honoré-Rougé <yolan.honore.rouge@gmail.com>
Signed-off-by: Yolan Honoré-Rougé <yolan.honore.rouge@gmail.com>
cbfecfd
to
6cca833
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Galileo-Galilei! Thanks very much for your work on this. In general I like this a lot and think that using entrypoints is a good solution - I'm just not sure on some of the details.
My main comment is that, even though different git repos is the most obvious use of this, it's conceivable that I might want to share a starter by some other means (a shared filepath say). kedro new --starter
is very flexible and can things like filepaths rather than just git repos, so I think we should make sure that extending starters is equally as generic. Similarly I might want to share two different starters that point to exactly the same place but just have different tags, i.e. different checkout
arguments. I think we should allow this too just to keep things generic. So my suggested format for the dictionary would be:
_STARTER_ALIASES = {"name": {"template_path": ..., "directory": ..., "checkout": ...}}
where directory
and checkout
are optional (directory
defaults to None
).
As for your other questions...
should we add extra conversion / parsing when loading the entrypoint to check the format of the constant (I think we should, this is likely error prone right now)
No strong feelings here - we don't check other entrypoints very carefully I think, and this is quite an advanced feature where there will just be one simple pattern to copy and paste from a documented example. So I'm not too worried about verifying the input.
should we enable plugin authors to override "official" starters? My current code prevents this but it may be acceptable.
I think best to leave the official ones unaltered - see next answer for why.
should we make a visual distinction when listing the starters ? i add an extra (*) at the beginning to ensure they are listed first and identifiable at first glance.
No strong feelings, but my suggestion would be to separate the output into sections according to the plugin source. Like how we do for global/project commands if you just type kedro
:
This would work nicely if you have starters coming from multiple plugins (assuming that's possible).
cc @lorenabalan
kedro/framework/cli/utils.py
Outdated
@@ -31,6 +31,7 @@ | |||
"line_magic": "kedro.line_magic", | |||
"hooks": "kedro.hooks", | |||
"cli_hooks": "kedro.cli_hooks", | |||
"starter": "kedro.starter" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"starter": "kedro.starter" | |
"starters": "kedro.starters" |
Just seems more natural to me (like hooks
).
Not sure whether this should go inside some hierarchy like kedro.constants.
as you suggested before (but then what else might go in there?) or whether it belongs at the top level 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hesitated for the "s", let's go for the suggestions. I know we originally discussed about calling it "constants", but I make very strong assumptions on the format when looping on "starters" entrypoint, so This is clearly not generic enough to accept other constants (+ I do not see any other use case as you mention)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I'm happy to have it as a top-level entry point kedro.starters
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and I've just remembered why I choose to call it "starter" instead of "starters": it is the name tof the CLI group command (kedro starter list
) so I wanted to be consistent. What do you want to do? Keep the "s" or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense. Personally I'd go for starters
since I wouldn't be too surprised if kedro starter
at some point in the future somehow gets subsumed into kedro new
, and the inconsistency here makes grammatical sense to me anyway. But let's see what someone else thinks too - @lorenabalan?
Hi @AntonyMilneQB, thanks for the review. The more I think about it, the more I think the entrypoint solution is "pythonic" indeed. I even found a reference doing something similar.
I was wondering if setting the
I actually had the very same idea at first, but I don't know how to make it work honestly 😅 I want to reuse your |
… Display with separation between plugins
Fair enough. I was thinking that
Ah yes I see, that is annoying. The type hint with |
I've just managed to make it work but the code is poorly readable. I do not use Edit: yes, I used exactly what you suggested, i found the same code snippet.
Yes that's what I figured out in the code. It makes sense for kedro itself to have a default to the starter matching the kedro version, but i don't think it really does for user defined starters. I implemented it however to match your suggestion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! I left a couple more suggestions but this is looking really good now! Let's try and get a second pair of eyes on it before it becomes too dominated just by my opinion.
Yes that's what I figured out in the code. It makes sense for kedro itself to have a default to the starter matching the kedro version, but i don't think it really does for user defined starters. I implemented it however to match your suggestion.
Yeah, this is something I've wondered before also. Just for the purposes of this PR, I think we should leave the current behaviour as it is (i.e. by default checkout
= kedro version, not main
) and save possible changes to that for a separate discussion.
Then the only question in my mind then is whether plugin-defined starters should be allowed to define checkout
themselves or not. Not a big deal either way - I'm happy just to defer to you on it or see what the third opinion says.
kedro/framework/cli/starters.py
Outdated
starters_by_origin = {} | ||
for starter_name, starter_config in starters_aliases.items(): | ||
starter_raw_config = { | ||
starter_name: {k: v for k, v in starter_config.items() if k != "origin"} | ||
} | ||
if starter_config["origin"] in starters_by_origin: | ||
starters_by_origin[starter_config["origin"]].append(starter_raw_config) | ||
else: | ||
starters_by_origin[starter_config["origin"]] = [starter_raw_config] | ||
for origin, starter_aliases in starters_by_origin.items(): | ||
click.secho(f"Starters from '{origin}'\n", fg="yellow") | ||
output = [] | ||
for starter_alias in starter_aliases: | ||
for starter_name, starter_config in starter_alias.items(): | ||
starter_name_displayed = ( | ||
f"{starter_name} " | ||
f"(template_path={starter_config.get('template_path')}, " | ||
f"directory={starter_config.get('directory')}, " | ||
f"checkout={starter_config.get('checkout')})" | ||
) | ||
output.append(starter_name_displayed) | ||
click.echo(yaml.safe_dump(output)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few possible simplifications here. I think we can rely on Python dictionary ordering, so you can do it without needing to restructure the dictionary at all:
last_origin = "kedro"
click.secho(f"Built-in starters\n", fg="yellow")
for starter_name, starter_config in starters_aliases.items():
origin = starter_config.pop("origin")
if origin != last_origin:
click.secho(f"\nStarters from {origin}\n", fg="yellow")
last_origin = origin
click.echo(f"{starter_name}: {starter_config}")
Or if you do want to restructure the dictionary then it's probably possible using something like this.
... which is not super beautiful I will admit 😅 It would be nicer to make this a bit more human-presentable, maybe by using yaml.dump(default_flow_style=True)
instead of just printing the raw dictionary. But I do like only giving directory
and checkout
when they're explicitly defined. Overall it would be nice if the readability of the output kedro starter list
wasn't significantly degraded compared to how it currently looks in spite of the additional generality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, but I wouldn't do pop()
because we'd only get a good output the first time user runs kedro starter list
. The second time it's run, the key "origin" won't be there anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AntonyMilneQB I thought dictionnary ordering was only from python>=3.7
and forward and I explicitly avoid relying on dict ordering because I know you still support python=3.6
:) But I was wrong so glad to make it more readable.
@lorenabalan I think pop()
should be fine because the _get_starter_aliases()
is called at the beginning of the function each time, and it does itself copy the _STARTERS_ALIASES constant. It would be safer to explictly deepcopy the dict in case of future changes though, do you prefer to enforce it?
I am still experimenting to see if we can have something more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Galileo-Galilei for taking a stab at this! 🙌 I like the general direction of this. :) My main observation is that I'd like us to limit the scope to just "auto-discovery of starters from plugins", and not worry about custom checkout
and directory
conventions. If we change anything around there it will automatically make this PR a breaking change.
I believe we shouldn't allow natively supported starter aliases to be overwritten, that can cause a lot of confusion. Also some minimal checks on stuff returned from the entrypoint would be nice. We don't have checks on the others because hooks and CLI commands are harder to validate than a plain dictionary. If a user-defined plugin defined a bad starter entrypoint, we should wrap it up in a nice exception message.
Would love to also see some documentation added before merging.
kedro/framework/cli/starters.py
Outdated
starters_by_origin = {} | ||
for starter_name, starter_config in starters_aliases.items(): | ||
starter_raw_config = { | ||
starter_name: {k: v for k, v in starter_config.items() if k != "origin"} | ||
} | ||
if starter_config["origin"] in starters_by_origin: | ||
starters_by_origin[starter_config["origin"]].append(starter_raw_config) | ||
else: | ||
starters_by_origin[starter_config["origin"]] = [starter_raw_config] | ||
for origin, starter_aliases in starters_by_origin.items(): | ||
click.secho(f"Starters from '{origin}'\n", fg="yellow") | ||
output = [] | ||
for starter_alias in starter_aliases: | ||
for starter_name, starter_config in starter_alias.items(): | ||
starter_name_displayed = ( | ||
f"{starter_name} " | ||
f"(template_path={starter_config.get('template_path')}, " | ||
f"directory={starter_config.get('directory')}, " | ||
f"checkout={starter_config.get('checkout')})" | ||
) | ||
output.append(starter_name_displayed) | ||
click.echo(yaml.safe_dump(output)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment, but I wouldn't do pop()
because we'd only get a good output the first time user runs kedro starter list
. The second time it's run, the key "origin" won't be there anymore.
kedro/framework/cli/starters.py
Outdated
for starter_name, starter_config in starter_alias.items(): | ||
starter_name_displayed = ( | ||
f"{starter_name} " | ||
f"(template_path={starter_config.get('template_path')}, " | ||
f"directory={starter_config.get('directory')}, " | ||
f"checkout={starter_config.get('checkout')})" | ||
) | ||
output.append(starter_name_displayed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do that much formatting ourselves because yaml.dump
can handle Python dictionaries. It's what we do for kedro catalog list
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made another attempt with yaml.dump
. i am a bit worried about readibilty though, since it displayed inforamtion on several lines for each starter and there are a lot of them, so it may hurt discoverability. Current implementation looks like this:
I can reorder the keys (to have "name" as the first key) and remove unused key (origin
or directory
when it is None
), but I still find it harder to read than above suggestion. What do you think?
Hi @lorenabalan and @AntonyMilneQB, I've made a couple of changes since last review, tell me what you think:
{"astro-airflow-iris": {
"template_path": _STARTERS_REPO,
"directory": "astro-airflow-iris",
}, ...} to this [
{
"name": "astro-airflow-iris",
"template_path": _STARTERS_REPO,
"directory": "astro-airflow-iris",
}, ...] Avoiding nested dictionnary make manipulations easier (reordering by "origin" or "name" depending on what is needed) is easier this way, and I think this makes code more readable in general.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all your work on this @Galileo-Galilei. As soon as there's some tests and docs feel free to move it from a Draft, and I'm sure I'll be happy to approve 🙂
Your fake plugin repo is a bit out of date given the new dictionary structure for STARTER_ALIASES
you're using - here's one that's up to date if you want to play around more: https://github.com/AntonyMilneQB/fake-plugin-starter
kedro/framework/cli/starters.py
Outdated
import os | ||
import re | ||
import shutil | ||
import stat | ||
import tempfile | ||
from collections import OrderedDict | ||
from pandas import value_counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is used anywhere, but kedro doesn't have pandas
as a dependency so we shouldn't use any of their functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an auto import from my IDE, this is not used indeed.
kedro/framework/cli/starters.py
Outdated
@@ -23,21 +27,49 @@ | |||
_clean_pycache, | |||
_filter_deprecation_warnings, | |||
command_with_verbosity, | |||
# load_entry_points, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# load_entry_points, |
kedro/framework/cli/starters.py
Outdated
def _check_starter_entrypoint_config(module_name: str, config: Dict[str, str]) -> None: | ||
if not isinstance(config, dict): | ||
raise ValueError( | ||
f"The starter configuration loaded from entrypoint should be a 'dict', got '{config}' instead" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this error could be clarified a bit. Currently with your fake starter I get this:
ValueError: The starter configuration loaded from entrypoint should be a 'dict', got 'my_fake_starter' instead
- can we put
module_name
in there like in the other error messages? - did you mean
type(config)
? - this sounds like a
TypeError
?
kedro/framework/cli/starters.py
Outdated
missing_keys = mandatory_keys - set(config.keys()) | ||
if missing_keys: # mandatory keys | ||
raise ValueError( | ||
f"Entrypoint kedro.starters from {module_name} must have the following missing keys: '{missing_keys}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"Entrypoint kedro.starters from {module_name} must have the following missing keys: '{missing_keys}'" | |
f"Entrypoint kedro.starters from {module_name} must have the following keys, which are currently missing: '{missing_keys}'" |
kedro/framework/cli/starters.py
Outdated
mandatory_keys = {"name", "template_path"} | ||
optional_keys = {"directory"} | ||
|
||
missing_keys = mandatory_keys - set(config.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor suggestion:
missing_keys = mandatory_keys - set(config.keys()) | |
provided_keys = set(config.keys()) | |
missing_keys = mandatory_keys - provided_keys |
and then reuse provided_keys
on L111
kedro/framework/cli/starters.py
Outdated
"astro-airflow-iris": {"template_path": ..., directory=..., "origin": "kedro"}, | ||
..., | ||
"my-awesome-starter": {"template_path": ..., directory=..., "origin": "my-awesome-plugin"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this or the type hint are up to date any more?
kedro/framework/cli/starters.py
Outdated
): | ||
module_name = starter_entry_point.module_name.split(".")[0] | ||
for starter_config in starter_entry_point.load(): | ||
_check_starter_entrypoint_config(module_name, starter_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I have an invalid starter specification then this will raise ValueError
and I won't be able to see the built-in starters. I would need to pip uninstall
the plugin in order to get kedro starter
working again I guess?
This is ok, but I wonder if it would be better to make those exceptions just warnings instead. No strong feelings, up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well spotted. I guess it should raise a warning. As a user, I would hate having an error I don't understand here and I cannot correct here.
kedro/framework/cli/starters.py
Outdated
# see https://www.geeksforgeeks.org/group-list-of-dictionary-data-by-particular-key-in-python/ | ||
# this returns a nested dictionary {name1: {template_path: xxx, directory: xxx}, name2: {template_path: xxx, directory: xxx}, ...} | ||
starters_aliases_by_name = { | ||
k: list(v)[0] for k, v in groupby(starters_aliases, key=itemgetter("name")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k: list(v)[0] for k, v in groupby(starters_aliases, key=itemgetter("name")) | |
name: list(v)[0] for name, v in groupby(starters_aliases, key=itemgetter("name")) |
and some other descriptive name for v
if possible? I like how you named variables in starters_aliases_by_origin
🙂
@@ -288,7 +399,7 @@ def _get_cookiecutter_dir( | |||
f" Specified tag {checkout}. The following tags are available: " | |||
+ ", ".join(_get_available_tags(template_path)) | |||
) | |||
official_starters = sorted(_STARTER_ALIASES) | |||
official_starters = sorted(_STARTERS_ALIASES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message that follows this isn't just going to give a list of inbuilt starters any more; it will give the whole _STARTERS_ALIASES
dictionary which isn't good.
Given that now we can have starters from other sources maybe we should just recommend the user to run kedro starter list
to see what's available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will return only inbuilt starters: the dictionarry is always deepcopied when _get_starters_aliases
is called, so I think the message is still giving the right result. But maybe it would be better to call explicitly _get_starters_aliases()
here and modify the message accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry I wasn't clear here. The problem I see isn't so much that this shows only inbuilt starters (though maybe that is a small problem also) but what the exception message will say. Won't this line print out the whole dictionary, with all its subkeys etc., rather than just showing the alias names of the starters?
yaml.safe_dump(official_starters)
kedro/framework/cli/starters.py
Outdated
starters_aliases = _get_starters_aliases() | ||
starters_aliases_by_origin = { | ||
# origin: [ | ||
# { | ||
# k: v | ||
# for k, v in config.items() | ||
# if k in {"name", "template_path", "directory"} and v is not None | ||
# } | ||
# for config in config_list | ||
# ] | ||
origin: list(configs) | ||
for origin, configs in groupby( | ||
sorted(starters_aliases, key=itemgetter("origin")), key=itemgetter("origin") | ||
) | ||
} | ||
|
||
# ensure kedro starters are listed first | ||
kedro_starters = starters_aliases_by_origin.pop("kedro") | ||
click.secho(f"\nBuilt-in starters\n", fg="yellow") | ||
click.echo(yaml.dump(kedro_starters)) | ||
|
||
for origin, starter_config in starters_aliases_by_origin.items(): | ||
click.secho(f"\nStarters from {origin}\n", fg="yellow") | ||
click.echo(yaml.dump(starter_config)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After spending WAAAY too long playing around with this (and for some time unknowingly working on something that just reproduced what you had obviously thought of before and commented out...), here's the best I've come up with:
starters_aliases_by_origin = groupby(
sorted(_get_starters_aliases(), key=itemgetter("origin")),
key=itemgetter("origin"),
)
for origin, starter_config in starters_aliases_by_origin:
click.secho(f"\nStarters from {origin}\n", fg="yellow")
for starter in starter_config:
del starter["origin"]
name = starter.pop("name")
click.echo(yaml.dump({name: starter}))
If we care about ordering so that kedro
is always first, you could do the below, but personally I don't think it's worth the extra complexity and would just leave it unordered.
starters_aliases_by_origin = {k: list(v) for k, v in groupby(
sorted(_get_starters_aliases(), key=itemgetter("origin")),
key=itemgetter("origin"),
)}
for origin, starter_config in sorted(starters_aliases_by_origin.items(), key=lambda x: x[0] != "kedro"):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I played a lot with the formatting too and get to the same conclusion as you. I'd personnally prefer showing built-in starters first but it is really not a big deal.
Hello @Galileo-Galilei, I was just wondering if you would mind if we on the kedro team took this over? It feels like a nice feature and I think we're most of the way there so would be nice to get it in 🙂 Am I right in thinking that it's basically just tests and documentation left to do, or did you want to put anything else in here? |
Hi @AntonyMilneQB, sorry for the late reply. There are indeed "only" tests and documentation to write but it's still a couple of hours of work (there are a bunch of edge case scenarii + the test setup is quite complicated to emulate plugin registration). I have very little time for now, so feel free to take over the PR, else I might resume working on it in a couple of weeks. Thank you very much for the guidance through the code review and sorry for not being as available as I wish I could. |
Takeover this in #1592 |
Closing this in favour of #1592. Thank you very much for all your initial work on this and the idea in the first place @Galileo-Galilei! |
Description
This PR implements part of the suggestions described in #1040 and creates a fully functional examples on how kedro starter discovery could be modified to enable plugins to created aliased starters.
Development notes
kedro.starter
entrypoint which can is iterated over to update the _STARTERS_ALIASES constant.kedro new
andkedro starter list
commands to use the updated _STARTERS_ALIASES constant (with new aliases discovered in the entrypoints).As of now, the code is already working. You can test it with :
and see the following output:
You can have a look at the plugin code if needed.
Some questions are still to be answered:
Checklist
RELEASE.md
fileAs of now, I did not add tests and documentation: I'll update them once you will confirm these changes are desirable