-
Notifications
You must be signed in to change notification settings - Fork 905
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRAFT] Dataset factory parsing rules demo #2559
Conversation
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
return matches[0] | ||
|
||
|
||
def specificity(pattern): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method just counting the number of characters that are not in brackets? Wouldn't it be easier to just remove the text in {} and then get the length of the string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is what is happening here but using parse
instead of a regex.
for key, value in best_match_config.items(): | ||
string_value = str(value) | ||
try: | ||
formatted_string = string_value.format_map(result.named) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should only happen for dataset fields with {}
otherwise you get weird things like:
shuttles:
filepath: data/01_raw/shuttles.xlsx
load_args: data/01_raw/shuttles.xlsx
type: pandas.ExcelDataSet
But that's not really part of the matching logic, so not super important here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be changed to only match with patterns containing {}
but this logic will still work for dataset entries that are not patterns(exact matches) because of how parse
works -
parse("xyz", "abc")
returns None
(not a match at all - ignore)
parse("xyz", "xyz")
returns <Result () {}>
(exact match, not a pattern - result is not empty so still added to matches list)
parse("hello {name}", "hello world")
returns <Result () {'name': 'world'}>
(potential match- added to matches list)
So the exact match then gets chosen as the best match and since it wouldn't contain any brackets anywhere in the catalog entry, it would not be changed at all.
I think these rules look generally good 👍 I also think we need all three of them, and the order you suggest them is right, i.e. count non-bracketed characters, then count number of brackets, then sort alphabetically. We shouldn't need to consider the specificity of your example #5 ( Also I don't think it's a problem that #1 and #3 are the same or that we need to go through and remove duplicates. These mean exactly the same thing so we don't really care how they rank relative to each other. In practice they would be distinguished based on the final alphabetical order, and then it will always be the same one that gets matched against first and the other one will just get ignored. |
I also like the fact that the rule is simple. I suspect there will be edge cases but I am not too nervous about it. In general we should encourage people to write simple catalog, if it requires very obscure rules it's likely that they can just simplify it by changing the structure. One missing example that I can think of is whether we should consider namespace as a priority. Consider the name France.companies Which pattern will win?
Previous conversation favor the former one as it's more specific about namespace. This however require some more sophisticated rule. It shouldn't be hard tho, we can simply split on the dot and derive the score from there. |
Hmm, that's an interesting example about the namespaces. I haven't thought it through fully, but I can't immediately think of a rule that would would order these the way you want without it being quite complex though? To me it's also not immediately obvious that |
Something else I just thought of. Let's say you have these patterns:
Then the order of these would be 3 > 4 > 2 > 1 which is not really obvious, just because "switzerland" is a longer word than "companies". So a dataset name Maybe it would be less arbitrary to count number of unbracketed "chunks" outside brackets rather than number of characters? This would rank all the equally so would then need another rule to order them. Or maybe we should consider the order of where unbracketed chunks appear relative to bracketed ones in the pattern? Or maybe this is just an edge case that won't come up in practice and we shouldn't care about it? Note this is nothing specific to do with namespacing: if you replace |
I agree we don't need to treat the namespace specially, however for a good convention it's most likely that people should split it using some kind of delimeter for these pattern. It's also important to note that the patterns aren't mean for refactoring, there are still value to create explicit entry, it's probably not worth it to use a pattern to just replace two or three similar entries. (same apply for Jinja) Can we try to do this exercise with a realistic example? I think @merelcht example could be useful for documentation anyway, so it won't be a waste. It would be interesting to have at least two people to go through the same exercise and see what's the difference. I see the main use cases are two:
I should add that I think we can start with just counting characters and see how well it goes. |
Just checking with my understanding, for 2. with jinja you will loop through the list inside the YAML file. In contrast, with Dataset Factory you will loop through this directly inside the nodes file instead? |
My assumption is that any user ideally would not have a lot of "dataset patterns" that would match a given dataset name. As long as we have a set of simple + deterministic rules to order the patterns, we could leave it up to the user to deal with cases where multiple matches exist. |
Closing in favour of #2635 |
Description
Related to #2508
This PR is just to demonstrate/experiment with parsing rules - not the actual dataset factory prototype.
Development notes
I've added a
kedro catalog resolve
(Similar to what @merelcht did in #2560)For each named dataset used by the pipelines ->
It loops over all the catalog entries, finds patterns that would be a match and then selects the best match pattern (
pick_best_match
function)Problem Statement
The
matches
list contains all the patterns that match the dataset name - for example forfrance.companies
, all the patterns listed below would be a match -We need to decide the ranked order of these entries to find the best match.
Proposed Solution
#5 -> #4 -> #6/ #3/ #1 -> #2
#6
,#3

- pick the highest number of bracket pairs. Example:{namespace}.{dataset}
would be chosen over{dataset}s
.f{*}
should rank above{*}s
Notes / Challenges
The datasets that are supposed to be default/
MemoryDataSet
and parameters (eg.params:model_input
) will falsely match against the pattern. This will lead to runtime errors and we will need to document this.Checklist
RELEASE.md
file