Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address code confusion in the lang package regarding node names #5021

Closed
jsternberg opened this issue Jul 25, 2022 · 1 comment
Closed

Address code confusion in the lang package regarding node names #5021

jsternberg opened this issue Jul 25, 2022 · 1 comment

Comments

@jsternberg
Copy link
Contributor

In our code for the planner, we have some interactions that are less than ideal and can cause a great amount of confusion.

Some background is required as the problem only arises when you see the interaction between different components. There isn't a single component that is "the problem", but more how the components interact with each other.

Dataset ids are used in the code to determine the source from where data is coming from. We use this in multi-parent transformations such as join to determine where data came from to do things like properly joining the data. For single parent transformations, this information is generally disregarded as it isn't important. When we produce a plan, we include those dataset ids as they can help debug information about the plan.

To produce dataset ids, we take the plan node name and produce a hash. We do this for some consistency with the dataset ids so we can compare different plans which makes debugging a bit easier. We considered changing this to a random dataset id, but this was ultimately rejected. The PR is here but discussion about the implications was on private channels and I don't really know where that discussion is anyway.

This ultimately led to a problem when it came to planner rules. A planner rule would occasionally need to create a new node and it would need to choose a name. We had this happen and it generated two different plan nodes with identical names so the dataset ids were not unique. In order to have a valid plan, the node names also need to be unique. To address this, we added CreateUniquePhysicalNode to keep some state that would add a sequential number to the plan node name to keep things unique.

Unfortunately, this code path also stores this state in the context and requires that state to be injected into the context. Recently, we discovered that the repl was not properly injecting this state so CreateUniquePhysicalNode just didn't work in the repl. This is because it's injected in AstProgram which is an admittedly strange place to insert it.

This might just be a simple refactor to move where this happens. This might be refactoring this section of code to not use the context to store such important state. Something that probably shouldn't be optional. But, I also believe this brings up something that is generally more confusing. It can be unclear exactly where the entry point for invoking a flux program starts. We may want to also explore that and create new issues for refactoring this package in general. It is a very confusing package to follow the logic and has mostly existed the way it is for a long time without any modifications.

@github-actions
Copy link

This issue has had no recent activity and will be closed soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant