You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In our code for the planner, we have some interactions that are less than ideal and can cause a great amount of confusion.
Some background is required as the problem only arises when you see the interaction between different components. There isn't a single component that is "the problem", but more how the components interact with each other.
Dataset ids are used in the code to determine the source from where data is coming from. We use this in multi-parent transformations such as join to determine where data came from to do things like properly joining the data. For single parent transformations, this information is generally disregarded as it isn't important. When we produce a plan, we include those dataset ids as they can help debug information about the plan.
To produce dataset ids, we take the plan node name and produce a hash. We do this for some consistency with the dataset ids so we can compare different plans which makes debugging a bit easier. We considered changing this to a random dataset id, but this was ultimately rejected. The PR is here but discussion about the implications was on private channels and I don't really know where that discussion is anyway.
This ultimately led to a problem when it came to planner rules. A planner rule would occasionally need to create a new node and it would need to choose a name. We had this happen and it generated two different plan nodes with identical names so the dataset ids were not unique. In order to have a valid plan, the node names also need to be unique. To address this, we added CreateUniquePhysicalNode to keep some state that would add a sequential number to the plan node name to keep things unique.
Unfortunately, this code path also stores this state in the context and requires that state to be injected into the context. Recently, we discovered that the repl was not properly injecting this state so CreateUniquePhysicalNode just didn't work in the repl. This is because it's injected in AstProgram which is an admittedly strange place to insert it.
This might just be a simple refactor to move where this happens. This might be refactoring this section of code to not use the context to store such important state. Something that probably shouldn't be optional. But, I also believe this brings up something that is generally more confusing. It can be unclear exactly where the entry point for invoking a flux program starts. We may want to also explore that and create new issues for refactoring this package in general. It is a very confusing package to follow the logic and has mostly existed the way it is for a long time without any modifications.
The text was updated successfully, but these errors were encountered:
In our code for the planner, we have some interactions that are less than ideal and can cause a great amount of confusion.
Some background is required as the problem only arises when you see the interaction between different components. There isn't a single component that is "the problem", but more how the components interact with each other.
Dataset ids are used in the code to determine the source from where data is coming from. We use this in multi-parent transformations such as join to determine where data came from to do things like properly joining the data. For single parent transformations, this information is generally disregarded as it isn't important. When we produce a plan, we include those dataset ids as they can help debug information about the plan.
To produce dataset ids, we take the plan node name and produce a hash. We do this for some consistency with the dataset ids so we can compare different plans which makes debugging a bit easier. We considered changing this to a random dataset id, but this was ultimately rejected. The PR is here but discussion about the implications was on private channels and I don't really know where that discussion is anyway.
This ultimately led to a problem when it came to planner rules. A planner rule would occasionally need to create a new node and it would need to choose a name. We had this happen and it generated two different plan nodes with identical names so the dataset ids were not unique. In order to have a valid plan, the node names also need to be unique. To address this, we added CreateUniquePhysicalNode to keep some state that would add a sequential number to the plan node name to keep things unique.
Unfortunately, this code path also stores this state in the context and requires that state to be injected into the context. Recently, we discovered that the repl was not properly injecting this state so
CreateUniquePhysicalNode
just didn't work in the repl. This is because it's injected inAstProgram
which is an admittedly strange place to insert it.This might just be a simple refactor to move where this happens. This might be refactoring this section of code to not use the context to store such important state. Something that probably shouldn't be optional. But, I also believe this brings up something that is generally more confusing. It can be unclear exactly where the entry point for invoking a flux program starts. We may want to also explore that and create new issues for refactoring this package in general. It is a very confusing package to follow the logic and has mostly existed the way it is for a long time without any modifications.
The text was updated successfully, but these errors were encountered: