
Understanding namespaces

Juan Luis Cano Rodríguez edited this page Nov 11, 2024 · 1 revision

(From https://github.com/McK-Private/private-kedro/issues/752#issuecomment-736680109)

Essentially, what we are talking about is a way to group elements of a graph and to combine graphs into new graphs. I will try to explain this once and for all and hopefully dispel any remaining confusion about the topic. I will add a list of the main questions raised here and my answers to them.

Establishing common understanding of concepts

  • Kedro pipelines are graphs with two types of vertices - nodes and datasets
  • Users are mainly concerned with the nodes
    • they use the datasets as a way to express dependencies between nodes
    • you cannot create a node without specifying what datasets it depends on
    • a node is defined by its input datasets, python function and output datasets
  • A pipeline can contain only unique nodes, i.e. there cannot be more than one node with the same
    • inputs
    • function
    • outputs
  • In fact, pipelines are even more strict
    • no nodes can share an output
    • this is a requirement in data pipelines to ensure data determinism
    • so pipelines are collections of unique nodes
    • collection of unique elements is a set
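The "pipeline is a set of nodes" idea can be sketched in plain Python. This is a toy model, not Kedro's implementation: the `Node` dataclass, `make_pipeline` and the node names are all hypothetical, and the sketch simply collapses an identical node listed twice, without modelling how Kedro itself reacts to a duplicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy stand-in for a node: input datasets, a function name, output datasets."""
    inputs: tuple
    func: str
    outputs: tuple


def make_pipeline(nodes):
    """Return a frozenset of nodes, rejecting any output produced by two nodes."""
    unique = frozenset(nodes)  # identical nodes collapse into one element
    producers = {}
    for node in unique:
        for output in node.outputs:
            if output in producers:
                raise ValueError(f"output {output!r} is produced by more than one node")
            producers[output] = node
    return unique


clean = Node(("raw",), "clean", ("cleaned",))
train = Node(("cleaned",), "train", ("model",))

# The same node listed twice is still a single element of the set.
pipe = make_pipeline([clean, train, clean])

# A different node that writes to an existing output is rejected.
try:
    make_pipeline([clean, Node(("raw",), "clean_v2", ("cleaned",))])
    conflict = False
except ValueError:
    conflict = True
```

Because the nodes are hashable and compared by value, set semantics come for free; the only extra rule that needs code is the "no shared outputs" check.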

Having established that a pipeline is a set of nodes, I can now explain the concept of modular pipelines much more easily.

Reconciling the mental model for pipelines

To make the explanation crystal clear, I will use an analogy that hopefully everyone understands: a computer file system. Here are the parallels I will draw (~= means equivalent):

  • node ~= file
  • pipeline ~= folder - there are two types of folders
    • modular pipelines (with a namespace) ~= named folder
    • python variable ~= temp folder
  • top-level pipelines ~= all the folders at the root of your computer, i.e. /Applications, /Users, /Library, /System and others on macOS
  • __default__ = de + ds ~= copying the files and folders from the temp folder de and the files and folders from the temp folder ds into the folder __default__
  • __default__ = pipeline(de, namespace='de') + ds ~= copying the whole folder de and the files and folders from ds into __default__
  • __default__ = pipeline(de, namespace='de') + pipeline(ds, namespace='ds') ~= copying folders de and ds into __default__
  • modular pipeline created with kedro pipeline create ~= a physical folder with paper documents (files) you haven't scanned into your computer yet
  • kedro pipeline list ~= ls -d /*/, or list all directories in the root of your computer
  • having de, ds and __default__ = de + ds as top-level pipelines ~= you have three folders in your computer root, de, ds and __default__ where __default__ contains copies of the files of the other two folders
  • having de, ds and __default__ = pipeline(de, namespace='de') + pipeline(ds, namespace='ds') as top-level pipelines ~= you have three folders, de, ds and __default__ where __default__ contains copies of the other two folders
  • Kedro Viz now ~= find -L /<selected-folder>, or all files of a selected folder at the computer root
  • Kedro Viz showing modular pipelines (goal of the modular pipeline visualisation workstream) ~= ls /<selected-folder>, or listing all files and folders in /<selected-folder> without recursively showing the files in the folders
  • Kedro Viz showing expanded modular pipeline (future) ~= showing all files in /<selected-folder> + all unexpanded folders in /<selected-folder> + all files and folders of /<selected-folder>/<expanded-folder>/
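The three `__default__` variants in the analogy can be played out with sets of paths. This is only an illustration of the folder analogy under assumed names: `namespaced` and `top_level` are hypothetical helpers, node paths use `/` purely to mirror a file system, and none of this calls Kedro.

```python
def namespaced(nodes, namespace):
    """Prefix every node path with a namespace, like copying a whole folder."""
    return {f"{namespace}/{path}" for path in nodes}


def top_level(nodes):
    """List only the first path segment, like `ls` on the selected folder."""
    return sorted({path.split("/", 1)[0] for path in nodes})


# Hypothetical node names; each pipeline is a plain set of node paths.
de = {"clean", "join"}
ds = {"train", "predict"}

flat = de | ds                                        # de + ds
mixed = namespaced(de, "de") | ds                     # pipeline(de, namespace='de') + ds
nested = namespaced(de, "de") | namespaced(ds, "ds")  # both namespaced

print(top_level(flat))    # only loose files
print(top_level(mixed))   # one folder plus loose files
print(top_level(nested))  # two folders
```

`sorted(nested)` plays the role of the recursive `find` view, while `top_level(nested)` plays the role of the non-recursive `ls` view.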

Questions or statements (both marked as Q) from this discussion that need addressing

Q: In the current iteration of the pipeline selection feature, we want to allow users to select a pipeline by name or select a modular pipeline.

A: I have no access to the Zeplin board and wasn't in the meeting, but if that is what came out of it, this is neither a real requirement nor a pressing need at the moment. There are three high-level use cases for Viz that are being conflated here:

  1. Select a top-level pipeline to explore ✅
  2. Understand what modular pipelines the top-level pipeline consists of (and potentially expand/fold that to simplify the diagram) 💭
  3. See what modular pipelines they can use when developing their top-level pipeline ⛔

Use case 1. is already implemented, use case 2. is what we need to solve now. Use case 3. is very rarely brought up and at the moment Brix is taking care of that. Moreover for use case 3. to be successful, we need use case 2. done first.

Q: We don't keep track of pipelines' hierarchy

A: That's incorrect, we do keep the hierarchy through the namespace property of nodes (or at the time of writing, through the automated naming of the nodes when they use @lorenabalan's work on modular pipelines pipeline(...)). In your example of __default__ = de + ds there is no hierarchy to keep, since this is just getting the pipeline nodes and not actually using modular pipelines at all. However, if de and ds both have namespaced nodes in there, that information is preserved even after the + operation. If de and ds each contain nodes under the same namespace, e.g. a namespace named wonderful_namespace, then after we merge the pipelines there will be only one wonderful_namespace containing nodes from both ds and de.
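A small sketch of the merging behaviour described above, assuming nodes are reduced to hypothetical `(namespace, name)` pairs and `+` is modelled as set union. The helper `group_by_namespace` and the node names are made up for illustration; this is not how Kedro stores nodes.

```python
from collections import defaultdict


def group_by_namespace(nodes):
    """Map each namespace (None for un-namespaced nodes) to its sorted node names."""
    groups = defaultdict(list)
    for namespace, name in nodes:
        groups[namespace].append(name)
    return {namespace: sorted(names) for namespace, names in groups.items()}


# Hypothetical nodes, written as (namespace, name) pairs.
de = {("wonderful_namespace", "clean"), (None, "join")}
ds = {("wonderful_namespace", "train"), (None, "report")}

merged = de | ds  # toy equivalent of de + ds
groups = group_by_namespace(merged)
```

The namespace travels with each node, so the union needs no extra bookkeeping: nodes from both pipelines end up under a single `wonderful_namespace` group.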

Q: We don't expose modular pipelines at the moment

A: We do and we don't. I believe this is the crux of the confusion: we don't expose modular pipelines on their own, but we do expose their usage in a top-level pipeline. The communication challenge we have is that we refer to a modular pipeline in two contexts:

  • the source code of the pipeline, with all its functions and its definition
  • its copy or instance that is being used in a top-level pipeline

It is very similar to the confusion someone new to OOP might have when talking about a class and a class instance. Maybe we should adopt a similar terminology here, modular pipeline and modular pipeline instance?

So it seems to me that over the last months, when talking about modular pipelines, we have been mixing these up a lot, and that has resulted in a confused mental model that needs reconciling.

Q: One question for the backend team, how difficult it would be to implement a hierarchical structure for those modular pipelines?

A: We already have the hierarchical structure, and that is the namespaces, a.k.a. modular pipelines. It was already implemented when this conversation started. Modular pipelines are a hierarchical grouping of nodes; tags are a non-hierarchical grouping of nodes. Theoretically, you can use tags to achieve modular pipelines, but in practice having a distinction between hierarchical and non-hierarchical grouping of nodes makes a huge difference in usability and helps avoid the invalid states of overlapping hierarchies that can occur when using tags.
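To make the "overlapping hierarchies" point concrete, here is a toy comparison. It assumes nested namespaces are written as dot-separated prefixes and models each group as a set of node names; `namespace_ancestors`, `overlapping` and all group contents are hypothetical.

```python
def namespace_ancestors(namespace):
    """Return every enclosing group of a dotted namespace, outermost first."""
    parts = namespace.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def overlapping(groups):
    """Return pairs of groups that share members without one containing the other."""
    names = sorted(groups)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a, b = groups[first], groups[second]
            if a & b and not (a <= b or b <= a):
                pairs.append((first, second))
    return pairs


# Tags: a node may carry several, so groups can partially overlap.
tag_groups = {"de": {"clean", "join"}, "ds": {"join", "train"}}

# Namespaces: a node has one namespace, so groups are either disjoint or nested.
namespace_groups = {"ds": {"encode", "train"}, "ds.features": {"encode"}}
```

With tags, `join` sits in two groups that are not nested in each other, which cannot be drawn as a tree. With namespaces, every node has exactly one chain of ancestors, so that state cannot arise.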

Q: Also, could we make it possible to select only the modular pipelines with namespaces in viz, since we currently cannot identify the ones without?

A: In order to be given a namespace, you need to use your modular pipeline within another pipeline. That is how a modular pipeline becomes a modular pipeline instance. So the modular pipelines that have namespaces will already be part of a top-level pipeline and you can find them when you select a top-level pipeline. When you decide to select a modular pipeline, you have already selected a top-level pipeline and you would like to choose from the modular pipelines it uses. However, each of those modular pipelines can be using other modular pipelines, and in fact those can continue to an arbitrary depth. If the plan is to have one dropdown with all modular pipeline instances, that would be hard to navigate and not great for UX. Virtually all hierarchical systems use something like a tree explorer structure, e.g. file system explorers in macOS, Windows and Linux. Here is an example I have created in macOS using pipeline names and node names for folders and files:

[Image: macOS file explorer tree with pipelines as folders and nodes as files]
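The same kind of tree explorer view can be sketched in a few lines of Python. This assumes nested namespaces are written as dot-separated prefixes of the node name; `build_tree`, `render` and the node paths are hypothetical and do not reflect Kedro Viz code.

```python
def build_tree(node_paths):
    """Nest dotted paths into dicts; a leaf (node) maps to None."""
    tree = {}
    for path in node_paths:
        *namespaces, node_name = path.split(".")
        level = tree
        for namespace in namespaces:
            level = level.setdefault(namespace, {})
        level[node_name] = None
    return tree


def render(tree, indent=0):
    """Return explorer-style lines: folders end with '/', files do not."""
    lines = []
    for name in sorted(tree):
        child = tree[name]
        lines.append("  " * indent + (name + "/" if child is not None else name))
        if child is not None:
            lines.extend(render(child, indent + 1))
    return lines


# Hypothetical node paths inside one selected top-level pipeline.
paths = ["de.clean", "de.join", "ds.features.encode", "ds.train", "report"]
listing = render(build_tree(paths))
print("\n".join(listing))
```

The recursion handles arbitrary depth, which is exactly what a single flat dropdown of modular pipeline instances cannot express.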

Q: For example, if a modular pipeline is used in 2 different top-level pipelines, how would this look like in viz?

A: Viz visualises only modular pipeline instances, so there is no problem visualising a modular pipeline used in 2 different top-level pipelines, since they will be different instances by definition. This is also a non-problem because you can visualise only one top-level pipeline at a time, in the very same way you can open only one folder in a single file browser window.

Q: How to visualise that two modular pipeline instances are originating from the same modular pipeline?

This is a preemptive question I ask and answer myself, possibly giving a bit more clarity on the real requirements for what needs to be visualised. As I understand it, a lot of the discussion here was motivated by this question without anyone stating it explicitly.

A: The short answer is that we don't need to show this, in the very same way that we don't show how two nodes use the same function. Moreover, this is a premature problem to solve, since we currently have no way to visualise modular pipeline instances at all and we should rather focus on that problem first. When we are done with it, we will know better if we need to show the origin of the modular pipeline instances. If it turns out we need this, we can draw inspiration from file browsers again in the way they visualise symlinks or shortcuts, or eventually resort to coloring (if @GabrielComymQB finds that appropriate).

Q: Namespaces are (almost) undesirable for free input/output datasets of a modular pipeline (to enable the connection between the modular pipeline itself and whatever the nodes it's wired to). Does it affect the We don't keep track of pipelines' hierarchy question?

A: When a dataset is used to connect a modular pipeline to other nodes outside of it, it usually shouldn't have a namespace indeed. However, this does not affect the pipeline hierarchy, since a pipeline is equivalent to a set of nodes and the grouping is determined only by the namespaces of the nodes and not the datasets. The only possible effect is that if the same output name is produced by two unrelated pipelines, or by a modular pipeline used twice, then Kedro will error out and will not be able to construct the pipeline. That does not affect Kedro Viz in the slightest though, since Kedro Viz will be given only valid pipelines.
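A rough sketch of that last point, under assumed conventions: nodes are hypothetical `(name, inputs, outputs)` tuples, `instantiate` prefixes names with a dotted namespace except for datasets listed in `free`, and `check_unique_outputs` stands in for the validation that makes pipeline construction fail. None of these helpers are Kedro functions.

```python
def instantiate(nodes, namespace, free):
    """Copy a modular pipeline under a namespace; datasets in `free` keep their names."""
    def rename(dataset):
        return dataset if dataset in free else f"{namespace}.{dataset}"

    return [
        (f"{namespace}.{name}", tuple(map(rename, inputs)), tuple(map(rename, outputs)))
        for name, inputs, outputs in nodes
    ]


def check_unique_outputs(nodes):
    """Raise ValueError if two nodes write the same dataset."""
    seen = set()
    for _, _, outputs in nodes:
        for dataset in outputs:
            if dataset in seen:
                raise ValueError(f"dataset {dataset!r} is written by more than one node")
            seen.add(dataset)


# Hypothetical modular pipeline: (node name, inputs, outputs).
template = [("clean", ("raw",), ("cleaned",)), ("score", ("cleaned",), ("scores",))]

# 'raw' stays free so both instances read the same upstream dataset.
a = instantiate(template, "a", free={"raw"})
b = instantiate(template, "b", free={"raw"})
check_unique_outputs(a + b)  # passes: every output carries its instance's namespace

# Leaving the output 'scores' free in both instances makes them collide.
try:
    check_unique_outputs(
        instantiate(template, "a", free={"scores"})
        + instantiate(template, "b", free={"scores"})
    )
    collided = False
except ValueError:
    collided = True
```

Note that the node names still carry their namespace in both cases, so the grouping of nodes is unaffected by which datasets are left free.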
