Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to Specify a "root" for DataCatalog.from_config #2965

Closed
noklam opened this issue Aug 23, 2023 · 10 comments
Closed

Add ability to Specify a "root" for DataCatalog.from_config #2965

noklam opened this issue Aug 23, 2023 · 10 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@noklam
Copy link
Contributor

noklam commented Aug 23, 2023

Description

Related #2924 (comment)

Context

conf_catalog = config_loader["catalog"] # config
conf_catalog = _convert_paths_to_absolute_posix(Path("../../").resolve(), conf_catalog) # config/catalog
catalog = DataCatalog.from_config(conf_catalog) # catalog
catalog.load("example_data")

Currently using context.catalog (framework) wlil do some magic path conversion. When user use DataCatalog alone they don't have the ability to define where is the "root" directory.

It contradicts with kedro suggesting notebooks/ and data. Because if notebook is inside notebooks, you would need to use ../ inside the catalog.yml. This is bad because it prevent the catalog.yml to be reused and assume that all the notebook will has to be in the same folder (preventing multi-users sharing 1 catalog)

Requirements

This should be:

  • backward compatible - no impact on existing framework user
  • notebook/library user should be able to configure the "root" of the data source

Consider:

To note, even if using a root, the converted keys are still hardcoded, so probably we should do something about it too (say that some user wants to call it file_path) #2965 (comment)

Possible Implementation

See detail discussion

DataCatalog.from_config(root="../data") 

the new_argument could be name data_source, root etc (undecided, this should be taken into consideration when implemented)
#2924 (comment)

Possible Alternatives

@noklam noklam added the Issue: Feature Request New feature or improvement to existing feature label Aug 23, 2023
@noklam noklam changed the title Add ability to specific a "root" for DataCatalog Add ability to Specify a "root" for DataCatalog Aug 23, 2023
@astrojuanlu
Copy link
Member

To note, even if using a root, the converted keys are still hardcoded, so probably we should do something about it too (say that some user wants to call it file_path)

@noklam noklam changed the title Add ability to Specify a "root" for DataCatalog Add ability to Specify a "root" for DataCatalog.from_config Aug 24, 2023
@astrojuanlu
Copy link
Member

xref #1934

@noklam
Copy link
Contributor Author

noklam commented Aug 30, 2023

cross reference an user question - pinging the magic bot to convert this into linen link...

Vincent Liagre
9 minutes ago
[Finished editing]
Hello team ;
I am trying to use Kedro Catalog and IO functions in standalone mode (i.e. w/o using a full Kedro structure). However I have difficulties "anchoring" the ConfigLoader object ; let me give an example (referencing file from root) :

@astrojuanlu
Copy link
Member

In our internal discussions it was noted by @yetudada that, regardless of what name we give to this parameter, we would still be making DataCatalog.from_config perform 2 different operations:

  1. Turn the JSON config into the appropriate initialization parameters (current behavior) and
  2. Convert the local filepaths to absolute (new, proposed behavior)

@yetudada pointed out that Intake has a {{ CATALOG_DIR }} template variable they use in front of relative paths https://intake.readthedocs.io/en/latest/catalog.html#yaml-format however, it was already noted in our Tech Design session #2924 (comment) that this requires a config loader to work, which is something we don't want.

Personally I don't think it would be productive to rehash the discussion we already had, since we explored and evaluated a number of options and no new ideas have been presented. It might be the case that no perfect, clean solution exists.

On the other hand, my initial concern that the list of hardcoded key names in the session is still not transparent #2965 (comment) hasn't been addressed yet. As a reminder, it is

# only check a few conf keys that are known to specify a path string as value
conf_keys_with_filepath = ("filename", "filepath", "path")

So, to make DataCatalog.from_config(..., local_root="...") do the conversion, we'd need to have that list of hardcoded keys as well. Alternatives like DataCatalog.from_config(..., local_filepath_keys=("filename", "filepath", "path")) could be considered.

@astrojuanlu
Copy link
Member

Another comment from @Yetunde: #2819 (comment)

  • What is the most intuitive way to load relative paths in catalog.yml? source="../data" is really confusing and I'm not sure intuitive naming will help this. Additionally, we would have to teach this concept to new users of Kedro because we only teach Kedro using local data. And, this concept only appears for local data; our experience is not consistent when compared to how we would use s3 or Azure Blob Storage.
  • How do other libraries handle relative paths in configuration?

@jasonmhite
Copy link
Contributor

jasonmhite commented Oct 10, 2023

Forgive me for rehashing as I try to wrap my head around this, but if I understand things correctly, it seems like the core of this issue is that the code for loading the data catalog doesn't have access to the project configuration, or at least sometimes it will not? And this is a deliberate design decision because the goal is to decouple the catalog instantiation from the runtime?

It seems this is most often going to be an issue with kedro ipython and/or kedro jupyter since those are the most likely suspects where the interpreter is going to change from the project root directory (especially Jupyter, since if you click into the notebooks directory it always changes directories). A crude but effective solution could be to simply add to the initialization magic for jupyter/ipython something like os.chdir(context.project_path) or likewise.

I can't say I fully understand what's going on under the hood, but generally project_path seems special apart from the other project configuration. Having a way to access it anywhere in a kedro invocation (be it in a pipeline, in a dataset implementation, etc) seems like a valid case for special treatment. Perhaps not load an entire configuration, but a function to discover the current root would help at least my use case (#3149). It seems like accessing the project context used to be possible in older versions but I think this was removed/refactored? I had a bit of a hard time following that part of the code.

@astrojuanlu
Copy link
Member

hi @jasonmhite , thanks for chiming in:

it seems like the core of this issue is that the code for loading the data catalog doesn't have access to the project configuration, or at least sometimes it will not? And this is a deliberate design decision because the goal is to decouple the catalog instantiation from the runtime?

that's correct.

We're trying to solve for a set of requirements that is very difficult to satisfy, namely:

  1. The solution must make using DataCatalog easier to use as a standalone component, and enable users to easily share catalog.yml files.
    • This means that, for catalog files containing local paths to be easily shareable, they should not use absolute paths (the reproducibility killer). But relative paths depend on the current working directory and where the file is placed. That's the main problem we're solving for.
    • Notice that this does not affect database connections nor remote files. Which, I've been told, is the large majority of cases for professional teams.
  2. The solution must not introduce coupling between DataCatalog and other objects (like OmegaConfigLoader) or Kedro framework assumptions.
    • This rules out solutions that involve having a template variable {{ project_root }} because it would make the DataCatalog responsible of resolving that variable.
  3. The solution should be backwards compatible with existing catalogs. Existing catalogs all use relative paths for local files. This relative path is mangled by Kedro framework, whereas the DataCatalog object doesn't do anything about it.
    • This again rules out solutions that involve having a template variable {{ project_root }} because no existing Kedro catalogs use it.

@jasonmhite
Copy link
Contributor

jasonmhite commented Oct 11, 2023

@astrojuanlu Is it possible to push it into the Dataset implementation? You could extend the base class to add a member that tries to resolve the project root directory if it is available from the context , otherwise resolve to the cwd. Then it becomes the responsibility of the implementation to actually use that base directory. It wouldn't fix the issue automatically, but at least it would then be possible to update a dataset implementation to handle things correctly and wouldn't break anything.

@astrojuanlu
Copy link
Member

astrojuanlu commented Oct 20, 2023

We discussed this again today with @yetudada, @idanov and @merelcht.

This is how the error manifests:

image

@jasonmhite we considered your idea but we thought that we'd rely on all datasets conforming to an implicit convention, which might introduce confusion the moment someone doesn't implement that extra parameter for the datasets.

Historical perspective: the paths are converted to absolute because of https://jira.quantumblack.com/browse/KED-1796 (internal link)

  1. Kedro expands relative local paths from catalog using pathlib.Path, which results in a WindowsPath instance on Windows
  2. This WindowsPath is stringified and is then passed to PickleDataSet, which in turn converts it to PurePosixPath and passes it to AbstractVersionedDataSet initialiser
  3. WindowsPath -> str -> PurePosixPath breaks things and should rather be something like WindowsPath.to_posix() -> PurePosixPath
  4. Ideally Kedro should not pass (potential) Windows paths around as strings

More relevant issues: #412

We decided to try to improve the error message in this case, and if we do add a parameter then give it a name that relates to the Current Working Directory.

Note that this error doesn't even appear if catalog.yml and the notebook or script and the data are all in the same working directory.

@astrojuanlu
Copy link
Member

Decision turned into #3248, closing this ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

No branches or pull requests

3 participants