Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add_to_package decorator hook. #834

Merged
merged 7 commits into from
Dec 9, 2021

Conversation

valayDave
Copy link
Collaborator

- Cherry picking diff from card branch.
@valayDave
Copy link
Collaborator Author

cc @pjoshi30

@valayDave
Copy link
Collaborator Author

@romain-intel / @savingoyal : please review when you get the time.

file_path, _ = path_tuple
# Check if the path is not duplicated as
# many steps can have the same packages being imported
if file_path not in deco_module_paths:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was commenting on this in the cards PR (not yet submitted) but what happens if multiple decorators want to include the same file but as different name? I am not sure whey this can happen but it seems a possibility if you expose the name (which is then ignored) to the plugins.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point.

Some context to why I have kept the path_tuples as a return type from add_to_package:

  1. path_tuples consists of path,arcname
  2. arcname is useful when we untar the package on a remote execution as the relative path in the arcname makes a difference in how imports will happen on the remote container. Here is an example in the Cards PR where I needed to ensure correct path in the arcname to ensure remote imports works correctly.

Now coming to your point. Yes, I don't currently check for same file but as different name and I can change that. I can check for both the file_path as well as the arcname by adding an object like arcname-filepath in deco_module_paths.

Does that resolve this issue ?

@@ -239,6 +239,12 @@ def package_init(self, flow, step_name, environment):
"""
pass

def add_to_package(self):
"""
Called to add custom packages needed for a decorator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be more descriptive about what this should return. It seems its a tuple of (path, name) although see my comment about the name.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick follow up, what problem does this PR help to solve? Is it an alternative to system pip install?

Copy link
Collaborator Author

@valayDave valayDave Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an alternative to pip install.

Context

This PR introduces a decorator hook that allows the decorator to include files in the metaflow code package. As decorators plug into different abstractions of the Metaflow stack, we may require custom package-related files based on the type of decorator we are implementing. This is where that hook becomes essential.

Example

In #815 the @card decorator introduces the concept of custom card modules (metaflow_cards). These modules are external packages that may be present on your local python path; Including those for remote execution requires those modules to be a part of the Metaflow code package ( or installed externally via @conda)

Using this decorator hook, We detect and bundle along those modules with the metaflow code package.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an alternative to pip install.
In #815 the @card decorator introduces the concept of custom card modules (metaflow_cards). These modules are external packages that may be present on your local python path; Including those for remote execution requires those modules to be a part of the Metaflow code package ( or installed externally via @conda)

how would one specify a custom package here that is not installable via conda?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused to how to answer this so I am decompressing what I wrote earlier. Apologies if this seems like an information dump or if I don't cover what you may be seeking for.

Just for clarification: As this is a decorator hook, these custom packages may be present to fulfill the requirements of the decorator. So they may not be a part of "userland" i.e the users don't need to explicitly specify the packages in the decorator invocation but rather the decorator automatically seeks them and includes them with metaflow code package.

These packages may have been pip installed into the local python environment or present in PYTHONPATH or lie somewhere in the filesystem. Continuing the example of @card decorator, @card decorator seeks for metaflow_cards modules present in the local python path / pip packages. The metaflow_cards module in the current context is a plugin-like module which behaves like a namespace package. As a part of the MF run lifecycle we seek these modules from the current python path / pip packages and include them in the code package. We can even install them via conda if they are published on a conda channel.

The main purpose of this hook is to include files to package before run starts ( code files that are collected through the MetaflowPackage class ).

Why do we use this hook

The @card decorator uses other types of files except for python files such as html , css, and js files to create a HTML that visualize the results of a Task; This requires that we seek filetypes that are generally not included in the metaflow code package. Hence this hook helps seek and include them as a part of the code package. Even the custom metaflow_cards modules that we install can have other types of files like html , css, js, etc which get included by default because of this decorator hook.

Example of metaflow_cards Namespace package.

As metaflow_cards is a namespace package; we can install it's subpackages using pip install metaflow-cards-container-monitoring; Once these packages are installed in the local python path or in pip library, they will be seeked by the card decorator and then included in the metaflow code package.

Gotacha's with directly seeking Installed Modules

As we are directly seeking modules and including them in the code package, the dependencies needed by these packages may not be covered ; This is something which we don't seek (at the moment)( haven't thought deeply upon this) .

Hope this addresses what you were seeking for ? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thank you for the details! My initial understanding of this change was incorrect - I had thought this would allow anyone to specify a list of packages required by the decorator and would attempt to pip install these list of packages. After talking to @romain-intel I have more context on this change. IIUC, this change includes the files required by the decorator in the code package. The files themselves could have been installed through one of several ways: pip, conda or simply copying the files on the FS.

While this doesn't solve the larger custom package and dependency management problem that we have, I think this work well for a certain set of use cases that we have internally.

Thanks for the PR and the explanation!

@@ -239,6 +239,13 @@ def package_init(self, flow, step_name, environment):
"""
pass

def add_to_package(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide an example on how this could be used?

def add_to_package(self):
"""
Called to add custom packages needed for a decorator.
Returns a list of tuples where each tuple represents (file_path,arcname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please detail what file_path and arcname here represent

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MetaflowPackage class consists of a path_tuples function. This instance method helps seek the files to include for the code package. The path_tuples compiles the tarball of all the code files to include for execution. The function returns a list of tuples which consists of file_path and arcname; The filepath is the path of the file in the local FS to include and the arcname is the name/path of the file in the tarball when it is unpacked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

file_path, file_name = path_tuple
# Check if the path is not duplicated as
# many steps can have the same packages being imported
check_name = "%s-%s" % (file_path, file_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am being a bit too pedantic here but I would actually do something like this:

check_dict = {}
if file_name not in check_dict or check_dict[file_name] == file_path:
  check_dict[file_name] = file_path
  yield path_tuple
elif file_name in check_dict:
  # Raise some error saying that we already included a file for that path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this change seems much better than what I committed but I have a small alteration to propose to it:

if file_name not in deco_module_paths:
    deco_module_paths[file_name] = file_path
    yield path_tuple
elif deco_module_paths[file_name] == file_path:
    pass
elif file_name in deco_module_paths: # this means non unique file_name to file_path mapping. 
    raise NonUniqueFileNameToFilePathMappingException(
        file_name, [deco_module_paths[file_name],file_path]
    )

With the change you suggested we yield path_tuple even when check_dict[file_name] == file_path which we don't want to do to avoid deduplication; Does that make sense ?

Also Piggbacking to what you said :

It also seems necessary to raise an error for this as it violates the uniqueness constraint of file_name to file_path mapping. There can be scenarios that people have a python package of metaflow-cards-scorecard installed via pip and the same is also present in the python path. In such cases we should fail with and error that file_name related to the following file_paths and needs to only have a unique file_name to file_path mapping.

Just a QQ:
What should be a name for this exception (Naming things is hard 😄 ) I have for now called it NonUniqueFileNameToFilePathMappingException

@valayDave valayDave mentioned this pull request Nov 26, 2021
13 tasks
yield path_tuple
elif deco_module_paths[file_name] == file_path:
pass
elif (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should replace just with "else:"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a just else we will throw an Exception even if deco_module_paths[file_name] == file_path (which we don't want to) . Rather I can change it to elif deco_module_paths[file_name] != file_path; this will keep logic the same

@@ -148,7 +149,8 @@ digraph Metaflow {
/* package */
validate_dag -> init_environment
init_environment -> package_init
package_init -> add_to_package
package_init -> add_custom_package
add_custom_package -> add_to_package
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valayDave Can you also regenerate the new image?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix:

  • Aapo's comment
  • the tiny format issue
  • regenerate the image
    I will then merge.

called in the `MetaflowPackage` class where metaflow compiles the code package
tarball. This hook is invoked in the `MetaflowPackage`'s `path_tuples`
function. The `path_tuples` function is a generator that yields a tuple of
`(file_path,arcname)`.`file_path` is the path of the file in the local file system;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space after comma

def add_to_package(self):
"""
Called to add custom packages needed for a decorator.
Returns a list of tuples where each tuple represents (file_path,arcname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

yield path_tuple
elif deco_module_paths[file_name] == file_path:
pass
elif (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

- nit fixes (spaces after commas)
- lifecycle PNG updated.
@valayDave
Copy link
Collaborator Author

@romain-intel : Made the changes. We are merge ready

@romain-intel romain-intel merged commit 5305899 into Netflix:master Dec 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants