-
Notifications
You must be signed in to change notification settings - Fork 787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add_to_package
decorator hook.
#834
Conversation
valayDave
commented
Nov 23, 2021
- Cherry-picking diff from Added decorator hook for adding custom package #649 to reopen PR.
- Cherry picking diff from card branch.
cc @pjoshi30 |
@romain-intel / @savingoyal : please review when you get the time. |
metaflow/package.py
Outdated
file_path, _ = path_tuple | ||
# Check if the path is not duplicated as | ||
# many steps can have the same packages being imported | ||
if file_path not in deco_module_paths: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was commenting on this in the cards PR (not yet submitted) but what happens if multiple decorators want to include the same file but as different name? I am not sure whey this can happen but it seems a possibility if you expose the name (which is then ignored) to the plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point.
Some context to why I have kept the path_tuples
as a return type from add_to_package
:
- path_tuples consists of
path,arcname
arcname
is useful when we untar the package on a remote execution as the relative path in thearcname
makes a difference in how imports will happen on the remote container. Here is an example in the Cards PR where I needed to ensure correct path in the arcname to ensure remote imports works correctly.
Now coming to your point. Yes, I don't currently check for same file but as different name
and I can change that. I can check for both the file_path
as well as the arcname
by adding an object like arcname-filepath
in deco_module_paths
.
Does that resolve this issue ?
metaflow/decorators.py
Outdated
@@ -239,6 +239,12 @@ def package_init(self, flow, step_name, environment): | |||
""" | |||
pass | |||
|
|||
def add_to_package(self): | |||
""" | |||
Called to add custom packages needed for a decorator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be more descriptive about what this should return. It seems its a tuple of (path, name) although see my comment about the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick follow up, what problem does this PR help to solve? Is it an alternative to system pip install?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an alternative to pip install.
Context
This PR introduces a decorator hook that allows the decorator to include files in the metaflow code package. As decorators plug into different abstractions of the Metaflow stack, we may require custom package-related files based on the type of decorator we are implementing. This is where that hook becomes essential.
Example
In #815 the @card
decorator introduces the concept of custom card modules (metaflow_cards
). These modules are external packages that may be present on your local python path; Including those for remote execution requires those modules to be a part of the Metaflow code package ( or installed externally via @conda
)
Using this decorator hook, We detect and bundle along those modules with the metaflow code package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an alternative to pip install.
In #815 the @card decorator introduces the concept of custom card modules (metaflow_cards). These modules are external packages that may be present on your local python path; Including those for remote execution requires those modules to be a part of the Metaflow code package ( or installed externally via @conda)
how would one specify a custom package here that is not installable via conda
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused to how to answer this so I am decompressing what I wrote earlier. Apologies if this seems like an information dump or if I don't cover what you may be seeking for.
Just for clarification: As this is a decorator hook, these custom packages may be present to fulfill the requirements of the decorator. So they may not be a part of "userland" i.e the users don't need to explicitly specify the packages in the decorator invocation but rather the decorator automatically seeks them and includes them with metaflow code package.
These packages may have been pip installed into the local python environment or present in PYTHONPATH or lie somewhere in the filesystem. Continuing the example of @card
decorator, @card
decorator seeks for metaflow_cards
modules present in the local python path / pip packages. The metaflow_cards
module in the current context is a plugin-like module which behaves like a namespace package. As a part of the MF run lifecycle we seek these modules from the current python path / pip packages and include them in the code package. We can even install them via conda
if they are published on a conda
channel.
The main purpose of this hook is to include files to package before run starts ( code files that are collected through the MetaflowPackage
class ).
Why do we use this hook
The @card decorator uses other types of files except for python files such as html
, css
, and js
files to create a HTML that visualize the results of a Task
; This requires that we seek filetypes that are generally not included in the metaflow code package. Hence this hook helps seek and include them as a part of the code package. Even the custom metaflow_cards
modules that we install can have other types of files like html
, css
, js
, etc which get included by default because of this decorator hook.
Example of metaflow_cards
Namespace package.
As metaflow_cards
is a namespace package; we can install it's subpackages using pip install metaflow-cards-container-monitoring
; Once these packages are installed in the local python path or in pip library, they will be seeked by the card decorator and then included in the metaflow code package.
Gotacha's with directly seeking Installed Modules
As we are directly seeking modules and including them in the code package, the dependencies needed by these packages may not be covered ; This is something which we don't seek (at the moment)( haven't thought deeply upon this) .
Hope this addresses what you were seeking for ? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, thank you for the details! My initial understanding of this change was incorrect - I had thought this would allow anyone to specify a list of packages required by the decorator and would attempt to pip install
these list of packages. After talking to @romain-intel I have more context on this change. IIUC, this change includes the files required by the decorator in the code package. The files themselves could have been installed through one of several ways: pip
, conda
or simply copying the files on the FS.
While this doesn't solve the larger custom package and dependency management problem that we have, I think this work well for a certain set of use cases that we have internally.
Thanks for the PR and the explanation!
- Fixed comment.
@@ -239,6 +239,13 @@ def package_init(self, flow, step_name, environment): | |||
""" | |||
pass | |||
|
|||
def add_to_package(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide an example on how this could be used?
metaflow/decorators.py
Outdated
def add_to_package(self): | ||
""" | ||
Called to add custom packages needed for a decorator. | ||
Returns a list of tuples where each tuple represents (file_path,arcname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please detail what file_path
and arcname
here represent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MetaflowPackage
class consists of a path_tuples
function. This instance method helps seek the files to include for the code package. The path_tuples
compiles the tarball of all the code files to include for execution. The function returns a list of tuples which consists of file_path
and arcname
; The filepath
is the path of the file in the local FS to include and the arcname
is the name/path of the file in the tarball when it is unpacked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
metaflow/package.py
Outdated
file_path, file_name = path_tuple | ||
# Check if the path is not duplicated as | ||
# many steps can have the same packages being imported | ||
check_name = "%s-%s" % (file_path, file_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am being a bit too pedantic here but I would actually do something like this:
check_dict = {}
if file_name not in check_dict or check_dict[file_name] == file_path:
check_dict[file_name] = file_path
yield path_tuple
elif file_name in check_dict:
# Raise some error saying that we already included a file for that path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this change seems much better than what I committed but I have a small alteration to propose to it:
if file_name not in deco_module_paths:
deco_module_paths[file_name] = file_path
yield path_tuple
elif deco_module_paths[file_name] == file_path:
pass
elif file_name in deco_module_paths: # this means non unique file_name to file_path mapping.
raise NonUniqueFileNameToFilePathMappingException(
file_name, [deco_module_paths[file_name],file_path]
)
With the change you suggested we yield path_tuple
even when check_dict[file_name] == file_path
which we don't want to do to avoid deduplication; Does that make sense ?
Also Piggbacking to what you said :
It also seems necessary to raise an error for this as it violates the uniqueness constraint of file_name
to file_path
mapping. There can be scenarios that people have a python package of metaflow-cards-scorecard
installed via pip and the same is also present in the python path. In such cases we should fail with and error that file_name
related to the following file_path
s and needs to only have a unique file_name
to file_path
mapping.
Just a QQ:
What should be a name for this exception (Naming things is hard 😄 ) I have for now called it NonUniqueFileNameToFilePathMappingException
metaflow/package.py
Outdated
yield path_tuple | ||
elif deco_module_paths[file_name] == file_path: | ||
pass | ||
elif ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should replace just with "else:"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a just else
we will throw an Exception even if deco_module_paths[file_name] == file_path
(which we don't want to) . Rather I can change it to elif deco_module_paths[file_name] != file_path
; this will keep logic the same
@@ -148,7 +149,8 @@ digraph Metaflow { | |||
/* package */ | |||
validate_dag -> init_environment | |||
init_environment -> package_init | |||
package_init -> add_to_package | |||
package_init -> add_custom_package | |||
add_custom_package -> add_to_package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valayDave Can you also regenerate the new image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix:
- Aapo's comment
- the tiny format issue
- regenerate the image
I will then merge.
metaflow/decorators.py
Outdated
called in the `MetaflowPackage` class where metaflow compiles the code package | ||
tarball. This hook is invoked in the `MetaflowPackage`'s `path_tuples` | ||
function. The `path_tuples` function is a generator that yields a tuple of | ||
`(file_path,arcname)`.`file_path` is the path of the file in the local file system; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: space after comma
metaflow/decorators.py
Outdated
def add_to_package(self): | ||
""" | ||
Called to add custom packages needed for a decorator. | ||
Returns a list of tuples where each tuple represents (file_path,arcname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
metaflow/package.py
Outdated
yield path_tuple | ||
elif deco_module_paths[file_name] == file_path: | ||
pass | ||
elif ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
- nit fixes (spaces after commas) - lifecycle PNG updated.
@romain-intel : Made the changes. We are merge ready |