Annotations - separate out plugins into top level folder #339

Merged
merged 23 commits into from
Jan 20, 2021

@wild-endeavor wild-endeavor commented Jan 17, 2021

This is, we hope, the last remaining item for the beta release: the flytekit plugin system. It needs to land before beta because it affects how users install and import plugin tasks (like Spark or PyTorch) and plugin types (like Pandera).

This PR attempts one possible solution, using namespace packages. For those unfamiliar, please take a look at this blog post first https://medium.com/@jherreras/python-microlibs-5be9461ad979 as this PR is heavily based on it. See PEP 420 for more info, and the existing vaex setup, which is where I found the blog post.
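For readers who have not used namespace packages, here is a minimal sketch of the mechanism, assuming plain PEP 420 behavior. The name `demoplugins` is hypothetical and stands in for `flytekitplugins` so the sketch cannot collide with anything installed; the temporary directories stand in for separately installed plugin distributions.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical namespace name, used instead of "flytekitplugins".
NAMESPACE = "demoplugins"


def make_plugin(root: Path, plugin: str) -> Path:
    """Create <root>/<plugin>/<NAMESPACE>/<plugin>/__init__.py.

    There is deliberately no __init__.py in the NAMESPACE folder itself;
    that omission is what makes it a PEP 420 namespace package.
    """
    dist_dir = root / plugin
    pkg_dir = dist_dir / NAMESPACE / plugin
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(f"NAME = {plugin!r}\n")
    return dist_dir


tmp = tempfile.TemporaryDirectory()
root = Path(tmp.name)

# Two independent "distributions", each contributing one subpackage.
for plugin in ("spark", "pytorch"):
    sys.path.insert(0, str(make_plugin(root, plugin)))
importlib.invalidate_caches()

spark = importlib.import_module(f"{NAMESPACE}.spark")
pytorch = importlib.import_module(f"{NAMESPACE}.pytorch")
namespace = importlib.import_module(NAMESPACE)

print(spark.NAME, pytorch.NAME)
# The namespace's __path__ has one entry per contributing directory.
print(len(list(namespace.__path__)))
```

Both subpackages import under the same top-level name even though they live in different directories, which is the property that lets each plugin ship as its own package.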

We rethought the plugin code structure with the following in mind.

  • When we remove the alpha/beta qualifiers on the new API, we'll likely release it as 0.16.x or 0.17, not 1.0. That is, we cannot break existing users (old plugin code imports should continue to work).
  • For example, we'd like pip install flytekit[spark] to continue to install the existing master-branch spark functionality. But we don't have to install the new spark stuff, based on the annotations branch.
  • For that, users should install something like flytekitplugins-spark.
  • Flytekit plugin code that is more split out, perhaps even in another repo, is cleaner because plugin systems should be more delineated and distinct. This separation ensures that.

This PR

  • Moves all the new annotations branch taskplugins code into separate flytekitplugins namespace package folders, under the new top-level folder (outside of the highest level flytekit folder). For all intents and purposes, you can think of this folder as a separate repo.
  • Moves all the tests for each plugin type into a separate tests folder in each plugin.
  • Adds a setup.py file for each plugin
  • Adds a not-quite-working, local-development-only (read: not a real PyPI package) setup.py for the top level flytekitplugins folder.
  • Changes the plugin pandas imports to not be from the flytekit lazy loader, but pandas for reals.
  • Adds a __post_init__ to the Spark task config dataclass to instantiate empty Spark and Hadoop configs if missing.
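The last item can be sketched roughly as follows. The two field declarations match the diff shown later in this thread; the body of `__post_init__` is an assumption about what "instantiate empty configs if missing" means, not the code that was merged.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Spark:
    # Field names and types as they appear in the diff under review.
    spark_conf: Optional[Dict[str, str]] = None
    hadoop_conf: Optional[Dict[str, str]] = None

    def __post_init__(self):
        # Assumed body: replace a missing config with a fresh empty dict so
        # downstream code can always treat both attributes as dicts.
        if self.spark_conf is None:
            self.spark_conf = {}
        if self.hadoop_conf is None:
            self.hadoop_conf = {}


default_config = Spark()
custom_config = Spark(spark_conf={"spark.executor.instances": "2"})
print(default_config)
print(custom_config)
```

Defaulting to `None` and filling in the dict afterwards avoids using a mutable default value directly on a dataclass field, and each instance gets its own dict.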

Upsides of doing things this way.

  • Enforces separation.
  • Promotes ease of new contributions.

Downsides of doing things this way.

  • Since separation is enforced, we need a lot more stuff. Each one will have its own setup.py, requirements file, etc. If you want to bump the minimum flytekit version (which now is an explicit dependency), you have to go in to each plugin's setup and bump the version.
  • Local development no longer works without installing. When writing flytekit, I will often export PYTHONPATH=:$PYTHONPATH to basically add the current directory to the list of places where Python searches for code. This is no longer possible for the plugins, because the location of the pytorch plugin is now plugins/pytorch/flytekitplugins/pytorch/blah.py. When using something defined there, if you just write from flytekitplugins.pytorch.blah import MyBlah it won't find it. It'd only work if you write from plugins.pytorch.flytekitplugins.pytorch.blah import MyBlah but of course you can't do that, because that import won't work after the plugin has been deployed.
  • This means you have to pip install -e . each plugin. The -e means development mode install, and it basically adds an egg link from your site-packages to the directory where you ran that command. This is fine, just more steps.
  • NB Some weirdness around tests. Note that the tests folder for the papermill plugin is called papermilltests under the plugins/tests folder. The tests were failing when the folder was just called plugins/tests/papermill, because when pytest runs, the papermill package that is found ends up being the papermill folder in the test folder itself. When in the real code we write import papermill as pm we end up getting this test package instead of the real papermill package.
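The test-folder shadowing described in the last item can be reproduced in isolation. This is a sketch under a stated assumption: that the test runner ends up placing the tests directory ahead of site-packages on `sys.path`. The package name `demolib_for_shadowing` is hypothetical and stands in for papermill.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package name standing in for papermill.
LIB = "demolib_for_shadowing"

tmp = tempfile.TemporaryDirectory()
root = Path(tmp.name)

# The "real" library, as it would sit in site-packages.
site_dir = root / "site-packages"
(site_dir / LIB).mkdir(parents=True)
(site_dir / LIB / "__init__.py").write_text("ORIGIN = 'real library'\n")

# A test folder that happens to share the library's name.
tests_dir = root / "plugins" / "tests"
(tests_dir / LIB).mkdir(parents=True)
(tests_dir / LIB / "__init__.py").write_text("ORIGIN = 'test folder'\n")

# Assumption: the tests directory is searched before site-packages.
sys.path.insert(0, str(site_dir))
sys.path.insert(0, str(tests_dir))
importlib.invalidate_caches()

shadowed = importlib.import_module(LIB)
print(shadowed.ORIGIN)
```

Renaming the test folder so it no longer matches the library name, as done with papermilltests, removes the collision without touching the import path.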

Additional notes.

  • Unclear if we should keep all the plugins as one repo, or have each plugin be its own github repo. I'm inclined to the former, as is the author of the blog (search for "keep each library in a single repository").
  • Even though pandas is large I'd like to propose that we bundle it with flytekit. In other words, I think the Schema type should be included when you do pip install flytekit. I don't think we should make users pip install flytekitplugins-schema -- it's just too central.

@wild-endeavor wild-endeavor changed the title Annotations imports [wip] Annotations imports Jan 17, 2021

microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = ["flytekit==0.16.0a2", "hmsclient>=0.0.1,<1.0.0"]

this should not be pinned right?

The new pip resolver has stopped accepting conflicting versions, and that means if plugin foo requires ==0.17.0 while plugin bar requires ==0.17.1, pip will not install both foo and bar.

I think having lower and upper bounds instead of pinned version is what we should do.
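A small sketch of what bounded requirements could look like in a plugin's setup file. The helper `bounded_requirement` is hypothetical, the lower bound is the version from the diff above, and the `<1.0.0` cap is an assumption rather than a decided policy.

```python
def bounded_requirement(name: str, lower: str, upper: str) -> str:
    """Build a version range such as 'flytekit>=0.16.0a2,<1.0.0'."""
    return f"{name}>={lower},<{upper}"


# With ranges instead of exact pins, two plugins can both be satisfied by
# any single flytekit release that falls inside the shared range.
plugin_requires = [
    bounded_requirement("flytekit", "0.16.0a2", "1.0.0"),  # cap is assumed
    "hmsclient>=0.0.1,<1.0.0",  # already a range in the original diff
]
print(plugin_requires)
```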

@@ -0,0 +1,32 @@
from setuptools import setup

PLUGIN_NAME = "sagemaker"

what about putting all aws related plugins in one place? this way the plugin is aws and we can have aws.sagemaker as the first thing in there?

Yeah sure, but I think this will get pretty big. In the future we should do aws[sagemaker]? but for now install everything.

i guess you are right, just keep sagemaker separate - then?

kumare3 commented Jan 17, 2021

@wild-endeavor - WOW! Thank you for doing this. A few comments.
Also, we have decided to keep flytekit as is, right? As in, we won't split flytekit into legacy and new?

honnix commented Jan 18, 2021

Some thoughts around downsides:

Since separation is enforced, we need a lot more stuff...

We can add a boilerplate script to bootstrap a new plugin, and another script to bump versions.

Local development no longer works without installing...

I guess as long as we document pip install -e . well, it should be totally fine. Actually, this is the only way I have been working, as it is more or less standard practice.

honnix commented Jan 18, 2021

Regarding repositories, I think keeping all plugins in one repo is sufficient, given that in most cases the plugins are relatively small. Having too many repos may require a lot more maintenance, e.g. versioning, CI.

I think it is also OK to postpone repo split at this moment, because that still seems like a luxury problem that we don't have to deal with immediately. Consider this as my Luigi maintenance experience, but of course it has very different setup. :)

version="0.1.0",
author="flyteorg",
author_email="admin@flyte.org",
description="Your microlib descriton",

Hive blabla


@@ -1,13 +1,13 @@
import pandas
import pytest
from plugins.hive.flytekitplugins.hive.task import HiveConfig, HiveSelectTask, HiveTask

I can understand this absolute import but still a bit strange.

oh no this is wrong, i'll fix.

plugins/setup.py Outdated
install_requires=[],
cmdclass={
'install': InstallCmd,
'develop': DevelopCmd,

I tried running pip install -e . under plugins but that didn't seem to install any package. Am I supposed to run python setup.py develop?

python setup.py develop doesn't work for me either. It logs everything being OK, but nothing gets installed, apart from flytekitplugins-parent.

NVM I recreated my virtualenv and it is now working.

wild-endeavor and others added 3 commits January 18, 2021 08:40
Co-authored-by: Honnix <honnix@users.noreply.github.com>
Co-authored-by: Honnix <honnix@users.noreply.github.com>
@wild-endeavor

So @honnix it sounds like you don't think this PR is a horrible idea? I'll work on cleaning it up and addressing all the other comments today if that's the case.

@katrogan katrogan left a comment

how does this work with the lazy loader stuff? can we refactor that too?

@wild-endeavor wild-endeavor changed the title [wip] Annotations imports Annotations imports Jan 19, 2021
@wild-endeavor wild-endeavor changed the title Annotations imports Annotations - separate out plugins into top level folder Jan 19, 2021

microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

plugin_requires = ["flytekit>=0.16.0a2"]

less than 1.0.0?

@@ -28,6 +28,13 @@ class Spark(object):
spark_conf: Optional[Dict[str, str]] = None
hadoop_conf: Optional[Dict[str, str]] = None

def __post_init__(self):

thank you

@@ -4,6 +4,7 @@

from flyteidl.plugins.sagemaker import hyperparameter_tuning_job_pb2 as _pb2_hpo_job
from flyteidl.plugins.sagemaker import parameter_ranges_pb2 as _pb2_params
from flytekitplugins.aws.training import SagemakerBuiltinAlgorithmsTask, SagemakerCustomTrainingTask

I did not mean rename sagemaker -> aws, I meant move sagemaker -> aws/sagemaker

4 participants