ENH Register models trained outside of Civis Platform (#242)
If you train a scikit-learn-compatible estimator outside of Civis Platform, you can use ModelPipeline.register_pretrained_model to upload it to Civis Platform and prepare it for scoring with CivisML. A new Custom Script introspects the metadata CivisML needs and makes itself look enough like a CivisML training job that it can be used as input to a scoring job.
Stephen Hoover authored Mar 26, 2018
1 parent ee2947f commit 242d2e2
Showing 3 changed files with 187 additions and 8 deletions.
153 changes: 150 additions & 3 deletions civis/ml/_model.py
@@ -42,8 +42,11 @@
9112: 9113, # v1.1
8387: 9113, # v1.0
7020: 7021, # v0.5
11028: 10616, # v2.2 registration CHANGE ME
}
_CIVISML_TEMPLATE = None # CivisML training template to use
REGISTRATION_TEMPLATES = [11028, # v2.2 CHANGE ME
]


class ModelError(RuntimeError):
@@ -631,10 +634,10 @@ class ModelPipeline:
See :func:`~civis.resources._resources.Scripts.post_custom` for
further documentation about email and URL notification.
dependencies : array, optional
- List of packages to install from PyPI or git repository (i.e., Github
+ List of packages to install from PyPI or git repository (e.g., Github
or Bitbucket). If a private repo is specified, please include a
``git_token_name`` argument as well (see below). Make sure to pin
- dependencies to a specific version, since dependecies will be
+ dependencies to a specific version, since dependencies will be
reinstalled during every training and predict job.
git_token_name : str, optional
Name of remote git API token stored in Civis Platform as the password
@@ -713,6 +716,8 @@ def _get_template_ids(self, client):
global _CIVISML_TEMPLATE
if _CIVISML_TEMPLATE is None:
for t_id in sorted(_PRED_TEMPLATES)[::-1]:
if t_id in REGISTRATION_TEMPLATES:
continue
try:
# Check that we can access the template
client.templates.get_scripts(id=t_id)
@@ -783,6 +788,147 @@ def __setstate__(self, state):
template_ids = self._get_template_ids(self._client)
self.train_template_id, self.predict_template_id = template_ids

@classmethod
def register_pretrained_model(cls, model, dependent_variable=None,
features=None, primary_key=None,
model_name=None, dependencies=None,
git_token_name=None,
skip_model_check=False, verbose=False,
client=None):
"""Use a fitted scikit-learn model with CivisML scoring
Use this function to set up your own fitted scikit-learn-compatible
Estimator object for scoring with CivisML. This function will
upload your model to Civis Platform and store enough metadata
about it that you can subsequently use it with a CivisML scoring job.
The only required input is the model itself, but you are strongly
recommended to also provide a list of feature names. Without a list
of feature names, CivisML will have to assume that your scoring
table contains only the features needed for scoring (perhaps also
with a primary key column), all in the correct order.
Parameters
----------
model : sklearn.base.BaseEstimator or int
The model object. This must be a fitted scikit-learn compatible
Estimator object, or else the integer Civis File ID of a
pickle or joblib-serialized file which stores such an object.
dependent_variable : string or List[str], optional
The dependent variable of the training dataset.
For a multi-target problem, this should be a list of
column names of dependent variables.
features : string or List[str], optional
A list of column names of features which were used for training.
These will be used to ensure that tables input for prediction
have the correct features in the correct order.
primary_key : string, optional
The unique ID (primary key) of the scoring dataset.
model_name : string, optional
The name of the Platform registration job. It will have
" Predict" added to become the Script title for predictions.
dependencies : array, optional
List of packages to install from PyPI or git repository (e.g.,
GitHub or Bitbucket). If a private repo is specified, please
include a ``git_token_name`` argument as well (see below).
Make sure to pin dependencies to a specific version, since
dependencies will be reinstalled during every predict job.
git_token_name : str, optional
Name of remote git API token stored in Civis Platform as
the password field in a custom platform credential.
Used only when installing private git repositories.
skip_model_check : bool, optional
If you're sure that your model will work with CivisML, but it
will fail the comprehensive verification, set this to True.
verbose : bool, optional
If True, supply debug outputs in Platform logs and make
prediction child jobs visible.
client : :class:`~civis.APIClient`, optional
If not provided, an :class:`~civis.APIClient` object will be
created from the :envvar:`CIVIS_API_KEY`.
Returns
-------
:class:`~civis.ml.ModelPipeline`
Examples
--------
This example assumes that you already have training data
``X`` and ``y``, where ``X`` is a :class:`~pandas.DataFrame`.
>>> from civis.ml import ModelPipeline
>>> from sklearn.linear_model import Lasso
>>> est = Lasso().fit(X, y)
>>> model = ModelPipeline.register_pretrained_model(
... est, 'concrete', features=X.columns)
>>> model.predict(table_name='my.table', database_name='my-db')
"""
client = client or APIClient()

if isinstance(dependent_variable, six.string_types):
dependent_variable = [dependent_variable]
if isinstance(features, six.string_types):
features = [features]
if isinstance(dependencies, six.string_types):
dependencies = [dependencies]
if not model_name:
model_name = ("Pretrained {} model for "
"CivisML".format(model.__class__.__name__))
model_name = model_name[:255] # Max size is 255 characters

if isinstance(model, (int, float, six.string_types)):
model_file_id = int(model)
else:
try:
tempdir = tempfile.mkdtemp()
fout = os.path.join(tempdir, 'model_for_civisml.pkl')
joblib.dump(model, fout, compress=3)
with open(fout, 'rb') as _fout:
# NB: Using the name "estimator.pkl" means that
# CivisML doesn't need to copy this input to a file
# with a different name.
model_file_id = cio.file_to_civis(_fout, 'estimator.pkl',
client=client)
finally:
shutil.rmtree(tempdir)

args = {'MODEL_FILE_ID': str(model_file_id),
'SKIP_MODEL_CHECK': skip_model_check,
'DEBUG': verbose}
if dependent_variable is not None:
args['TARGET_COLUMN'] = ' '.join(dependent_variable)
if features is not None:
args['FEATURE_COLUMNS'] = ' '.join(features)
if dependencies is not None:
args['DEPENDENCIES'] = ' '.join(dependencies)
if git_token_name:
creds = find(client.credentials.list(),
name=git_token_name,
type='Custom')
if len(creds) != 1:
raise ValueError("Unique credential with name '{}' for "
"remote git hosting service not found!"
.format(git_token_name))
args['GIT_CRED'] = creds[0].id

template_id = max(REGISTRATION_TEMPLATES)
container = client.scripts.post_custom(
from_template_id=template_id,
name=model_name,
arguments=args)
log.info('Created custom script %s.', container.id)

run = client.scripts.post_custom_runs(container.id)
log.debug('Started job %s, run %s.', container.id, run.id)

fut = ModelFuture(container.id, run.id, client=client,
poll_on_creation=False)
fut.result()
log.info('Model registration complete.')

mp = ModelPipeline.from_existing(fut.job_id, fut.run_id, client)
mp.primary_key = primary_key
return mp

@classmethod
def from_existing(cls, train_job_id, train_run_id='latest', client=None):
"""Create a :class:`ModelPipeline` object from existing model IDs
@@ -887,7 +1033,8 @@ def from_existing(cls, train_job_id, train_run_id='latest', client=None):
'prediction code. Prediction will either fail '
'immediately or succeed.'
% (train_job_id, __version__), RuntimeWarning)
- p_id = max(_PRED_TEMPLATES.values())
+ p_id = max([v for k, v in _PRED_TEMPLATES.items()
+ if k not in REGISTRATION_TEMPLATES])
klass.predict_template_id = p_id

return klass
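
As a reading aid for the two template-selection changes above (in `_get_template_ids` and `from_existing`), here is a minimal standalone sketch of why registration templates have to be filtered out. The dictionary is a hypothetical, trimmed copy of `_PRED_TEMPLATES`: the `10582: 10583` pair is an assumption inferred from the `LATEST_TRAIN_TEMPLATE` and `LATEST_PRED_TEMPLATE` constants in the test changes, the helper names are invented for illustration, and the real code additionally checks that the client can access each template.

```python
# Hypothetical, trimmed copy of the module-level tables in this commit.
# Keys are training (or registration) template IDs; values are the
# prediction template IDs paired with them.
PRED_TEMPLATES = {
    10582: 10583,  # assumed latest training/prediction pair (from the tests)
    9112: 9113,    # v1.1
    8387: 9113,    # v1.0
    7020: 7021,    # v0.5
    11028: 10616,  # v2.2 registration
}
REGISTRATION_TEMPLATES = [11028]


def latest_training_template(pred_templates, registration_templates):
    """Return the highest template ID that is not a registration template."""
    candidates = [t for t in pred_templates if t not in registration_templates]
    return max(candidates)


def fallback_prediction_template(pred_templates, registration_templates):
    """Return the highest prediction template paired with a training template."""
    return max(v for k, v in pred_templates.items()
               if k not in registration_templates)


train_id = latest_training_template(PRED_TEMPLATES, REGISTRATION_TEMPLATES)
predict_id = fallback_prediction_template(PRED_TEMPLATES, REGISTRATION_TEMPLATES)
```

With the values assumed here, an unfiltered `max` over the keys would select the registration template 11028 for training, and an unfiltered `max` over the values would select 10616 as the prediction fallback, which is the behavior the two filters guard against.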
13 changes: 8 additions & 5 deletions civis/ml/tests/test_model.py
@@ -39,6 +39,10 @@
from civis.ml import _model


LATEST_TRAIN_TEMPLATE = 10582
LATEST_PRED_TEMPLATE = 10583


def setup_client_mock(script_id=-10, run_id=100, state='succeeded',
run_outputs=None):
"""Return a Mock set up for use in testing container scripts
@@ -682,7 +686,7 @@ def test_modelpipeline_init_newest():
mp = _model.ModelPipeline(LogisticRegression(), 'test', etl=etl,
client=mock_client)
assert mp.etl == etl
- assert mp.train_template_id == max(_model._PRED_TEMPLATES)
+ assert mp.train_template_id == LATEST_TRAIN_TEMPLATE
# clean up
_model._CIVISML_TEMPLATE = None

@@ -787,16 +791,15 @@ def test_modelpipeline_classmethod_constructor_defaults(
def test_modelpipeline_classmethod_constructor_future_train_version():
# Test handling attempts to restore a model created with a newer
# version of CivisML.
- current_max_template = max(_model._PRED_TEMPLATES)
- cont = container_response_stub(current_max_template + 1000)
+ cont = container_response_stub(LATEST_TRAIN_TEMPLATE + 1000)
mock_client = mock.Mock()
mock_client.scripts.get_containers.return_value = cont
mock_client.credentials.get.return_value = Response({'name': 'Token'})

# test everything is working fine
with pytest.warns(RuntimeWarning):
mp = _model.ModelPipeline.from_existing(1, 1, client=mock_client)
- exp_p_id = _model._PRED_TEMPLATES[current_max_template]
+ exp_p_id = _model._PRED_TEMPLATES[LATEST_TRAIN_TEMPLATE]
assert mp.predict_template_id == exp_p_id


@@ -892,7 +895,7 @@ def test_modelpipeline_train_df(mock_ccr, mock_stash, mp_setup):
train_data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
assert 'res' == mp.train(train_data)
mock_stash.assert_called_once_with(
- train_data, max(_model._PRED_TEMPLATES.keys()), client=mock.ANY)
+ train_data, LATEST_TRAIN_TEMPLATE, client=mock.ANY)
assert mp.train_result_ == 'res'


29 changes: 29 additions & 0 deletions docs/source/ml.rst
@@ -68,6 +68,9 @@ or by providing your own scikit-learn
Note that whichever option you choose, CivisML will pre-process your
data using either its default ETL, or ETL that you provide (see :ref:`custom-etl`).

If you have already trained a scikit-learn model outside of Civis Platform,
you can register it with Civis Platform as a CivisML model so that you can
use it for scoring with CivisML. See :ref:`model-registration` for details.

Pre-Defined Models
------------------
@@ -359,6 +362,32 @@ for solving a problem. For example:
train = [model.train(table_name='schema.name', database_name='My DB') for model in models]
aucs = [tr.metrics['roc_auc'] for tr in train] # Code blocks here

.. _model-registration:

Registering Models Trained Outside of Civis
===========================================

Instead of using CivisML to train your model, you may train any
scikit-learn-compatible model outside of Civis Platform and use
:meth:`civis.ml.ModelPipeline.register_pretrained_model` to register it
as a CivisML model in Civis Platform. This will let you use Civis Platform
to make predictions using your model, either to take advantage of distributed
predictions on large datasets, or to create predictions as part of
a workflow or service in Civis Platform.

When registering a model trained outside of Civis Platform, you are
strongly advised to provide an ordered list of feature names used
for training. This will allow CivisML to ensure that tables of data
input for predictions have the correct features in the correct order.
If your model has more than one output, you should also provide a list
of output names so that CivisML knows how many outputs to expect and
how to name them in the resulting table of model predictions.
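
To illustrate why the ordered feature list matters, the following minimal sketch reorders the rows of a scoring table to match the training order. It is not CivisML's implementation; the helper name ``select_features`` and the sample column names are hypothetical.

```python
def select_features(rows, feature_names):
    """Return each row's feature values in training order.

    ``rows`` is a list of dicts mapping column name to value, standing in
    for a scoring table. Extra columns (such as a primary key) are ignored.
    """
    ordered = []
    for row in rows:
        missing = [name for name in feature_names if name not in row]
        if missing:
            raise KeyError("Missing feature columns: {}".format(missing))
        ordered.append([row[name] for name in feature_names])
    return ordered


# Hypothetical scoring rows whose columns are not in training order.
scoring_rows = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "income": 61000.0, "age": 29},
]
features = ["income", "age"]  # order used when the model was fit
matrix = select_features(scoring_rows, features)
```

Without the feature list there is nothing to check the table against, which is why CivisML then has to assume that the table already holds exactly the right columns in the right order.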

If your model depends on packages which aren't part of the default CivisML
execution environment, you must list them in the ``dependencies``
parameter of the :meth:`~civis.ml.ModelPipeline.register_pretrained_model`
method, just as with the :class:`~civis.ml.ModelPipeline` constructor.
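
The registration workflow described in this section can be sketched as follows. The helper name ``register_and_score``, the column names ``target`` and ``id``, and the pinned package ``my-package==1.2.3`` are hypothetical placeholders. The pipeline class is passed in as an argument only so that the sketch can be exercised against a stand-in object without Platform credentials.

```python
def register_and_score(pipeline_cls, estimator, feature_names,
                       table_name, database_name):
    """Register a fitted estimator, then start a CivisML scoring job.

    ``pipeline_cls`` is expected to provide the ``register_pretrained_model``
    classmethod added in this commit.
    """
    model = pipeline_cls.register_pretrained_model(
        estimator,
        dependent_variable="target",           # hypothetical column name
        features=list(feature_names),          # ordered as used in training
        primary_key="id",                      # hypothetical column name
        dependencies=["my-package==1.2.3"],    # hypothetical pinned package
    )
    return model.predict(table_name=table_name,
                         database_name=database_name)
```

With the Civis API client installed and :envvar:`CIVIS_API_KEY` set, ``pipeline_cls`` would be :class:`~civis.ml.ModelPipeline`, as in the docstring example for ``register_pretrained_model``.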


Object reference
================