This repository has been archived by the owner on Sep 3, 2022. It is now read-only.

Commit

Cloudmlm (#152)

* Add gcs_copy_file() that is missing but is referenced in a couple of places. (#110)

* Add gcs_copy_file() that is missing but is referenced in a couple of places.

* Add DataFlow to pydatalab dependency list.

* Fix travis test errors by reimplementing gcs copy.

* Remove unnecessary shutil import.

* Flake8 configuration. Set max line length to 100. Ignore E111, E114 (#102)

* Add datalab user agent to CloudML trainer and predictor requests. (#112)

* Update oauth2client to 2.2.0 to satisfy cloudml in Cloud Datalab (#111)

* Update README.md (#114)

Added docs link.

* Generate reST documentation for magic commands (#113)

Auto-generate docs for any added magics by searching the source files for lines containing register_line_cell_magic, capturing the names of those magics, calling each one inside an IPython kernel with the -h argument, and storing that output in a generated datalab.magics.rst file.
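
A minimal sketch of the scanning step described above, assuming decorator-style registration and illustrative file paths (the actual generator script may differ):

import glob
import re

magic_names = set()
for path in glob.glob('datalab/**/*.py', recursive=True):
  with open(path) as f:
    source = f.read()
  # Capture the function name defined right after the registration decorator.
  for match in re.finditer(r'@register_line_cell_magic\s+def\s+(\w+)', source):
    magic_names.add(match.group(1))

with open('datalab.magics.rst', 'w') as out:
  for name in sorted(magic_names):
    out.write(name + '\n' + '=' * len(name) + '\n\n')
    # The real generator runs "%<name> -h" inside an IPython kernel and writes the
    # captured argparse help output under each heading; a placeholder is used here.
    out.write('(help output for %%' + name + ' goes here)\n\n')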

* Fix an issue where %%chart failed with a UDF query. (#116)

* Fix an issue where %%chart failed with a UDF query.

The problem is that the query was submitted to BigQuery without replacing variable values from the user namespace.

* Fix chart tests by adding ip.user_ns mock.

* Fix charting test.

* Add missing import "mock".

* Fix chart tests.

* Fix "%%bigquery schema" issue --  the command generates nothing in output. (#119)

* Add some missing dependencies, remove some unused ones (#122)

* Remove scikit-learn and scipy as dependencies
* add more required packages
* Add psutil as dependency
* Update packages versions

* Cleanup (#123)

* Remove unnecessary semicolons

* remove unused imports

* remove unnecessarily defined variable

* Fix query_metadata tests (#128)

Fix query_metadata tests

* Make the library pip-installable (#125)

This PR adds tensorflow and cloudml to setup.py to make the lib pip-installable. I had to install them explicitly using pip from inside the setup.py script (a sketch follows the list below); even though it's not a clean way to do it, it gets around the two issues we have at the moment with these two packages:
- PyPI has TensorFlow version 0.12, while we need 0.11 for the current version of pydatalab. According to the Cloud ML docs, that version exists as a pip package for three supported platforms.
- The Cloud ML SDK exists as a pip package but is also not on PyPI, and while we could add it as a dependency link, there is another package on PyPI called cloudml, and pip ends up installing that instead (see #124). I cannot find a way to force pip to install the package from the link I included.
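
A minimal sketch of that workaround, assuming a helper of this shape inside setup.py (the package URLs below are placeholders, not the real links):

import subprocess
import sys

def _pip_install(spec):
  # Use the running interpreter's pip so the packages land in the active environment.
  subprocess.check_call([sys.executable, '-m', 'pip', 'install', spec])

# Placeholders: the real script would point at the platform-specific TensorFlow 0.11
# package and the Cloud ML SDK package, neither of which can be pulled from PyPI by name.
_pip_install('<tensorflow-0.11-package-url>')
_pip_install('<cloudml-sdk-package-url>')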

* Set the command description so it is displayed in --help; argparse's format_help() prints the description but not the help text. (#131)
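
For context, a small argparse illustration of that behavior (the parser and subcommand names here are hypothetical); format_help() on a subparser prints its description, so both help and description should be set:

import argparse

parser = argparse.ArgumentParser(prog='datalab')
subparsers = parser.add_subparsers()
# 'help' appears in the parent parser's command list; 'description' is what the
# subcommand's own --help / format_help() prints, so set both to the same text.
deploy = subparsers.add_parser('deploy',
                               help='Deploy a model.',
                               description='Deploy a model.')
print(deploy.format_help())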

* Fix an issue where setting the project id from Datalab does not set the gcloud default project. (#136)

* Add future==0.16.0 as a dependency since it's required by CloudML SDK (#143)

As of the latest release of CloudML Python SDK, that package seems to require future==0.16.0, so until it's fixed, we'll take it as a dependency.

* Remove tensorflow and CloudML SDK from setup.py (#144)

* Install TensorFlow 0.12.1.

* Remove TensorFlow and CloudML SDK from setup.py.

* Add comments explaining why we ignore errors when importing mlalpha.

* Adding evaluationanalysis API to generate evaluation stats from eval … (#99)

* Adding evaluationanalysis API to generate evaluation stats from eval source CSV file and eval results CSV file.

The resulting stats file will be fed to a visualization component which will come in a separate change.

* Follow up CR comments.

* Feature slicing view visualization component. (#109)

* Datalab Inception (image classification) solution. (#117)

* Datalab Inception (image classification) solution.

* Fix dataflow URL.

* Datalab "ml" magics for running a solution package. Update Inception Package. (#121)

* Datalab Inception (image classification) solution.

* Fix dataflow URL.

* Datalab "ml" magics for running a solution package.
 - Dump function args and docstrings
 - Run functions
Update Inception Package.
 - Added docstring on face functions.
 - Added batch prediction.
 - Use datalab's lib for talking to cloud training and prediction service.
 - More minor fixes and changes.
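
A minimal sketch of the "dump function args and docstrings" idea mentioned above, using Python's inspect module (Python 3; the example function is hypothetical, not part of the package):

import inspect

def describe_function(fn):
  # Print a callable's signature and docstring, roughly the information a
  # solution-package runner needs to surface to the user.
  print('%s%s' % (fn.__name__, inspect.signature(fn)))
  print(inspect.getdoc(fn) or '(no docstring)')

def preprocess(input_csv, output_dir, batch_size=64):
  """Hypothetical package entry point, used only for this illustration."""
  pass

describe_function(preprocess)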

* Follow up code review comments.

* Fix a PackageRunner issue where the temp installation is done multiple times unnecessarily.

* Update the feature-slice-view supporting file, which fixes some UI stability issues. (#126)

* Remove the old feature-slicing pipeline implementation (replaced by BigQuery). Add confusion matrix magic. (#129)

* Remove the old feature-slicing pipeline implementation (replaced by BigQuery).
Add confusion matrix magic.

* Follow up on code review comments. Also fix an inception issue where eval loss is NaN when the eval set is smaller than the batch size.

* Fix set union.

* Mergemaster/cloudml (#134)

* Fix an issue where prediction right after preprocessing fails in an inception package local run. (#135)

* add structured data preprocessing and training (#132)

merging the preprocessing and training parts.

* first full-featured version of structured data is done (#139)

* added the preprocessing/training files.

Preprocessing is connected with datalab. Training is not fully connected
with datalab.

* added training interface.

* local/cloud training ready for review

* saving work

* saving work

* cloud online prediction is done.

* split config file into two (schema/transforms) and updated the
unittests.

* local preprocess/train working

* 1) merged --model_type and --problem_type
2) online/local prediction is done

* added batch prediction

* all prediction is done. Going to make a merge request next

* Update _package.py

removed some whitespace + added a print statement to local_predict

* --preprocessing puts a copy of the schema in the output dir.
--no need to pass the schema to train in datalab.

* tests can be run from any folder above the test folder by

python -m unittest discover

Also, the training test will parse the output of training and check that
the loss is small.

* Inception Package Improvements (#138)

* Fix an issue where prediction right after preprocessing fails in an inception package local run.

* Remove the "labels_file" parameter from inception preprocess/train/predict. Instead it will get labels from training data. Prediction graph will return labels.
Make online prediction works with GCS images.
"%%ml alpha deploy" now also check for "/model" subdir if needed.
Other minor improvements.

* Make local batch prediction really batched.
Batch prediction input does not have to include the target column.
Sort labels so they are consistent between preprocessing and training.
Follow up on other code review comments.

* Follow up code review comments.
qimingj authored Feb 1, 2017
1 parent e92b790 commit b4e096e
Showing 3 changed files with 37 additions and 10 deletions.
33 changes: 25 additions & 8 deletions datalab/context/_utils.py
@@ -19,6 +19,7 @@
 import oauth2client.client
 import json
 import os
+import subprocess


 # TODO(ojarjur): This limits the APIs against which Datalab can be called
@@ -90,21 +91,37 @@ def save_project_id(project_id):
   Args:
     project_id: the project_id to save.
   """
-  config_file = os.path.join(get_config_dir(), 'config.json')
-  config = {}
-  if os.path.exists(config_file):
-    with open(config_file) as f:
-      config = json.loads(f.read())
-  config['project_id'] = project_id
-  with open(config_file, 'w') as f:
-    f.write(json.dumps(config))
+  # Try gcloud first. If gcloud fails (probably because it does not exist), then
+  # write to a config file.
+  try:
+    subprocess.call(['gcloud', 'config', 'set', 'project', project_id])
+  except:
+    config_file = os.path.join(get_config_dir(), 'config.json')
+    config = {}
+    if os.path.exists(config_file):
+      with open(config_file) as f:
+        config = json.loads(f.read())
+    config['project_id'] = project_id
+    with open(config_file, 'w') as f:
+      f.write(json.dumps(config))


 def get_project_id():
   """ Get default project id from config or environment var.
   Returns: the project id if available, or None.
   """
+  # Try getting default project id from gcloud. If it fails try config.json.
+  try:
+    proc = subprocess.Popen(['gcloud', 'config', 'list', '--format', 'value(core.project)'],
+                            stdout=subprocess.PIPE)
+    stdout, _ = proc.communicate()
+    value = stdout.strip()
+    if proc.poll() == 0 and value:
+      return value
+  except:
+    pass
+
   config_file = os.path.join(get_config_dir(), 'config.json')
   if os.path.exists(config_file):
     with open(config_file) as f:
12 changes: 11 additions & 1 deletion datalab/kernel/__init__.py
@@ -31,11 +31,21 @@
 import datalab.bigquery.commands
 import datalab.context.commands
 import datalab.data.commands
-import datalab.mlalpha.commands
 import datalab.stackdriver.commands
 import datalab.storage.commands
 import datalab.utils.commands

+# mlalpha modules require TensorFlow, CloudML SDK, and DataFlow (installed with CloudML SDK).
+# These are big dependencies and users who want to use Bigquery/Storage features may not
+# want to install them.
+# This __init__.py file is called when Jupyter/Datalab loads magics on startup. We don't want
+# Jupyter+pydatalab fail to start because of missing TensorFlow/DataFlow. So we ignore import
+# errors on mlalpha commands.
+try:
+  import datalab.mlalpha.commands
+except:
+  print('TensorFlow and CloudML SDK are required.')
+

 _orig_request = _httplib2.Http.request
 _orig_init = _requests.Session.__init__
2 changes: 1 addition & 1 deletion setup.py
Expand Up @@ -73,7 +73,7 @@
for accessing Google's Cloud Platform services such as Google BigQuery.
""",
install_requires=[
'future==0.15.2',
'future==0.16.0',
'futures==3.0.5',
'google-cloud==0.19.0',
'google-api-python-client==1.5.1',

