Adding a working Docker setup for developing sparkmagic #361
Conversation
It includes the Jupyter notebook as well as the Livy+Spark endpoint. Documentation is in the README.
Now you can just launch a PySpark wrapper kernel and have it work out of the box.
This is fantastic to see! 🙇
I think it's pretty much ready except for a few minor comments. I'm trying it out now, but wanted to give some feedback ASAP. I'll add more comments if I find anything once I try it with Docker.
ENV LIVY_BUILD_VERSION livy-server-0.3.0
ENV LIVY_APP_PATH /apps/$LIVY_BUILD_VERSION
ENV LIVY_BUILD_PATH /apps/build/livy
We need to create a Python 3 environment and set the PYSPARK3_PYTHON variable as explained in https://github.com/cloudera/livy/blob/511a05f2282cd85a457017cc5a739672aaed5238/README.rst#pyspark3
I'd recommend installing Anaconda to create the environment.
I guess it would be the same for PYSPARK_PYTHON, but if it's not set, it will just use the system's Python, which is fine as long as it's a 2.7.x version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something similar to this:
ANACONDA_DEST_PATH=/usr/bin/anaconda
CONDA_PATH=$ANACONDA_DEST_PATH/bin/conda
INSTALLER_PATH=/tmp/anaconda-installer.sh   # where the installer gets downloaded (path is illustrative)
# TODO: wget the Anaconda installer to $INSTALLER_PATH
bash $INSTALLER_PATH -p $ANACONDA_DEST_PATH -b -f
# Create an env for Python 3.
$CONDA_PATH create -n py35 python=3.5 anaconda
I've made the other changes, but I'm not sure I understand the need for this one. The base image I'm using for the Livy container (gettyimages/spark:2.1.0-hadoop-2.7) already has a Python 3 installation, which seems to be working just fine with Livy and sparkmagic as-is.
Do you just mean creating a virtualenv for it? I can do that, though I'm not sure it's necessary if we're just running inside a single-purpose container.
Sorry I wasn't clear. I meant we need to install the following two kernels, in addition to the kernels already being installed:
- pysparkkernel
- sparkkernel
You can just add sparkkernel to the list of kernels to install, and it should just work. pysparkkernel is different in that it should run in a Python 2.7 environment, whereas pyspark3kernel would run in a Python 3 environment. Because the container's Python installation is Python 3, we need to tell Livy where to find the Python 2 installation in the image (created via virtualenv, Anaconda, or something else). The way Livy finds the two installations is via the env variables I mentioned above.
The point would be to show users that they can have two Python versions running side by side, and how to set it up. This is some extra work, so I would be OK with you just adding sparkkernel for now and me creating an issue to add pysparkkernel to the Docker image later.
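A hedged illustration of what that looks like in the Livy container's Dockerfile (the interpreter paths below are assumptions for illustration, not taken from this PR):
# Livy resolves the interpreter for pyspark vs. pyspark3 sessions from these variables
ENV PYSPARK_PYTHON /usr/bin/python2.7
ENV PYSPARK3_PYTHON /usr/bin/anaconda/envs/py35/bin/python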
Ah, I see what you mean then :) Okay, let me add that real quick.
Dockerfile.jupyter
COPY sparkmagic/example_config.json /home/$NB_USER/.sparkmagic/config.json
RUN sed -i 's/localhost/spark/g' /home/$NB_USER/.sparkmagic/config.json
RUN jupyter nbextension enable --py --sys-prefix widgetsnbextension
RUN jupyter-kernelspec install --user $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pyspark3kernel
I think we should be installing:
- sparkkernel
- pysparkkernel
- pyspark3kernel
- sparkrkernel (?)
Is R properly set up for Livy on the image? I'll try it as soon as I'm done with the review
Yeah, so R is not set up properly for Livy. Let's disable installing the SparkR kernel.
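Putting this thread together, a sketch of the extra Dockerfile.jupyter lines it implies, reusing the pip show / jupyter-kernelspec pattern already in the diff (pyspark3kernel is already installed above; SparkR is omitted since it isn't set up):
RUN jupyter-kernelspec install --user $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/sparkkernel
RUN jupyter-kernelspec install --user $(pip show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel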
README.md
You will then be able to access the Jupyter notebook in your browser at
http://localhost:8888. Inside this notebook, you can configure a
sparkmagic endpoint at http://spark:8998. This endpoint is able to
launch both Scala and Python sessions.
We should also mention that the managed sparkmagic kernels will be available.
All except SparkR.
Done
then simply run:

docker-compose build
docker-compose up
I think it would also be a good idea to add instructions for exiting... Ctrl-C + docker-compose down?
Done
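For reference, a minimal sketch of the shutdown flow being documented (commands as suggested above):
# Ctrl-C stops the foreground docker-compose process; then tear the containers down:
docker-compose down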
Also added an R section to example_config.json to make it work out of the box - and I think it's just a good thing to have it anyway, otherwise how would users ever know it was meant to be there?
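For illustration, an R endpoint section would follow the same shape as the existing Python and Scala credential sections in example_config.json (the key name below is an assumption based on the config's existing naming pattern):
"kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998"
}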
Added a dev_mode build-arg, disabled by default. When enabled, it builds the container using your local copy of sparkmagic, so that you can test your development changes inside the container.
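A hedged example of how such a build could be invoked (only the dev_mode arg and Dockerfile.jupyter name come from this PR; the image tag is illustrative):
# Build the Jupyter image against your local sparkmagic checkout
docker build --build-arg dev_mode=true -f Dockerfile.jupyter -t sparkmagic-jupyter .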
Thanks for the updates! If you can just add ...
Thanks for your contribution!
Added the missing kernels: Scala and Python2 were missing. Confirmed that Python2 and Python3 are indeed separate environments on the spark container.
Comments addressed :) Debian has separate packages and environments for python2 and python3, so it was just a matter of making sure both are installed. I confirmed manually that they are in fact in separate environments and that the right kernel points to the right version.
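A sketch of what that amounts to in the spark container's Dockerfile, assuming stock Debian package names (illustrative, not copied from the PR):
# python installs CPython 2.7 and python3 installs CPython 3.x; they live in separate environments
RUN apt-get update && apt-get install -y python python3 && rm -rf /var/lib/apt/lists/*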
Thanks for the awesome contribution! I know a lot of people will find it very useful.
It includes the Jupyter notebook as well as the Livy+Spark endpoint.
Documentation is in the README.
Tested and works on my own machine (which, since this is Docker, means it should work anywhere). You just
docker-compose build && docker-compose up
and then launch http://localhost:8888 and add http://spark:8998 as a Livy endpoint from %manage_spark.
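In shell form, the quick start described above:
docker-compose build && docker-compose up
# then open http://localhost:8888 and add http://spark:8998 as a Livy endpoint via %manage_spark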