-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python Druid API for use in notebooks #13787
Conversation
Revises existing notebooks and readme to reference the new API. Notebook to explain the new API. Split README into a console version and a notebook version to work around lack of a nice display for md files.
Update the REST API notebook to use simpler Requests calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution @paul-rogers ! Mostly requesting style changes, but a couple technical questions here & there.
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"druid = druidapi.client('http://localhost:8888')" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this throw an error if it doesn't connect to druid. If I don't have Druid running, it doesn't let me know something's amiss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another step, a cell or two down, that shows how to wait for the cluster to become ready. We could create a new method that does both of these steps.
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
" WHERE TABLE_SCHEMA = 'druid'\n", | ||
" ORDER BY TABLE_NAME\n", | ||
" \n", | ||
"POST: http://localhost:8888/druid/v2/sql\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You raise to questions. First, clearing the notebook. I agree that we should normally do that. Since this one is a quick intro to the API, I experimented with leaving the output as part of the reference so the user doesn't have to step through the notebook to see what the various methods do. I can be persuaded otherwise, especially if we eventually create real documentation for the library. (There are tools to generate documentation from the Python code itself.)
On the rendering, yes, it is odd. The reason is that the above prints a string that has embedded newlines and spaces because the SQL statement itself has these. It looks a bit odd, but it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some wording & style suggestions
Converted the SQL tutorial to use the Python library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@techdocsmith and @writer-jill, thanks much for the review. I believe I addressed all the edits: please double check to be certain.
I couldn't resist the temptation to convert the SQL tutorial to use the Python library. Try it out, I think the result is a better focus on the great SQL explanation and fewer fiddly bits.
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the API client code LGTM. Just had a few questions and some nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the Python API notebook. It's great.
Wondering if there's a case for starting streaming from a notebook. Maybe a data scientist in dev phase wanting to transform a stream with query results from Druid and adding ML results back to a stream. Not sure if having such an API would help them much more than just using Druid UI to start the stream. Perhaps if it were designed to be really easy. Like "start_stream( kafka_broker, topic, datasource)"... with schemaless, I think that would be pretty useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on the python APIs - this would be a good addition for folks to try out Druid programmatically and test things. I mostly reviewed the python bits and left a few comments. Thanks!
- An available Druid instance. You can use the [Quickstart (local)](./index.md) instance. The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned. | ||
One of the notebooks shows how to use the Druid REST API. The others focus on other | ||
topics and use a simple set of Python wrappers around the underlying REST API. The | ||
wrappers reside in the `druidapi` package within the notebooks directory. While the package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if it'd make sense to pull the python package outside the context of the jupyter notebook so it can be reused for other things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good longer-term goal. For now, we're putting our toe in the water by including the code here. From there, we can see if there is broader interest besides just as a training tool.
Fixes from automated tests
There probably is a case for this. The thought is to start small, then expand to cover more of Druid's REST API. As you can see, just reviewing this partial subset is a bit of a project! A thought is that, as the Doc folks add tutorials, we'll naturally need to add more to the API. Also once this library is available, others can add to it to wrap APIs they need. |
Broke display-related operations into a display client Refactored HTML support Adjusted notebooks based on the above A README file explains the druidapi library. Used the Python property feature for zero-argument getters Various bug fixes
Did some last-minute cleanup before this library goes public:
The result is that the functionality is the same, but a few operations moved around. For example, to display the results of a SQL statement was |
drive by comment: maybe there should be tests for this library if we are going to encourage people to use it? (not to say that this necessarily needs to block merging) we do have some python tests that run for the old 'should we run this test' script that travis was using, https://github.com/apache/druid/blob/master/.github/workflows/static-checks.yml#L64 which runs https://github.com/apache/druid/blob/master/check_test_suite_test.py (not sure this is useful) |
@techdocsmith, @vtlim, @sergioferragut - I addressed your comments, and improved the code a bit. Please do a final review at your leisure. We can then get this code in and continue to iterate when the code is in a form we can all access. |
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
# The following is hack to check for null values, empty strings and numeric 0s. | ||
is_null[key] = not not value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat 😮
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
@clintropolis, it turns out that there is a test suite that I run locally. We can add that in a follow-on PR. This one is large enough (and being reviewed by our doc contributors!) that I didn't want to add complexity. We've also got to sort out packaging, which would be a good time to figure out where the test should live. The tests need a server. On a separate PR, I'm working on extending the "new" ITs to allow tests to live in modules other than the |
@vtlim, thanks much for your thorough copy editing! Fixed all the corrections. @techdocsmith, you marked this PR as "Changes requested". Can you take another look to see if I've addressed your requests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making those changes! Looks like there are some missed comments on Python_API_Tutorial.ipynb
. Also came across a few more typos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paul-rogers i think we're almost there. Looks like there are empty code blocks at the end of all three notebooks in this request. With that and changes requested from @vtlim I think we're ready.
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb
Outdated
Show resolved
Hide resolved
commit suggestions Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python side LGTM 🥳
Python Druid API for use in notebooks Revises existing notebooks and readme to reference the new API. Notebook to explain the new API. Split README into a console version and a notebook version to work around lack of a nice display for md files. Update the REST API notebook to use simpler Requests calls Converted the SQL tutorial to use the Python library README file, converted to using properties --------- Co-authored-by: Charles Smith <techdocsmith@gmail.com> Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@paul-rogers can you provide some context on how this might overlap with the pydruid project and whether we considered joining forces with the maintainers? https://github.com/druid-io/pydruid |
@xvrl, thanks for the question. It seems that PyDruid has historically focused on queries: primarily for native queries, also for SQL. PyDruid has great integrations with things like SqlAlchemy, Pandas, etc. The Python API here is primarily for the operational aspects of the Druid REST API. It allows those of us who tinker with Druid to create notebooks to drive various Druid tasks. This API also has query features, including those not (yet) supported in PyDruid, such as support for MSQ, both for ingestion and for query. (MSQ query, at the moment, is a bit awkward, but it does work.) A way to combine efforts would be to expand PyDruid to offer the non-query features in this API, as well as including (experimental) MSQ query support. Since PyDruid has historically been query-focused, there could be two "PyDruid+" modules: one for the classic query tasks, another for the other Druid REST API features. Combining the two APIs would certainly make life simpler for users: just one library to import rather than two. How do you suggest we might proceed? |
We recently introduced a Jupyter notebook to explain how to use the Druid REST API. That notebook works though the detailed, low-level steps to use the Python
requests
package to make selected Druid REST calls. The result is excellent for explaining the mechanics of each REST call, but a bit fiddly if we want to create a notebook that explains Druid concepts (rather than REST concepts.)In response, this PR introduces a Python-based Druid API (or SDK) that handles the boilerplate of using
requests
to call Druid and parse the JSON results. What was many lines of Python code at the REST level is now a single line of code at the Druid SDK level.The PR includes an "intro to the Python API" notebook which is not an exhaustive reference. Rather, it gives just enough information to see how to use the API, and explains how to get the details. A later PR will include a notebook that uses this API to explain a new feature.
The PR also revises existing notebooks and the README to reference the new API. The previous
README.md
didn't play when viewed in Jupyter: it seems Jupyter doesn't know how to render such files (oddly). Instead, the file was split into a "pre-Juypter" README that explains how to get started, and a post-Jupyter "- START HERE -" notebook that has the rest of the information once Jupyter is up and running.Updated the doc files accordingly.
Note that this PR changes no "production" code: everything is confined to the
quickstart
examples.Release note
This release extends the tutorial Jupyter notebooks to add a simple Python API to access Druid, along with a notebook that explains how to use that API.
This PR has: