Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python Druid API for use in notebooks #13787

Merged
merged 23 commits into from
Mar 5, 2023
Merged

Conversation

paul-rogers
Copy link
Contributor

We recently introduced a Jupyter notebook to explain how to use the Druid REST API. That notebook works though the detailed, low-level steps to use the Python requests package to make selected Druid REST calls. The result is excellent for explaining the mechanics of each REST call, but a bit fiddly if we want to create a notebook that explains Druid concepts (rather than REST concepts.)

In response, this PR introduces a Python-based Druid API (or SDK) that handles the boilerplate of using requests to call Druid and parse the JSON results. What was many lines of Python code at the REST level is now a single line of code at the Druid SDK level.

The PR includes an "intro to the Python API" notebook which is not an exhaustive reference. Rather, it gives just enough information to see how to use the API, and explains how to get the details. A later PR will include a notebook that uses this API to explain a new feature.

The PR also revises existing notebooks and the README to reference the new API. The previous README.md didn't play when viewed in Jupyter: it seems Jupyter doesn't know how to render such files (oddly). Instead, the file was split into a "pre-Juypter" README that explains how to get started, and a post-Jupyter "- START HERE -" notebook that has the rest of the information once Jupyter is up and running.

Updated the doc files accordingly.

Note that this PR changes no "production" code: everything is confined to the quickstart examples.

Release note

This release extends the tutorial Jupyter notebooks to add a simple Python API to access Druid, along with a notebook that explains how to use that API.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.

Revises existing notebooks and readme to reference
the new API.

Notebook to explain the new API.

Split README into a console version and a notebook
version to work around lack of a nice display for
md files.
Copy link
Contributor

@techdocsmith techdocsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution @paul-rogers ! Mostly requesting style changes, but a couple technical questions here & there.

examples/quickstart/jupyter-notebooks/-START HERE-.ipynb Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/-START HERE-.ipynb Outdated Show resolved Hide resolved
docs/tutorials/tutorial-jupyter-index.md Outdated Show resolved Hide resolved
"metadata": {},
"outputs": [],
"source": [
"druid = druidapi.client('http://localhost:8888')"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this throw an error if it doesn't connect to druid. If I don't have Druid running, it doesn't let me know something's amiss.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another step, a cell or two down, that shows how to wait for the cluster to become ready. We could create a new method that does both of these steps.

" WHERE TABLE_SCHEMA = 'druid'\n",
" ORDER BY TABLE_NAME\n",
" \n",
"POST: http://localhost:8888/druid/v2/sql\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should normally clear the outputs before we check in, but this is rendering a little strangely for me. Not sure if we can add an additional newline to the library to give some space between the trace
Screen Shot 2023-02-21 at 3 23 43 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You raise to questions. First, clearing the notebook. I agree that we should normally do that. Since this one is a quick intro to the API, I experimented with leaving the output as part of the reference so the user doesn't have to step through the notebook to see what the various methods do. I can be persuaded otherwise, especially if we eventually create real documentation for the library. (There are tools to generate documentation from the Python code itself.)

On the rendering, yes, it is odd. The reason is that the above prints a string that has embedded newlines and spaces because the SQL statement itself has these. It looks a bit odd, but it works.

examples/quickstart/jupyter-notebooks/api-tutorial.ipynb Outdated Show resolved Hide resolved
Copy link
Contributor

@writer-jill writer-jill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some wording & style suggestions

docs/tutorials/tutorial-jupyter-index.md Outdated Show resolved Hide resolved
docs/tutorials/tutorial-jupyter-index.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb Outdated Show resolved Hide resolved
Converted the SQL tutorial to use the Python library
Copy link
Contributor Author

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@techdocsmith and @writer-jill, thanks much for the review. I believe I addressed all the edits: please double check to be certain.

I couldn't resist the temptation to convert the SQL tutorial to use the Python library. Try it out, I think the result is a better focus on the great SQL explanation and fewer fiddly bits.

Copy link
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the API client code LGTM. Just had a few questions and some nits.

examples/quickstart/jupyter-notebooks/druidapi/catalog.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/tasks.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/tasks.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/tasks.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/tasks.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/util.py Outdated Show resolved Hide resolved
Copy link
Contributor

@sergioferragut sergioferragut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the Python API notebook. It's great.

Wondering if there's a case for starting streaming from a notebook. Maybe a data scientist in dev phase wanting to transform a stream with query results from Druid and adding ML results back to a stream. Not sure if having such an API would help them much more than just using Druid UI to start the stream. Perhaps if it were designed to be really easy. Like "start_stream( kafka_broker, topic, datasource)"... with schemaless, I think that would be pretty useful.

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the python APIs - this would be a good addition for folks to try out Druid programmatically and test things. I mostly reviewed the python bits and left a few comments. Thanks!

docs/tutorials/tutorial-jupyter-index.md Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/sql.py Outdated Show resolved Hide resolved
- An available Druid instance. You can use the [Quickstart (local)](./index.md) instance. The tutorials assume that you are using the quickstart, so no authentication or authorization is expected unless explicitly mentioned.
One of the notebooks shows how to use the Druid REST API. The others focus on other
topics and use a simple set of Python wrappers around the underlying REST API. The
wrappers reside in the `druidapi` package within the notebooks directory. While the package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder if it'd make sense to pull the python package outside the context of the jupyter notebook so it can be reused for other things?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good longer-term goal. For now, we're putting our toe in the water by including the code here. From there, we can see if there is broader interest besides just as a training tool.

docs/tutorials/tutorial-jupyter-index.md Outdated Show resolved Hide resolved
@paul-rogers
Copy link
Contributor Author

@sergioferragut:

Wondering if there's a case for starting streaming from a notebook.

There probably is a case for this. The thought is to start small, then expand to cover more of Druid's REST API. As you can see, just reviewing this partial subset is a bit of a project! A thought is that, as the Doc folks add tutorials, we'll naturally need to add more to the API. Also once this library is available, others can add to it to wrap APIs they need.

Broke display-related operations into a display client
Refactored HTML support
Adjusted notebooks based on the above
A README file explains the druidapi library.
Used the Python property feature for zero-argument getters
Various bug fixes
@paul-rogers
Copy link
Contributor Author

paul-rogers commented Feb 27, 2023

Did some last-minute cleanup before this library goes public:

  • Refactored the "display" items in the API into a separate display client. This makes clear which methods are for use in a notebook (because they display an HTML table) and which are just plain old Python methods.
  • Refactored HTML support to be cleaner and better separated from the rest of the code.
  • Simplified the lines needed to use the library by building the Juypter-specific items into a new jupyter_client() method.
  • Used the Python @property annotation to simplify zero-argument getters: druid.sql rather than druid.sql(), say.
  • Created a test suite which is not yet part of this PR. Running that suite found several bugs and several areas that could e simpler.
  • Updated notebooks to reflect the code changes.
  • Reran the druidapi based notebooks to verify things work.
  • Added a README.md file for the library. It is a bit minimal at present, but is at least a start.

The result is that the functionality is the same, but a few operations moved around. For example, to display the results of a SQL statement was sql_client.show(sql). It is now display_client.sql(sql). And so on.

@clintropolis
Copy link
Member

drive by comment: maybe there should be tests for this library if we are going to encourage people to use it? (not to say that this necessarily needs to block merging)

we do have some python tests that run for the old 'should we run this test' script that travis was using, https://github.com/apache/druid/blob/master/.github/workflows/static-checks.yml#L64 which runs https://github.com/apache/druid/blob/master/check_test_suite_test.py (not sure this is useful)

@paul-rogers
Copy link
Contributor Author

@techdocsmith, @vtlim, @sergioferragut - I addressed your comments, and improved the code a bit. Please do a final review at your leisure. We can then get this code in and continue to iterate when the code is in a form we can all access.

docs/tutorials/tutorial-jupyter-index.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/rest.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/rest.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/rest.py Outdated Show resolved Hide resolved
Comment on lines +144 to +145
# The following is hack to check for null values, empty strings and numeric 0s.
is_null[key] = not not value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat 😮

examples/quickstart/jupyter-notebooks/druidapi/sql.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/sql.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/sql.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/sql.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/sql.py Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
@paul-rogers
Copy link
Contributor Author

paul-rogers commented Feb 28, 2023

@clintropolis, it turns out that there is a test suite that I run locally. We can add that in a follow-on PR. This one is large enough (and being reviewed by our doc contributors!) that I didn't want to add complexity. We've also got to sort out packaging, which would be a good time to figure out where the test should live.

The tests need a server. On a separate PR, I'm working on extending the "new" ITs to allow tests to live in modules other than the cases module. The thought is to use that feature to run the Python unit tests for this library against a running IT cluster. The tests run against such a cluster by hand, so the next step is to do so automatically.

@paul-rogers
Copy link
Contributor Author

@vtlim, thanks much for your thorough copy editing! Fixed all the corrections.

@techdocsmith, you marked this PR as "Changes requested". Can you take another look to see if I've addressed your requests?

Copy link
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making those changes! Looks like there are some missed comments on Python_API_Tutorial.ipynb. Also came across a few more typos.

examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
examples/quickstart/jupyter-notebooks/druidapi/README.md Outdated Show resolved Hide resolved
Copy link
Contributor

@techdocsmith techdocsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paul-rogers i think we're almost there. Looks like there are empty code blocks at the end of all three notebooks in this request. With that and changes requested from @vtlim I think we're ready.

paul-rogers and others added 2 commits March 2, 2023 13:57
commit suggestions

Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Copy link
Contributor

@techdocsmith techdocsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python side LGTM 🥳

@paul-rogers paul-rogers merged commit a580aca into apache:master Mar 5, 2023
@paul-rogers paul-rogers deleted the 230209-py branch March 5, 2023 02:25
317brian pushed a commit to 317brian/druid that referenced this pull request Mar 10, 2023
Python Druid API for use in notebooks

Revises existing notebooks and readme to reference
the new API.

Notebook to explain the new API.

Split README into a console version and a notebook
version to work around lack of a nice display for
md files.

Update the REST API notebook to use simpler Requests calls

Converted the SQL tutorial to use the Python library

README file, converted to using properties

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
@xvrl
Copy link
Member

xvrl commented May 12, 2023

@paul-rogers can you provide some context on how this might overlap with the pydruid project and whether we considered joining forces with the maintainers? https://github.com/druid-io/pydruid

@paul-rogers
Copy link
Contributor Author

@xvrl, thanks for the question. It seems that PyDruid has historically focused on queries: primarily for native queries, also for SQL. PyDruid has great integrations with things like SqlAlchemy, Pandas, etc.

The Python API here is primarily for the operational aspects of the Druid REST API. It allows those of us who tinker with Druid to create notebooks to drive various Druid tasks. This API also has query features, including those not (yet) supported in PyDruid, such as support for MSQ, both for ingestion and for query. (MSQ query, at the moment, is a bit awkward, but it does work.)

A way to combine efforts would be to expand PyDruid to offer the non-query features in this API, as well as including (experimental) MSQ query support. Since PyDruid has historically been query-focused, there could be two "PyDruid+" modules: one for the classic query tasks, another for the other Druid REST API features. Combining the two APIs would certainly make life simpler for users: just one library to import rather than two.

How do you suggest we might proceed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants