JSON encoding refactor and orjson encoding #2955
Conversation
Later we can use this to configure base64 encoding
Cool :) Why not just swap this out right now and make it a hard dependency? Is there a specific risk or subdependency we don't want to pull in or something?
Looking at https://github.com/ijl/orjson#why-cant-i-install-it-from-pypi, we should verify that we can install this on <=4.0.1 versions of DE with older versions of pip.
I haven't looked into it deeply, but it has native components and isn't available in the main anaconda channel yet (it is on conda-forge though). If we did make it a hard dependency for 5.0, then it probably would get added to the main anaconda channel as well 🙂 So, adding it as a hard dependency carries the risk of breakage for some folks.
Yeah, it'll have the same problem we've hit elsewhere with older versions of pip.
Tests written and ready for review.
- `pio.json.to_plotly_json` -> `pio.json.to_json_plotly`
- `pio.json.from_plotly_json` -> `pio.json.from_json_plotly`
I'm not sure I understand why this change and what the implications are...?
Is this to avoid the legacy encoder from calling the new encoder while recursing through a structure that contains a figure?
This naming change was all internal to this PR. Before this PR, there was only the legacy encoder; I extracted the JSON encoding logic into `plotly/io/_json.py` as part of this refactor.
I think the plotlyjs_dev_build failure is due to the
since it's sometimes slower, and never faster, than the current encoder. Rename "legacy" to "json".
I updated the overview comment with a description of the correctness testing, and the benchmarking results, obtained using the #3012 branch on the plotly.py documentation examples.
Just wanted to add my opinion not to make orjson a hard dependency. orjson is written in Rust (apparently requiring a fairly recent Rust version to build), which is a barrier for running plotly.py on platforms that Rust has trouble compiling to.
This is not an objection to this pull request. In fact, the preferred solution is to get orjson working in Pyodide (since there is already a need for fast serialization to communicate between JS and Python).
Thanks for the feedback and perspective @jmsmdy. That all makes a lot of sense. This PR did end up making orjson optional, and that will be the case going forward.
Just to refresh my memory: with this PR in non-
Also note to self to re-think-about the comment in #2880 (comment) |
# Conflicts:
#   packages/python/plotly/plotly/io/_json.py
#   packages/python/plotly/tox.ini
Regarding the numpy floating point precision, and the fact that the legacy encoder promotes everything to 64-bit floats: this had always bugged me, as it resulted in much larger exports (i.e. html / ipynb file sizes) than necessary (when float16 or float32 is sufficient) and affected not only coordinate data, but also marker sizes, meta info, etc.

Just in case the plotly.py devs or others are interested: I had found a way to avoid this number inflation by modifying (& monkey patching) the `PlotlyJSONEncoder.encode_as_list` static method:

```python
@staticmethod
def encode_as_list_patch(obj):
    """Attempt to use `tolist` method to convert to normal Python list."""
    if hasattr(obj, "tolist"):
        numpy = get_module("numpy")
        try:
            if (
                isinstance(obj, numpy.ndarray)
                and (obj.dtype == numpy.float32 or obj.dtype == numpy.float16)
                and obj.flags.contiguous
            ):
                # Round-trip through str() to emit the shortest decimal
                # representation instead of the inflated 64-bit one
                return [float("%s" % x) for x in obj]
        except AttributeError:
            raise NotEncodable
        return obj.tolist()
    else:
        raise NotEncodable
```

It's about 30-50x slower than the plain `tolist` path. I always wanted to report this, and this PR revived the topic. Could this be relevant for a new issue (especially since orjson will not become the default)?

FYI: for reference, a quick search revealed that a patch of encode_as_list was already suggested before: #1842 (comment), in the context of treating inf & NaN, which got brought up again in #2880 (comment).
@mherrmann3 thanks! I've broken this out into a separate issue: #3232 |
I'm super keen to give this a go as I've got medium-sized data and am having performance issues :( It sounds like orjson could help. Assuming that gives a good speedup, is there a way to configure other JSON libraries as well?
Not right now, the logic around the JSON library needs to be customized a bit (e.g. for handling datetime formatting). That said, the refactoring that went into the orjson support, and the switchable JSON engines, would make it a lot easier to add support for additional JSON libraries in the future.
Overview
Initial implementation of the idea from #2944 of refactoring the JSON encoding pipeline and optionally performing JSON encoding with orjson.
orjson is impressively fast, and it includes built-in support for numpy arrays which is many times faster than the current approach of converting them to lists before encoding.
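A stdlib-only sketch of the non-finite value problem relevant here: Python's `json` module emits JavaScript-style `NaN`/`Infinity` tokens, which are not part of the JSON grammar (whereas orjson substitutes `null`, as noted below).

```python
import json

# The stdlib encoder emits bare NaN/Infinity tokens by default,
# even though they are not part of the JSON grammar
encoded = json.dumps([float("nan"), float("inf")])
print(encoded)  # [NaN, Infinity]

# Asking for spec compliance raises instead of substituting a value
try:
    json.dumps([float("nan")], allow_nan=False)
    strict_failed = False
except ValueError:
    strict_failed = True
print(strict_failed)  # True
```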
Also, orjson automatically converts all non-finite values to JSON `null` values, so we don't need workarounds like re-encoding as discussed in #2880.

JSON config object
To configure the JSON encoding engine, this PR adds a `plotly.io.json.config` object that mirrors `plotly.io.orca.config` and `plotly.io.kaleido.config`. Currently the only option is `default_engine`, which can be set to "json" for the current encoder based on `PlotlyJSONEncoder`, or "orjson", which is pretty much always much faster. The `to_json`/`write_json` functions also provide an `engine` argument to override the default.

To try it out, install orjson with pip or conda, then configure plotly to use it.
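A sketch of the install-and-configure steps described above (the conda command assumes the conda-forge channel mentioned earlier in the thread):

```python
# Install first, e.g.:
#   pip install orjson
# or:
#   conda install -c conda-forge orjson

import plotly.io as pio

# Make orjson the default engine for all JSON encoding
pio.json.config.default_engine = "orjson"

# Or override per call via the engine argument:
# pio.to_json(fig, engine="orjson")
```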
Quick timing example

Build a figure containing four one-million element numpy arrays, then time `pio.to_json(fig, engine="json")` against `pio.to_json(fig, engine="orjson")`.
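Plotly and orjson may not be installed in every environment, so as a rough stdlib-only stand-in, here is how the legacy-style encode (Python lists plus `json.dumps`) can be timed; the array names and sizes are made up to mirror the four one-million element arrays described here, and `pio.to_json` itself is not called.

```python
import json
import time

# Stand-in data: four one-million element numeric arrays
# (the names x1/y1/x2/y2 are hypothetical, chosen for this sketch)
data = {
    name: [i * 0.5 for i in range(1_000_000)]
    for name in ("x1", "y1", "x2", "y2")
}

start = time.perf_counter()
encoded = json.dumps(data)
elapsed = time.perf_counter() - start

print(f"encoded {len(encoded):,} characters in {elapsed:.3f}s")
```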
In this case of encoding a large figure with four one-million element arrays, the orjson encoding is 12x faster on my machine!
Relationship to base64 encoding
This approach is fully compatible with the base64 encoding work in #2943. I think we should focus on this approach first because there are substantial performance gains to be had without changing the schema of the resulting JSON and (hopefully) without requiring changes in Plotly.js.
After this is merged, we can add base64 encoding on top of it for additional performance improvements.
Correctness testing
In addition to adding a new test suite, this PR has been tested against all of the documentation figures using the slightly modified instrumentation branch at #3012. This branch executes all JSON encoding requests using both the `json` and `orjson` encoders, and checks that the encoded strings are identical.

There are two documentation examples that fail this test: `imshow` and `ml-tsne-umap-projections`. Both fail due to differences in how the orjson encoding handles floating-point numbers with less than 64-bit precision (see next section).

Numpy floating point precision
The "json" encoder handles numpy arrays by first converting them to lists, and then encoding the lists. All numpy floating point types are converted to 64-bit python float values. For floating point numpy arrays with less than 64 bit precision, converting them to 64 bits before encoding artificially increases the precision of the values in the array, but this is really the only option.
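To make the inflation concrete, here is a small sketch (assuming numpy is available): a float16 value near 0.1 becomes a 64-bit Python float under `tolist`, and the stdlib encoder then prints all of its digits.

```python
import json

import numpy as np

arr = np.array([0.1], dtype=np.float16)

# tolist() promotes every element to a 64-bit Python float
as_list = arr.tolist()
print(type(as_list[0]))  # <class 'float'>

# The nearest float16 to 0.1 is then printed with full 64-bit precision
print(json.dumps(as_list))  # [0.0999755859375]
```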
The orjson encoder accepts numpy arrays directly, and it will output the appropriate number of decimal places for the precision of the input array.
So the encoded JSON values produced by the legacy `json` encoder and the `orjson` encoder will not agree when encoding floating point numpy arrays with less than 64-bit precision. This is the only known discrepancy.

Benchmarking
To compare the performance of the `orjson` encoder against the legacy `json` encoder, #3012 records the encoding time for both encoders and writes the results to a file. Here are plots of the relative timing results across all of the figures in the plotly.py documentation. Note that `length` here is the number of characters in the encoded JSON string.
The `orjson` encoder is almost always faster (up to 40x in one case). The handful of cases with equivalent or slower performance are those that include values not natively supported by orjson (e.g. `pd.Timestamp` or `PIL.Image` objects) and that don't have sizable numpy arrays. All of the cases where it's slower run in less than half a millisecond.
My conclusion is that defaulting to `orjson` when the package is installed is a safe default that will almost always improve performance.