JSON encoding refactor and orjson encoding #2955

Merged on May 27, 2021 (49 commits)

Contributor

@jonmmease jonmmease commented Dec 5, 2020

Overview

Initial implementation of the idea from #2944 of refactoring the JSON encoding pipeline and optionally performing JSON encoding with orjson.

orjson is impressively fast, and it includes built-in support for numpy arrays which is many times faster than the current approach of converting them to lists before encoding.

Also, orjson automatically converts all non-finite values to JSON null values, so we don't need workarounds like re-encoding as discussed in #2880.
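To illustrate the non-finite issue: the standard-library encoder writes NaN and Infinity as bare tokens, which strict JSON parsers reject, whereas orjson emits null. The stdlib-only sketch below shows the difference. `replace_non_finite` is a hypothetical helper written for this illustration; it is not what plotly.py or orjson does internally.

```python
import json
import math


def replace_non_finite(value):
    """Recursively replace NaN/Infinity floats with None (hypothetical helper)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(v) for v in value]
    if isinstance(value, dict):
        return {k: replace_non_finite(v) for k, v in value.items()}
    return value


data = {"y": [1.0, float("nan"), float("inf")]}

# The stdlib encoder emits bare NaN/Infinity tokens, which are not valid JSON.
raw = json.dumps(data)

# After replacement, the non-finite values are written as null.
clean = json.dumps(replace_non_finite(data))
```

An encoder that maps non-finite values to null itself makes this extra pass over the data unnecessary.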

JSON config object

To configure the JSON encoding engine, this PR adds a plotly.io.json.config object that mirrors plotly.io.orca.config and plotly.io.kaleido.config. Currently the only option is default_engine, which can be set to "json" for the current encoder based on PlotlyJSONEncoder, or to "orjson", which is almost always much faster.

The to_json and write_json functions also provide an engine argument to override the default.
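The engine selection described here might look roughly like the following. This is a simplified sketch, not plotly.py's implementation: `JsonConfig`, `config`, `resolve_engine`, `encode`, and the "auto" setting are hypothetical names chosen for illustration. Only `json.dumps` and `orjson.dumps` are real library calls.

```python
import json

try:
    import orjson  # optional third-party dependency
except ImportError:
    orjson = None


class JsonConfig:
    """Holds the default engine (hypothetical stand-in for a module-level config)."""

    valid_engines = ("json", "orjson", "auto")

    def __init__(self):
        self._default_engine = "auto"

    @property
    def default_engine(self):
        return self._default_engine

    @default_engine.setter
    def default_engine(self, value):
        if value not in self.valid_engines:
            raise ValueError(f"Invalid JSON engine: {value!r}")
        self._default_engine = value


config = JsonConfig()


def resolve_engine(engine=None):
    """Pick the engine: an explicit argument wins over the configured default."""
    engine = config.default_engine if engine is None else engine
    if engine not in JsonConfig.valid_engines:
        raise ValueError(f"Invalid JSON engine: {engine!r}")
    if engine == "auto":
        # Assumed behavior: prefer orjson only when it is importable.
        return "orjson" if orjson is not None else "json"
    if engine == "orjson" and orjson is None:
        raise ValueError("The orjson engine requires the orjson package")
    return engine


def encode(obj, engine=None):
    """Encode obj to a JSON string with the resolved engine."""
    if resolve_engine(engine) == "orjson":
        return orjson.dumps(obj).decode("utf-8")  # orjson.dumps returns bytes
    return json.dumps(obj, separators=(",", ":"))
```

Keeping orjson behind a guarded import means the package still works on platforms where orjson cannot be installed.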

To try it out, install orjson with pip

$ pip install orjson

or conda

$ conda install -c conda-forge orjson

Then configure plotly to use it with

import plotly.io as pio
import numpy as np
import plotly.graph_objects as go

pio.json.config.default_engine = "orjson"

Quick timing example

Then time the encoding speed

N = 1000000
dtype = "float32"
x = np.random.randn(N).astype(dtype)
y = np.random.randn(N).astype(dtype)
size = np.random.rand(N).astype(dtype) * 10
opacity = np.random.rand(N).astype(dtype)
fig = go.Figure(data=[go.Scatter(x=x, y=y, marker_size=size, marker_opacity=opacity)])
%%timeit
res1 = pio.to_json(fig, engine="json")
2.06 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
res1 = pio.to_json(fig, engine="orjson")
169 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In this case, encoding a large figure with four one-million-element arrays, the orjson encoding is 12x faster on my machine!

Relationship to base64 encoding

This approach is fully compatible with the base64 encoding work in #2943. I think we should focus on this approach first because there are substantial performance gains to be had without changing the schema of the resulting JSON and (hopefully) without requiring changes in Plotly.js.

After this is merged, we can add base64 encoding on top of it for additional performance improvements.

Correctness testing

In addition to adding a new test suite, this PR has been tested against all of the documentation figures using the slightly modified instrumentation branch at #3012. This branch executes all JSON encoding requests using both the json and orjson encoders, and checks that the encoded strings are identical.

There are two documentation examples that fail this test: imshow and ml-tsne-umap-projections. Both of these fail due to differences in how the orjson encoding handles floating-point numbers with precision less than 64 bits (see next section).
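The dual-encoder check can be sketched as a small harness that runs every encoder on the same object, compares the strings, and records the time taken. This is an illustration only: `compare_encoders` is a hypothetical name, and the two stand-in encoders below are both based on the standard library rather than the json and orjson engines from the PR.

```python
import json
import time


def compare_encoders(obj, encoders):
    """Run every encoder on obj; report timings and whether the outputs match."""
    outputs, timings = {}, {}
    for name, encoder in encoders.items():
        start = time.perf_counter()
        outputs[name] = encoder(obj)
        timings[name] = time.perf_counter() - start
    return {
        "identical": len(set(outputs.values())) == 1,
        "length": len(next(iter(outputs.values()))),  # characters in the first output
        "timings": timings,
    }


# Stand-in encoders that should agree with each other.
encoders = {
    "baseline": lambda o: json.dumps(o, separators=(",", ":")),
    "candidate": lambda o: json.dumps(o, separators=(",", ":"), sort_keys=False),
}

report = compare_encoders({"x": [1, 2, 3], "name": "trace"}, encoders)
```

A mismatch shows up as `report["identical"]` being false, which is how a discrepancy such as the float precision difference would be flagged.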

Numpy floating point precision

The "json" encoder handles numpy arrays by first converting them to lists, and then encoding the lists. All numpy floating point types are converted to 64-bit Python float values. For floating point numpy arrays with less than 64-bit precision, converting them to 64 bits before encoding artificially increases the precision of the values in the array, but this is really the only option with that encoder.

The orjson encoder accepts numpy arrays directly, and it outputs the appropriate number of decimal places for the precision of the input array.

So the JSON values produced by the legacy json encoder and the orjson encoder will not agree when encoding floating point numpy arrays with less than 64-bit precision. This is the only known discrepancy.
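The precision effect can be reproduced without numpy. The sketch below uses the standard-library `struct` module to round a value to 32 bits and then compares the text a float64-based encoder would write with the shortest text that still identifies the same 32-bit value. `as_float32` and `shortest_float32_repr` are hypothetical helpers for this illustration, not code from either encoder.

```python
import struct


def as_float32(x):
    """Round x to the nearest float32 and return it widened to a Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def shortest_float32_repr(x):
    """Shortest decimal string that round-trips to the same float32 bits."""
    target = struct.pack("f", x)
    for digits in range(1, 10):  # 9 significant digits always suffice for float32
        text = "%.*g" % (digits, x)
        if struct.pack("f", float(text)) == target:
            return text
    return repr(x)


widened = as_float32(0.1)

# Text written when the value is first converted to a 64-bit float.
as_float64_text = repr(widened)  # '0.10000000149011612'

# Text that respects the original 32-bit precision.
as_float32_text = shortest_float32_repr(widened)  # '0.1'
```

Both strings describe the same 32-bit value, but the second is far shorter, which is why the two encoders can legitimately disagree character for character.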

Benchmarking

To compare the performance of the orjson encoder against the legacy json, #3012 records the encoding time for both encoders and writes the results to a file. Here are plots of the relative timing results across all of the figures in the plotly.py documentation

Note that `length` here is the number of characters in the encoded JSON string.

[Two plots of the relative timing results; images not preserved]

The orjson encoder is almost always faster (up to 40x in one case). The handful of cases that have equivalent or slower performance are cases that include values that are not natively supported by orjson (e.g. pd.Timestamp or PIL.Image objects), and that don't have sizable numpy arrays.

All of the cases where it's slower run in less than half a millisecond.

My conclusion is that defaulting to orjson when the package is installed is a safe default that will almost always improve performance.

TODO

  • The current tests are mostly passing with the new encoder, but before merging I want to add more tests specifically around JSON encoding, especially with dates and datetimes, and make sure that all of the encoding engines are tested.

@jonmmease jonmmease marked this pull request as draft December 5, 2020 20:24
@nicolaskruchten
Contributor

Cool :)

Why not just swap this out right now and make it a hard dependency? Is there a specific risk or subdependency we don't want to pull in or something?

@chriddyp
Member

chriddyp commented Dec 7, 2020

Looking at https://github.com/ijl/orjson#why-cant-i-install-it-from-pypi, we should verify that we can install this on <=4.0.1 versions of DE with older pip

@jonmmease
Contributor Author

Why not just swap this out right now and make it a hard dependency? Is there a specific risk or subdependency we don't want to pull in or something?

I haven't looked into it deeply, but it has native components and isn't available in the main anaconda channel yet (it is on conda-forge though). If we did make it a hard dependency for 5.0, then it probably would get added to the main anaconda channel as well 🙂

So, adding it as a hard dependency carries the risk of breakage for some folks.

we should verify that we can install this on <=4.0.1 versions of DE with older pip

Yeah, it'll have the same problem we've hit elsewhere with older versions of pip.

@jonmmease jonmmease marked this pull request as ready for review December 31, 2020 17:49
@jonmmease
Contributor Author

Tests written and ready for review

    - pio.json.to_plotly_json -> pio.json.to_json_plotly
    - pio.json.from_plotly_json -> pio.json.from_json_plotly
@nicolaskruchten
Contributor

I'm not sure I understand why this change and what the implications are...?

Renamed pio.json.to_plotly_json to pio.json.to_json_plotly because we, unfortunately, use fig.to_plotly_json as a method that converts a graph_object (or Dash component) to a dictionary.

Is this to prevent the legacy encoder from calling the new encoder while recursing through a structure that contains a figure?

@jonmmease
Contributor Author

Renamed pio.json.to_plotly_json to pio.json.to_json_plotly because we, unfortunately, use fig.to_plotly_json as a method that converts a graph_object (or Dash component) to a dictionary.

This naming change was all internal to this PR. Before this PR, there was only to_json which both did figure validation and performed JSON encoding.

I extracted the JSON encoding into to_json_plotly, which is now called by to_json. to_json_plotly doesn't perform figure validation, and is what Dash will use for JSON encoding.

@jonmmease
Contributor Author

I think the plotlyjs_dev_build failure is due to the role removal on plotly.js master

since it's sometimes slower, and never faster, than current encoder.

Rename "legacy" to "json".
@jonmmease
Contributor Author

I updated the overview comment with a description of the correctness testing and the benchmarking results, both obtained using the #3012 branch on the plotly.py documentation examples.

@jmsmdy
Contributor

jmsmdy commented Mar 16, 2021

Why not just swap this out right now and make it a hard dependency? Is there a specific risk or subdependency we don't want to pull in or something?

Just wanted to add my opinion that orjson should not be made a hard dependency. orjson is written in Rust (apparently depending on some recent version to build), which is a barrier for running plotly.py on platforms that Rust has trouble compiling to.

  • There were recently issues with orjson on the new Apple silicon (see: Apple Silicon Binaries ijl/orjson#155). These have been resolved, but these issues would have held up people trying to use plotly on the new Macs had orjson been a hard dependency.

  • orjson does not compile to WebAssembly. After the "retrying" dependency is replaced by "tenacity" (when this pull request is merged: Replaced 'retrying' dependency with 'tenacity' in plotly package #2911), all (hard) dependencies of plotly.py will be available as universal wheels on PyPI, which enables plotly.py to run on Pyodide (a version of CPython compiled to run in WebAssembly). This would break if orjson were made a hard dependency.

This is not an objection to this pull request. In fact, the preferred solution is to get orjson working in Pyodide (since there is already a need for fast serialization to communicate between JS and Python).

@jonmmease
Contributor Author

Thanks for the feedback and perspective @jmsmdy. That all makes a lot of sense. This PR did end up making orjson optional, and that will be the case going forward.

@nicolaskruchten
Contributor

Just to refresh my memory: with this PR in non-orjson mode, do we still do the trick from #2880 that got us a nice performance boost in some cases?

@nicolaskruchten
Contributor

Also note to self to re-think-about the comment in #2880 (comment)

# Conflicts:
#	packages/python/plotly/plotly/io/_json.py
#	packages/python/plotly/tox.ini
@nicolaskruchten nicolaskruchten merged commit 5301dcb into master May 27, 2021
@mherrmann3

Regarding the numpy floating point precision and that PlotlyJSONEncoder always casts those to float64 due to using tolist()...

This had always bugged me, as it resulted in much larger exports (i.e. html / ipynb file sizes) than necessary (when float16 or float32 is sufficient) and affected not only coordinate data, but also marker sizes, meta info, etc.

Just in case the plotly.py devs or others are interested: I had found a way to avoid this number inflation by modifying (& monkey patching) the encode_as_list method:

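# Imports from plotly's internal utilities package.
from _plotly_utils.optional_imports import get_module
from _plotly_utils.utils import NotEncodable

@staticmethod
def encode_as_list_patch(obj):
    """Attempt to use `tolist` method to convert to normal Python list."""
    if hasattr(obj, "tolist"):
        numpy = get_module("numpy")
        try:
            # Group the dtype test explicitly: `and` binds tighter than `or`,
            # so the unparenthesized form skipped the ndarray check for float16.
            if (
                isinstance(obj, numpy.ndarray)
                and obj.dtype in (numpy.float32, numpy.float16)
                and obj.flags.contiguous
            ):
                # Formatting each element keeps its own (shorter) precision;
                # this assumes a 1-D array, as in the original snippet.
                return [float('%s' % x) for x in obj]
        except AttributeError:
            raise NotEncodable

        return obj.tolist()
    else:
        raise NotEncodable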

It's about 30-50x slower than .tolist(), but, being on the order of a few μs, still much faster than the JSON encoding, with the benefit of ~3x smaller exports.

I always wanted to report this, and this PR revived the topic. Could this be relevant for a new issue (especially since orjson will not become the default)?

FYI: for reference, a quick search revealed that a patch of encode_as_list was already suggested before: #1842 (comment), in the context of treating inf & NaN, which got brought up again in #2880 (comment).

@nicolaskruchten
Contributor

@mherrmann3 thanks! I've broken this out into a separate issue: #3232

@dhirschfeld

I'm super keen to give this a go as I've got medium sized data and am having performance issues :(

It sounds like orjson will be used automatically if installed.

Assuming that gives a good speedup, is there a way to configure plotly to automatically use a different json parser - e.g. pysimdjson?

@jonmmease
Contributor Author

is there a way to configure plotly to automatically use a different json parser

Not right now, the logic around the JSON library needs to be customized a bit (e.g. for handling datetime formatting). That said, the refactoring that went into the orjson support, and the switchable JSON engines, would make it a lot easier to add support for additional JSON libraries in the future.
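As a rough illustration of the kind of per-library customization mentioned here, the sketch below uses the `default` hook of the standard-library encoder to handle dates. The ISO 8601 output format and the name `encode_extra_types` are assumptions for this example; plotly.py's actual datetime formatting may differ.

```python
import datetime
import json


def encode_extra_types(value):
    """Fallback for objects the JSON library cannot encode natively."""
    # datetime.datetime is a subclass of datetime.date, so one check covers both.
    if isinstance(value, datetime.date):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


payload = {"x": [datetime.datetime(2021, 5, 27, 12, 30), datetime.date(2020, 12, 5)]}
encoded = json.dumps(payload, default=encode_extra_types)
```

Any additional JSON library would need an equivalent hook, or a preprocessing pass, so that every engine writes such values the same way.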
