Skip to content

vernetya/pysimdjson

 
 

Repository files navigation

PyPI - License Tests

pysimdjson

Python bindings for the simdjson project, a SIMD-accelerated JSON parser. If SIMD instructions are unavailable a fallback parser is used, making pysimdjson safe to use anywhere.

Bindings are currently tested on OS X, Linux, and Windows for Python version 3.5 to 3.9.

📝 Documentation

The latest documentation can be found at https://pysimdjson.tkte.ch.

If you've checked out the source code (for example to review a PR), you can build the latest documentation by running cd docs && make html.

🎉 Installation

If binary wheels are available for your platform, you can install from pip with no further requirements:

pip install pysimdjson

Binary wheels are available for the following:

py3.5 py3.6 py3.7 py3.8 py3.9 pypy3
OS X (x86_64) y y y y y y
Windows (x86_64) x x y y y x
Linux (x86_64) y y y y y x
Linux (ARM64) y y y y y x

If binary wheels are not available for your platform, you'll need a C++11-capable compiler to compile the sources:

pip install 'pysimdjson[dev]' --no-binary :all:

Both simdjson and pysimdjson support FreeBSD and Linux on ARM when built from source.

⚗ Development and Testing

This project comes with a full test suite. To install development and testing dependencies, use:

pip install -e ".[dev]"

To also install 3rd party JSON libraries used for running benchmarks, use:

pip install -e ".[benchmark]"

To run the tests, just type pytest. To also run the benchmarks, use pytest --runslow.

To properly test on Windows, you need both a recent version of Visual Studio (VS) as well as VS2015, patch 3. Older versions of CPython required portable C/C++ extensions to be built with the same version of VS as the interpreter. Use the Developer Command Prompt to easily switch between versions.

How It Works

This project uses pybind11 to generate the low-level bindings on top of the simdjson project. You can use it just like the built-in json module, or use the simdjson-specific API for much better performance.

import simdjson
doc = simdjson.loads('{"hello": "world"}')

🚀 Making things faster

pysimdjson provides an api compatible with the built-in json module for convenience, and this API is pretty fast (beating or tying all other Python JSON libraries). However, it also provides a simdjson-specific API that can perform significantly better.

Don't load the entire document

95% of the time spent loading a JSON document into Python is spent in the creation of Python objects, not the actual parsing of the document. You can avoid all of this overhead by ignoring parts of the document you don't want.

pysimdjson supports this in two ways - the use of JSON pointers via at(), or proxies for objects and lists.

import simdjson
parser = simdjson.Parser()
doc = parser.parse(b'{"res": [{"name": "first"}, {"name": "second"}]}')

For our sample above, we really just want the second entry in res, we don't care about anything else. We can do this two ways:

assert doc['res'][1]['name'] == 'second' # True
assert doc.at('res/1/name') == 'second' # True

Both of these approaches will be much faster than using load/s(), since they avoid loading the parts of the document we didn't care about.

Both Object and Array have a mini property that returns their entire content as a minified Python str. A message router for example would only parse the document and retrieve a single property, the destination, and forward the payload without ever turning it into a Python object. Here's a (bad) example:

import simdjson

@app.route('/store', methods=['POST'])
def store():
    parser = simdjson.Parser()
    doc = parser.parse(request.data)
    redis.set(doc['key'], doc.mini)

With this, doc could contain thousands of objects, but the only one loaded into a python object was key, and we even minified the content as we went.

Re-use the parser.

One of the easiest performance gains if you're working on many documents is to re-use the parser.

import simdjson
parser = simdjson.Parser()

for i in range(0, 100):
    doc = parser.parse(b'{"a": "b"}')

This will drastically reduce the number of allocations being made, as it will reuse the existing buffer when possible. If it's too small, it'll grow to fit.

📈 Benchmarks

pysimdjson compares well against most libraries for the default load/loads(), which creates full python objects immediately.

pysimdjson performs significantly better when only part of the document is of interest. For each test file we show the time taken to completely deserialize the document into Python objects, as well as the time to get the deepest key in each file. The second approach avoids all unnecessary object creation.

jsonexamples/canada.json deserialization

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{canada} 10.67130 22.89260 0.00465 60.30257
yyjson-{canada} 11.29230 29.90640 0.00568 53.27890
orjson-{canada} 11.90260 34.88260 0.00507 54.49605
ujson-{canada} 18.17060 48.99410 0.00718 36.24892
simplejson-{canada} 39.24630 52.62860 0.00483 21.81617
rapidjson-{canada} 41.04930 53.10800 0.00445 21.19078
json-{canada} 44.68320 59.44410 0.00440 19.71509

jsonexamples/canada.json deepest key

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{canada} 3.21360 6.88010 0.00044 285.83978
yyjson-{canada} 10.62770 46.10050 0.01000 43.29310
orjson-{canada} 12.54010 39.16080 0.00779 44.28928
ujson-{canada} 17.93980 35.44960 0.00697 36.78481
simplejson-{canada} 38.58160 54.33290 0.00699 21.37382
rapidjson-{canada} 40.69030 58.23460 0.00700 20.30349
json-{canada} 43.88300 65.04480 0.00722 18.55929

jsonexamples/twitter.json deserialization

Name Min (μs) Max (μs) StdDev Ops
orjson-{twitter} 2.36070 14.03050 0.00123 346.94307
✨ simdjson-{twitter} 2.41350 12.01550 0.00117 359.49272
yyjson-{twitter} 2.48130 12.03680 0.00112 353.03313
ujson-{twitter} 2.62890 11.39370 0.00090 346.87994
simplejson-{twitter} 3.34600 11.08840 0.00098 270.58797
json-{twitter} 3.35270 11.82610 0.00116 260.01943
rapidjson-{twitter} 4.29320 13.81980 0.00128 197.91107

jsonexamples/twitter.json deepest key

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{twitter} 0.33840 0.67200 0.00002 2800.32496
orjson-{twitter} 2.38460 13.53120 0.00131 352.70788
yyjson-{twitter} 2.48180 13.67470 0.00156 320.56731
ujson-{twitter} 2.65230 11.65150 0.00125 331.69430
json-{twitter} 3.34910 12.44890 0.00116 263.25854
simplejson-{twitter} 3.35760 15.61900 0.00137 262.36758
rapidjson-{twitter} 4.31870 12.77490 0.00119 201.86510

jsonexamples/github_events.json deserialization

Name Min (μs) Max (μs) StdDev Ops
orjson-{github_events} 0.18080 0.67020 0.00004 5041.29485
✨ simdjson-{github_events} 0.19470 0.61450 0.00003 4725.63489
yyjson-{github_events} 0.19710 0.53970 0.00004 4584.50870
ujson-{github_events} 0.23760 1.33490 0.00004 3904.08715
json-{github_events} 0.29030 1.32040 0.00009 3034.22530
simplejson-{github_events} 0.30210 0.82260 0.00005 3067.99997
rapidjson-{github_events} 0.33010 0.92400 0.00005 2793.93274

jsonexamples/github_events.json deepest key

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{github_events} 0.03630 0.66110 0.00001 25259.19598
orjson-{github_events} 0.18210 0.71230 0.00003 5073.48086
yyjson-{github_events} 0.20030 0.61270 0.00003 4589.71299
ujson-{github_events} 0.24260 1.05100 0.00007 3644.08240
json-{github_events} 0.29310 2.38770 0.00011 2967.79019
simplejson-{github_events} 0.30580 1.39670 0.00007 2931.01646
rapidjson-{github_events} 0.33340 0.80440 0.00004 2795.27887

jsonexamples/citm_catalog.json deserialization

Name Min (μs) Max (μs) StdDev Ops
orjson-{citm_catalog} 5.40140 17.76900 0.00314 130.33847
yyjson-{citm_catalog} 5.77340 23.09490 0.00421 113.78942
✨ simdjson-{citm_catalog} 6.00620 26.87570 0.00444 104.41073
ujson-{citm_catalog} 6.34300 25.06400 0.00473 96.01414
simplejson-{citm_catalog} 9.54910 23.96350 0.00392 78.99315
json-{citm_catalog} 10.21250 23.52610 0.00329 78.72180
rapidjson-{citm_catalog} 10.81700 21.85400 0.00343 73.94939

jsonexamples/citm_catalog.json deepest key

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{citm_catalog} 0.81040 2.11090 0.00015 1088.17698
orjson-{citm_catalog} 5.37260 18.37890 0.00451 120.86345
yyjson-{citm_catalog} 5.61430 23.18500 0.00548 110.29924
ujson-{citm_catalog} 6.25850 30.79090 0.00604 95.50805
simplejson-{citm_catalog} 9.36560 24.44860 0.00510 77.50571
json-{citm_catalog} 10.07650 25.29490 0.00450 76.18267
rapidjson-{citm_catalog} 10.69120 27.84880 0.00493 70.98005

jsonexamples/mesh.json deserialization

Name Min (μs) Max (μs) StdDev Ops
yyjson-{mesh} 2.33710 13.01130 0.00171 331.50569
✨ simdjson-{mesh} 2.52960 13.19230 0.00159 311.37935
orjson-{mesh} 2.88770 12.13010 0.00152 287.31080
ujson-{mesh} 3.64020 18.23620 0.00227 193.35645
json-{mesh} 5.97130 13.58290 0.00136 150.01621
rapidjson-{mesh} 7.54270 16.14480 0.00155 119.37806
simplejson-{mesh} 8.64370 16.35320 0.00136 106.25888

jsonexamples/mesh.json deepest key

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{mesh} 1.02020 2.74930 0.00013 919.93044
yyjson-{mesh} 2.30970 13.06730 0.00182 347.76076
orjson-{mesh} 2.85260 12.41860 0.00156 290.19432
ujson-{mesh} 3.59400 16.68610 0.00227 201.03704
json-{mesh} 5.96300 19.18900 0.00185 146.04645
rapidjson-{mesh} 7.43860 16.32260 0.00164 121.84979
simplejson-{mesh} 8.62160 21.89280 0.00221 101.30905

jsonexamples/gsoc-2018.json deserialization

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{gsoc-2018} 5.52590 16.27430 0.00178 145.59797
yyjson-{gsoc-2018} 5.62040 16.46250 0.00168 155.97459
orjson-{gsoc-2018} 5.78420 13.87300 0.00140 148.84293
simplejson-{gsoc-2018} 7.76200 15.26480 0.00142 114.98827
ujson-{gsoc-2018} 7.96570 21.53840 0.00188 110.29162
json-{gsoc-2018} 8.63300 19.26320 0.00172 102.78744
rapidjson-{gsoc-2018} 10.55570 19.20210 0.00159 85.84087

jsonexamples/gsoc-2018.json deepest key

Name Min (μs) Max (μs) StdDev Ops
✨ simdjson-{gsoc-2018} 1.56020 4.20200 0.00024 570.15046
yyjson-{gsoc-2018} 5.49930 14.89760 0.00158 161.14242
orjson-{gsoc-2018} 5.72650 15.88270 0.00160 153.18169
simplejson-{gsoc-2018} 7.70780 18.78120 0.00169 116.90299
ujson-{gsoc-2018} 7.91720 21.35300 0.00227 103.06755
json-{gsoc-2018} 8.65190 19.99580 0.00188 103.86934
rapidjson-{gsoc-2018} 10.52410 20.98870 0.00158 87.78973

About

Python bindings for the simdjson project.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%