Skip to content
This repository has been archived by the owner on Aug 5, 2023. It is now read-only.

Pypy support #10

Open
arthurdarcet opened this issue Apr 16, 2018 · 5 comments
Open

Pypy support #10

arthurdarcet opened this issue Apr 16, 2018 · 5 comments

Comments

@arthurdarcet
Copy link

This lib could be useful to manipulate large amounts of data ; and Pypy can help speeding thing up for those use-cases, but Pypy is marked compatible only with python 3.5, and cannot be installed with the python_requires=3.6 specified in the setup.py file.

I don't think anything in the code requires 3.6 other than the async generators. Would you consider a PR lowering the python_requires to 3.5 and add a dependency on async_generator to support 3.5 ?

@gusutabopb
Copy link
Owner

gusutabopb commented Apr 17, 2018

Thanks for raising the issue!

There seems to be some demand for Pypy, but I would like to understand a bit more why is that. Currently I am not super enthusiastic about supporting older Python versions or Pypy, but perhaps some discussion here can change mind.

As far as aioinflux is concerned, the most CPU-intensive part of the code is the serialization part (i.e., transforming a dictionary/dataframe to a line protocol string). I have been aware of that and have stated it in the README:

Beware that serialization is not highly optimized (cythonization PRs are welcome!) and may become a bottleneck depending on your application.

As a side note, please understand that the current pure Python line protocol serialization implementation is somewhat optimized, and according to my simple benchmarks, it is 7-8x faster than the official Python client (which is a pity and one of the many reasons I decided to write my own InfluxDB client library from scratch).

Serialization can likely be improved a bit (hopefully 2-3x) using Cython, or perhaps even more by writing a C extension from scratch (such as ciso8601).

So I would like some clarification on:

  1. Where does the the need to to use Pypy come from? Is it related to aioinflux? Is it because you depend on Pypy for some other part of your application (which needs aioinflux as a dependency)?
  2. If it comes from aioinflux, is it because of a serialization issue or some other part of the code?
  3. Would faster serialization using Cython/C extensions also be a solution to your current issue? Why or why not?
  4. How much faster do you expect the serialization code (or aioinflux as a whole) to run with Pypy?

My concerns are:

  1. Pypy support for Python 3 seems to be quite beta, and they have always lagged behind the reference CPython implementation. Supporting Pypy will likely mean we have to be stuck on an old Python version for years to come. I don't think it to be unlikely Pypy will still be on Python 3.5 when Python 3.9 comes out in 3-4 years time.
  2. Keeping decent unit tests/coverage for Pypy may be a big burden and being stuck in an old Python version may restrict the addition of new functionality.
  3. If I implement a Cython/C Extension improvement (which I plan to someday) we may have to keep separate serialization implementations: 1) a pure Python serialization implementation (for Pypy use) and 2) a Cython/C extension implementation for use in CPython use. Needless to say this might not be the nicest thing to maintain.

By the way, @miracle2k has made his own fork of aionflux, which supports trio/curio/asks and Python 3.5 (and I assume Pypy as well). It would also be nice to hear some opinions from other contributors: @steersbob, @carlos-jenkins.

@steersbob
Copy link
Contributor

Note: I have little to no practical experience with PyPy.

That said: This seems to be the primary influx + asyncio library. It may be desirable to focus on doing the core feature well, instead of supporting everything up to and including a mail client.

I may be available to write a C serializer, but that'd depend on how much effort it would be, and what the expected performance gain would be.

@arthurdarcet
Copy link
Author

Thanks for you detailed response! Here are some answers to your points

I am definitely not enthusiastic about supporting old versions of Python either, and I was not suggesting supporting anything other than pypy-3.
The Python3 flavor of the pypy releases are not beta anymore though - we have been using them in quite a few production projects for some time now - and they plan on moving the compatibility version to 3.6 in 2018 [1]
From what I saw in other projects, supporting Pypy is often nothing more than toggling "on" pypy in the CI matrix, even with C-extensions.

(Concerning the C-parser/serialiser, I don't have much experience with C extensions, but I think using CFFI works very well with both CPython and Pypy)

And lastly, regarding use-case: I understand that for the aioinflux part of the code, once the C parser/serialiser has been implemented, running on Pypy will not improve the performances at all. But we are using aioinflux to connect to a large InfluxDB, with some data processing that we would like to be able to do efficiently in our Python code. Using Pypy for this processing would definitely speed thing up.

I understand that supporting old versions is definitely not an objective here, but I think processing data on the Python side is a common use-case, and supporting Pypy probably won't impact the code much here.

Finally: I think we have three options here:

  • I open a pull request that adds async_generator as a dependency, lowers the python_requires to 3.5 and enables the CI on pypy (hopefully that will be enough to support pypy3)
  • You put this "on-hold" until pypy releases a version compatible with python 3.6 some time this year, and then everything should work out of the box
  • Or you can of course close this issue and decide pypy is out of scope

[1] https://morepypy.blogspot.fr/2017/12/pypy27-and-pypy35-v510-dual-release.html

@gusutabopb
Copy link
Owner

@arthurdarcet: I just released v0.3.0 so now perhaps is a good time for you to fork and make the necessary adaptations.

Besides async generators, aioinflux also makes heavy use of f-strings (which are 3.6+ only). They are quite useful for serialization and considerably faster than regular str.format formatting (I have no idea why, though). I have used them to speed up dataframe serialization and am now thinking of applying them back to regular mapping/dictionary parsing.

I think it'd be nice supporting Pypy, but as stated above I am a bit hesitant due to maintainability issues. From the three options you gave, I am a bit more inclined to option two, but feel free to make a fork and open a pull request (option one). We can take a look at it and try to merge it to the master branch. If you don't need Pandas support, dropping dataframe support for the Pypy version would also be an option.

@arthurdarcet
Copy link
Author

arthurdarcet commented Apr 24, 2018

f-strings are available in pypy3 since 5.8, so this won't be an issue. I'm not exactly sure how we can specify in the setup file that the project requires python 3.6 or pypy3 5.8... I'll look into that.

I probably won't have time to work on this in the next few days, but I'll open a PR when a get a chance

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants