[Python] (scalar) vector reading speedup via numpy #4390

kbrose · 2017-07-20T19:51:44Z

Reading vectors with generated Python code is slow. There have been other efforts to speed this up but they appear to have stagnated.

This PR attempts to make a minimal change that covers a use case I personally hit a lot (copying over large vectors that represent nested flatbuffers). There may be other small changes that could drastically speed up other use cases that are not included here since I am unaware of them :)

What's the change?

This PR adds support for accessing a scalar vector as a numpy array of the corresponding type. This is much faster than copying large vectors element-by-element.

Update the Table class in Python to have a GetVectorAsNumpy method which returns a zero-copy view (in numpy terminology) into a scalar vector cast as the correct type.
Update the python-code-generation code (idl_gen_python.cpp) to generate a method which wraps GetVectorAsNumpy in generated code. This method is named <field name>AsNumpy, which can be compared to the generated name of the method to get the length of a vector, <field name>Length. Attempting to use this method if numpy is not installed will result in an error, but otherwise numpy is optional.
Update appveyor CI to run Python tests.
Update python docs appropriately.
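
To illustrate the zero-copy idea, here is a minimal sketch; it is not the code in this PR. It assumes a simplified buffer holding a single vector (a 4-byte little-endian length followed by uint16 elements), and the helper names read_elementwise and read_as_numpy are hypothetical. It contrasts element-by-element reads with a numpy view over the same bytes.

```python
import struct

import numpy as np

# Hypothetical stand-in for a FlatBuffers buffer: a 4-byte little-endian
# length prefix followed by that many uint16 elements.
values = [1, 2, 3, 4, 5]
buf = bytearray(struct.pack('<I', len(values)) + struct.pack('<5H', *values))


def read_elementwise(buf, offset):
    """Read the vector one element at a time (one unpack call per element)."""
    (length,) = struct.unpack_from('<I', buf, offset)
    start = offset + 4
    return [struct.unpack_from('<H', buf, start + 2 * i)[0] for i in range(length)]


def read_as_numpy(buf, offset):
    """Return a numpy view over the same bytes; no element data is copied."""
    (length,) = struct.unpack_from('<I', buf, offset)
    return np.frombuffer(buf, dtype=np.dtype('<u2'), count=length, offset=offset + 4)


slow = read_elementwise(buf, 0)
fast = read_as_numpy(buf, 0)
```

Because fast is a view rather than a copy, writing to buf afterwards is visible through fast as well.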

See also

Byte data from flatbuffer vector into a Python NumPy array #4090
- This PR can probably close that issue?
Reading and Writing Binary blobs is incredibly slow with the default API! #4144
- This PR can probably close that issue?
Python: Add numpy array accessors for vectors [WIP] #4152
- The PR here and that PR seem to accomplish similar things, but I haven't read enough details to say whether or not the PR here completely supersedes that one.
Python: Support Cython build #284
Python: Speedup with cython extension #304 (closed by author due to lack of time)

aardappel · 2017-07-21T15:50:53Z

I agree numpy support would be nice. I do think it should be optional though, since not every Python install comes with it. You could add a --numpy flag if needed.

kbrose · 2017-07-24T16:00:17Z

@aardappel It could probably be made optional, although it's not as easy as just adding a flag to the generator since I've put import numpy statements in the flatbuffers python package as well. The import statement could be moved into the method so that the ImportError: No module named numpy would be delayed until the user tried to call the obviously numpy-dependent method <vector name>AsNumpy(), but that would have performance implications as well. I'll think on it more.

aardappel · 2017-07-24T16:36:40Z

@kbrose can numpy support in the library code be factored into its own file?

kbrose · 2017-07-24T16:45:10Z

Yes, that could be done. It would reduce readability a little as related methods wouldn't be grouped together as much. Another possibility is to find if numpy exists using the imp module and use that to drive the logic.
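
A rough sketch of the two options, not what the PR ends up doing: detect numpy up front without importing it, or defer the import into the numpy-dependent method. It uses importlib.util.find_spec as a stand-in for the imp module mentioned above, and the names HAVE_NUMPY and inventory_as_numpy are hypothetical.

```python
import importlib.util

# True when numpy is installed. find_spec only locates the module,
# it does not import it.
HAVE_NUMPY = importlib.util.find_spec('numpy') is not None


def inventory_as_numpy(buf):
    """Hypothetical accessor with the numpy import deferred into the call."""
    # Raises ImportError here, at call time, if numpy is missing.
    import numpy as np
    return np.frombuffer(buf, dtype='<u1')
```

With the deferred import, code that never calls the accessor never needs numpy, at the cost of an import lookup on every call.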

kbrose · 2017-07-24T16:52:19Z

Also, it looks like python tests aren't being run in the CI, or maybe I'm just not understanding things correctly. Do you have any insight into that?

aardappel · 2017-07-25T15:45:18Z

tests/py_test.py

@@ -130,6 +130,8 @@ def asserter(stmt):
        invsum += int(v)
    asserter(invsum == 10)

+    asserter(monster.InventoryAsNumpy().sum() == 10)


maybe skip these tests if numpy is not present?

Yep - I just force-pushed a change that should assert the correct error is raised if numpy does not exist, and I run the python tests twice (once w/o numpy, and once after numpy is installed).

aardappel · 2017-07-25T15:47:54Z

python/flatbuffers/number_types.py

+    if np is not None:
+        return np.dtype(number_type.name).newbyteorder('<')
+    else:
+        raise RuntimeError(('Numpy could not be imported'


maybe stick the error in a function?

Sorry, I don't understand. Do you want error in a function to reduce duplicated code in number_types.py and encode.py?

yes that was the idea.. though if they're in separate files that may not be as easy?

Certainly not impossible, I could add it to the compat.py file and import it in both places. I'll go ahead and do that.

I updated to have a custom exception NumpyRequiredForThisFeature that is defined in compat.py, but I decided I didn't actually like putting a function there that throws the error. It's only repeated twice, and would make debugging slightly harder (one extra layer of indirection in the stack trace) if the error was raised.
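
Roughly, the shape described is one shared exception class raised inline at each call site. The sketch below is illustrative only: the RuntimeError base class, the message text, the np parameter, and the function names to_numpy_type and vector_as_numpy are assumptions, not the contents of compat.py, number_types.py, or encode.py.

```python
class NumpyRequiredForThisFeature(RuntimeError):
    """Raised when a numpy-only feature is used but numpy is unavailable."""


def to_numpy_type(type_name, np=None):
    # np is the imported numpy module, or None if the import failed.
    if np is not None:
        return np.dtype(type_name).newbyteorder('<')
    # Raised inline, so the traceback points directly at this call site.
    raise NumpyRequiredForThisFeature('Numpy was not found.')


def vector_as_numpy(buf, type_name, np=None):
    if np is not None:
        return np.frombuffer(buf, dtype=to_numpy_type(type_name, np))
    # The same two lines repeated, instead of a shared helper that raises.
    raise NumpyRequiredForThisFeature('Numpy was not found.')
```

Sharing only the exception class keeps the error type consistent across files while avoiding the extra stack frame a raising helper would add.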

aardappel · 2017-07-25T15:48:04Z

appveyor.yml

  - "java -version"
  - "JavaTest.bat"
  - rem "---------------- JS -----------------"
  - "node --version"
  - "..\\%CONFIGURATION%\\flatc -b -I include_test monster_test.fbs unicode_test.json"
  - "node JavaScriptTest ./monster_test_generated"
+  - rem "-------------- Python ---------------"


aardappel · 2017-07-25T15:49:36Z

appveyor.yml

@@ -7,7 +7,9 @@ os: Visual Studio 2015
 environment:
  matrix:
    - CMAKE_VS_VERSION: "10 2010"
+      TEST_ALL: "no"


what does this do?

This makes appveyor stop running as soon as it encounters an error. Useful as I'm developing to make appveyor churn through failed tests faster. I can remove this before merging if you want.

Whoops nevermind it does nothing :P I'll remove it.

aardappel · 2017-07-26T15:09:49Z

@rw can you review before this gets merged?

kbrose · 2017-07-26T15:25:35Z

I'm rebasing this into something more manageable right now

mikeholler · 2017-08-01T18:00:42Z

@aardappel how often do you publish new versions to pypi? Our engineering team here is interested in using this functionality but would prefer to do so on an official release. We'd love to know what the timeline to the next release is.

rw · 2017-08-02T10:49:52Z

@mikeholler We need an automated way to create a PyPI release. I haven't figured out how to set that up yet. If you or someone you know would like to help us with that, we'd appreciate it!

kbrose · 2017-08-02T13:08:53Z

Relevant: https://docs.travis-ci.com/user/deployment/pypi/

aardappel · 2017-08-02T15:09:56Z

@rw for the moment, can we push 1.7 to it so it is more up to date? Or does @mikeholler specifically want to use this PR? That would be in a 1.8 whenever it comes.

mikeholler · 2017-08-02T15:12:01Z

@aardappel we'd love to use the code provided by this PR specifically. What about it prevents it from being included in 1.7 instead of 1.8?

aardappel · 2017-08-02T15:27:41Z

1.7 is already out, it's the current official release.

aardappel · 2017-08-24T17:56:13Z

@kbrose while the numpy based tests help make sure this functionality works, they have now made our tests in general flaky, see the http timeout on this test run: https://ci.appveyor.com/project/gwvo/flatbuffers/build/1.0.982

It's pulling in a LOT of dependencies, including things like openssl which I don't think we need. Can these tests be made less heavy by requiring only numpy? If not, can we make numpy tests optional (with an option to PythonTest.sh or whatever?)

aardappel · 2017-08-24T17:56:55Z

@rw may also have some ideas.

kbrose · 2017-08-24T19:16:50Z

@aardappel Sorry to see that happen. It doesn't feel good when the CI fails and it's the fault of the tests and not the package code. Unfortunately, I don't see a good way to test the numpy methods without pulling numpy from the internet, and relying on the internet during tests will necessarily make them less deterministic. Considering the python code was not being tested by the CI servers at all prior to this PR, you may be comfortable with not testing the numpy functions during CI tests until more robust testing is implemented.

I'll summarize the paths forward I see:

Do not test the numpy methods in the CI (remove lines https://github.com/google/flatbuffers/blob/master/appveyor.yml#L65-L67)
Make the numpy testing a separate appveyor run, and mark those runs as allowed failures.
Reduce the number of dependencies installed by running conda install numpy --yes --no-deps, but beware that there's usually a reason the dependencies are installed...

aardappel · 2017-08-24T19:20:20Z

I'm fine with disabling it for now until we can do better. Can you make a PR for that?

We can try the --no-deps and see what happens. I still don't understand why it has those dependencies though, surely numpy itself does not rely on openssl etc.

kbrose · 2017-08-24T19:26:36Z

I would guess conda says that numpy depends on MKL because that usually speeds things up significantly, and MKL may have been looser with their deps than numpy. But that's just conjecture.

I can try and create a PR when I get home in the evening, but it is equivalent to just deleting (or commenting) those three lines in appveyor.yml I highlighted above.

aardappel · 2017-08-24T19:28:22Z

Ok, I can fix that.

aardappel · 2017-08-25T18:09:39Z

disabled numpy tests for now: 1f0bd12

kbrose · 2017-09-12T20:49:01Z

Hello @aardappel I hate to be that person pestering the over-worked OSS maintainer, but do you have an idea of when the next release of flatbuffers will be? I'd like to be able to target an actual release number when managing dependencies but also want to use this feature.

aardappel · 2017-09-13T00:35:17Z

There's no schedule for that.. currently it seems to happen once every 6 months or so, or whenever a significant chunk of new functionality has accumulated.

I'd like to help you out by making releases whenever people ask for them, but that's difficult, as different people need different features.

The releases really aren't a lot more stable than regular commits, if you need this feature I'd really recommend to just use FlatBuffers at this commit.

kbrose · 2017-11-22T22:17:09Z

Greetings! Thank you so much @aardappel for releasing the new version and congrats on 1.8!

@rw Any way we could publish the new version of the python code to pypi? I can try to help out with any pain points if you let me know what issues exist with that publication process.

rw · 2017-11-23T04:18:39Z

@kbrose Thanks for your offer, we'll take you up on it. Our pain point is that shipping a new pypi package is a 100% manual process right now. Do you know of any good ways to partially or totally automate that?

ahundt · 2017-11-23T18:07:37Z

Looks like travis might be able to do that:
https://docs.travis-ci.com/user/deployment/pypi/
http://www.robinandeer.com/blog/2016/09/01/automated-pypi-releases/

mikeholler · 2017-11-23T18:35:28Z

Yeah, I was looking into a Travis solution. The one thing I'm having a hard time with is figuring out how Travis knows where to find the setup.py file or the sdist tar.gz. All the projects I've seen have setup.py in the directory root, and I don't see a way to configure Travis to look in a subdirectory like this project has. I'll keep looking into it, but if anyone has guidance on doing this that'd be very appreciated.

ahundt · 2017-11-25T05:42:28Z

@mikeholler perhaps @gwvo (@aardappel) is willing to simply move setup.py to the root location?

rw · 2017-11-25T08:57:27Z

Ideally we would not put setup.py into the root, because the semantics don't make sense: we only want the contents of the python directory to be part of the python package.

We could have a separate tracking repo for this purpose.

mikeholler · 2017-11-25T15:23:20Z

Yeah, I think keeping setup.py out of the root is a good idea (although I do find it odd that the Java pom is in the root). I'm not so sure about having a separate repo. I could see a few disadvantages that could (for example) make testing harder. I still have hope for a solution that uses Travis but doesn't change the structure of this repo much. There has got to be a way to support projects with setup.py not in the root. If anyone finds something let me know. I fiddled with Travis for about an hour on Wednesday and I should be able to find more time soon.

mikeholler · 2017-11-25T16:33:47Z

I provided more info on the subdirectory issue in #4507. We might consider moving further discussion there for PyPI publication.

ahundt · 2017-11-29T02:17:59Z

FYI I've seen other big projects that support multiple languages and use PyPI put setup.py in the root directory without issue. One example I can think of off the top of my head is https://github.com/bulletphysics/bullet3 and its pypi package pybullet. I don't have strong feelings, just wanted to give an example that works.

Amit072 · 2018-04-26T11:54:06Z

Is there an equivalent API for writing large byte arrays into the builder without using PrependByte()?

kbrose force-pushed the python-vector-speedup branch from ad2d1fd to 2ad5449 Compare July 20, 2017 20:38

kbrose force-pushed the python-vector-speedup branch 12 times, most recently from 918ccc7 to 2faaafa Compare July 24, 2017 23:29

aardappel reviewed Jul 25, 2017

kbrose force-pushed the python-vector-speedup branch from 4ace416 to ff3a594 Compare July 25, 2017 15:50

kbrose force-pushed the python-vector-speedup branch 2 times, most recently from c1a806a to 7339847 Compare July 26, 2017 15:42

kbrose changed the title ~~WIP: [Python] (scalar) vector reading speedup via numpy~~ [Python] (scalar) vector reading speedup via numpy Jul 26, 2017

kbrose force-pushed the python-vector-speedup branch from bdbd870 to 2484188 Compare July 26, 2017 17:08

kbrose added 4 commits July 26, 2017 13:20

Add numpy accessor to python flatbuffers scalar vectors

937c499

Update python tests to test numpy vector accessor

c97ca6c

Update appveyor CI to run Python tests, save generated code as artifact

b03fcaf

Update example generated python code

24f38cf

kbrose deleted the python-vector-speedup branch August 1, 2017 18:29

rw mentioned this pull request Oct 5, 2017

Flexbuffer or bytearray support for python [python 3.x, flatc 1.7.1, Ubuntu 16.04 LTS, flatbuffers 2015.5.14] #4447

Closed

mikeholler mentioned this pull request Nov 22, 2017

What can I do to help make PyPi publications a reality? #4507

Closed

aardappel mentioned this pull request Jun 1, 2018

[Python] AttributeError: 'Table' object has no attribute 'GetVectorAsNumpy' #4765

Closed

rw mentioned this pull request Aug 9, 2020

Reading and Writing Binary blobs is incredibly slow with the default API! #4144

Closed
