Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] (scalar) vector reading speedup via numpy #4390

Merged
merged 8 commits into from
Aug 1, 2017

Conversation

kbrose
Copy link
Contributor

@kbrose kbrose commented Jul 20, 2017

Reading vectors with generated Python code is slow. There have been other efforts to speed this up but they appear to have stagnated.

This PR attempts to make a minimal change that covers a use case I personally hit a lot (copying over large vectors that represent nested flatbuffers). There may be other small changes that could drastically speed up other use cases that are not included here since I am unaware of them :)

What's the change?

This PR adds support for accessing a scalar vector as a numpy array of the corresponding type. This is much faster than copying large vectors element-by-element.

  • Update the Table class in Python to have a GetVectorAsNumpy method which returns a zero-copy view (in numpy terminology) into a scalar vector cast as the correct type.
  • Update the python-code-generation code (idl_gen_python.cpp) to generate a method which wraps GetVectorAsNumpy in generated code. This method is named <field name>AsNumpy, which can be compared to the generated name of the method to get the length of a vector, <field name>Length. Attempting to use this method if numpy is not installed will result in an error, but otherwise numpy is optional.
  • Update appveyor CI to run Python tests.
  • Update python docs appropriately.

See also

@kbrose kbrose force-pushed the python-vector-speedup branch from ad2d1fd to 2ad5449 Compare July 20, 2017 20:38
@aardappel
Copy link
Collaborator

I agree numpy support would be nice. I do think it should be optional though, if not every Python install comes with it. You could add a --numpy if needed.

@kbrose
Copy link
Contributor Author

kbrose commented Jul 24, 2017

@aardappel It could probably be made optional, although it's not as easy as just adding a flag to the generator since I've put import numpy statements in the flatbuffers python package as well. The import statement could be moved into the method so that the ImportError: No module named numpy would be delayed until the user tried to call the obviously numpy-dependent method <vector name>AsNumpy(), but that would have performance implications as well. I'll think on it more.

@aardappel
Copy link
Collaborator

@kbrose can numpy support in the library code be factored into its own file?

@kbrose
Copy link
Contributor Author

kbrose commented Jul 24, 2017

Yes, that could be done. It would reduce readability a little as related methods wouldn't be grouped together as much. Another possibility is to find if numpy exists using the imp module and use that to drive the logic.

@kbrose
Copy link
Contributor Author

kbrose commented Jul 24, 2017

Also, it looks like python tests aren't being run in the CI, or maybe I'm just not understanding things correctly. Do you have any insight into that?

@kbrose kbrose force-pushed the python-vector-speedup branch 12 times, most recently from 918ccc7 to 2faaafa Compare July 24, 2017 23:29
tests/py_test.py Outdated
@@ -130,6 +130,8 @@ def asserter(stmt):
invsum += int(v)
asserter(invsum == 10)

asserter(monster.InventoryAsNumpy().sum() == 10)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe skip these tests if numpy is not present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - just force pushed change that should assert the correct error is raised if numpy does not exist, and I run the python tests twice (once w/o numpy, and once after numpy is installed).

if np is not None:
return np.dtype(number_type.name).newbyteorder('<')
else:
raise RuntimeError(('Numpy could not be imported'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe stick the error in a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand. Do you want error in a function to reduce duplicated code in number_types.py and encode.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that was the idea.. though if they're in seperate files that may not be as easy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly not impossible, I could add it to the compat.py file and import it in both places. I'll go ahead and do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to have a custom exception NumpyRequiredForThisFeature that is defined in compat.py, but I decided I didn't actually like putting a function there that throws the error. It's only repeated twice, and would make debugging slightly harder (one extra layer of indirection in the stack trace) if the error was raised.

- "java -version"
- "JavaTest.bat"
- rem "---------------- JS -----------------"
- "node --version"
- "..\\%CONFIGURATION%\\flatc -b -I include_test monster_test.fbs unicode_test.json"
- "node JavaScriptTest ./monster_test_generated"
- rem "-------------- Python ---------------"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

appveyor.yml Outdated
@@ -7,7 +7,9 @@ os: Visual Studio 2015
environment:
matrix:
- CMAKE_VS_VERSION: "10 2010"
TEST_ALL: "no"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes appveyor stop running as soon as it encounters an error. Useful as I'm developing to make appveyor churn through failed tests faster. I can remove this before merging if you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops nevermind it does nothing :P I'll remove it.

@kbrose kbrose force-pushed the python-vector-speedup branch from 4ace416 to ff3a594 Compare July 25, 2017 15:50
@aardappel
Copy link
Collaborator

@rw can you review before this gets merged?

@kbrose
Copy link
Contributor Author

kbrose commented Jul 26, 2017

I'm rebasing this into something more manageable right now

@kbrose kbrose force-pushed the python-vector-speedup branch 2 times, most recently from c1a806a to 7339847 Compare July 26, 2017 15:42
@kbrose kbrose changed the title WIP: [Python] (scalar) vector reading speedup via numpy [Python] (scalar) vector reading speedup via numpy Jul 26, 2017
@kbrose kbrose force-pushed the python-vector-speedup branch from bdbd870 to 2484188 Compare July 26, 2017 17:08
@mikeholler
Copy link
Contributor

@aardappel how often do you publish new versions to pypi? Our engineering team here is interested in using this functionality but would prefer to do so at on official release. We'd love to know what the timeline to next release is.

@kbrose kbrose deleted the python-vector-speedup branch August 1, 2017 18:29
@rw
Copy link
Contributor

rw commented Aug 2, 2017

@mikeholler We need an automated way to create a PyPI. I haven't figured out how to set that up yet. If you or someone you know would like to help us with that, we'd appreciate it!

@kbrose
Copy link
Contributor Author

kbrose commented Aug 2, 2017

@aardappel
Copy link
Collaborator

@rw for the moment, can we push 1.7 to it so it is more up to date? Or does @mikeholler specifically want to use this PR? That would be in a 1.8 whenever it comes.

@mikeholler
Copy link
Contributor

@aardappel we'd love to use the code provided by this PR specifically. What about it prevents it from being included in 1.7 instead of 1.8?

@aardappel
Copy link
Collaborator

1.7 is already out, it's the current official release.

@aardappel
Copy link
Collaborator

@kbrose while the numpy based tests help make sure this functionality works, they now made our tests in general flakey, see the http timeout on this test run: https://ci.appveyor.com/project/gwvo/flatbuffers/build/1.0.982

It's pulling in a LOT of dependencies, including things like openssl which I don't think we need. Can these tests be made less heavy by requiring only numpy? If not, can we make numpy tests optional (with an option to PythonTest.sh or whatever?)

@aardappel
Copy link
Collaborator

@rw may also have some ideas.

@kbrose
Copy link
Contributor Author

kbrose commented Aug 24, 2017

@aardappel Sorry to see that happen. It doesn't feel good when the CI fails and it's the fault of the tests and not the package code. Unfortunately, I don't see a good way to test the numpy methods without pulling numpy from the internet, but relying on the internet during tests will make them more undeterminstic by necessity. Considering the python code was not being tested by the CI servers at all prior to this PR, you may be comfortable with not testing the numpy functions during CI tests until a more robust testing is implemented.

I'll summarize the paths forward I see:

  1. Do not test the numpy methods in the CI (remove lines https://github.com/google/flatbuffers/blob/master/appveyor.yml#L65-L67)
  2. Make the numpy testing a separate appveyor run, and mark those runs as allowed failures.
  3. Reduce the number of dependencies installed by running conda install numpy --yes --no-deps, but beware that there's usually a reason the dependencies are installed...

@aardappel
Copy link
Collaborator

I'm fine with disabling it for now until we can do better. Can you make a PR for that?

We can try the --no-deps and see what happens. I still don't understand why it has those dependencies though, surely numpy itself does not rely on openssl etc.

@kbrose
Copy link
Contributor Author

kbrose commented Aug 24, 2017

I would guess conda says that numpy depends on MKL because that usually speeds things up significantly, and MKL may have been looser with their deps than numpy. But that's just conjecture.

I can try and create a PR when I get home in the evening, but it is equivalent to just deleting (or commenting) those three lines in appveyor.yml I highlighted above.

@aardappel
Copy link
Collaborator

Ok, I can fix that.

@aardappel
Copy link
Collaborator

disabled numpy tests for now: 1f0bd12

@kbrose
Copy link
Contributor Author

kbrose commented Sep 12, 2017

Hello @aardappel I hate to be that person pestering the over-worked OSS maintainer, but do you have an idea of when the next release of flatbuffers will be? I'd like to be able to target an actual release number when managing dependencies but also want to use this feature.

@aardappel
Copy link
Collaborator

There's no schedule for that.. currently it seems to happen once every 6 months or so, or whenever a significant chunk of new functionality has accumulated.

I'd like to help you out by making releases whenever people ask for them, but thats difficult, as different people need different features.

The releases really aren't a lot more stable than regular commits, if you need this feature I'd really recommend to just use FlatBuffers at this commit.

@kbrose
Copy link
Contributor Author

kbrose commented Nov 22, 2017

Greetings! Thank you so much @aardappel for releasing the new version and congrats on 1.8!

@rw Any way we could publish the new version of the python code to pypi? I can try to help out with any pain points if you let me know what issues exist with that publication process.

@rw
Copy link
Contributor

rw commented Nov 23, 2017

@kbrose Thanks for your offer, we'll take you up on it. Our pain paint is that shipping a new pypi package is a 100% manual process right now. Do you know of any good ways to partially or totally automate that?

@ahundt
Copy link

ahundt commented Nov 23, 2017

@mikeholler
Copy link
Contributor

mikeholler commented Nov 23, 2017 via email

@ahundt
Copy link

ahundt commented Nov 25, 2017

@mikeholler perhaps @gwvo (@aardappel) is willing to simply move setup.py to the root location?

@rw
Copy link
Contributor

rw commented Nov 25, 2017

Ideally we would not put setup.py into the root, because the semantics don't make sense: we only want the contents of the python directory to be part of the python package.

We could have a separate tracking repo for this purpose.

@mikeholler
Copy link
Contributor

mikeholler commented Nov 25, 2017 via email

@mikeholler
Copy link
Contributor

I provided more info on the subdirectory issue in #4507. We might consider moving further discussion there for PyPI publication.

@ahundt
Copy link

ahundt commented Nov 29, 2017

FYI I've seen other big projects that support multiple languages and use PyPi put setup.py in the root directory without issue. One example I can think of off the top of my head is https://github.com/bulletphysics/bullet3 and its pypi package pybullet. I don't have strong feelings, just wanted to give an example that works.

@Amit072
Copy link

Amit072 commented Apr 26, 2018

Is there any equivalent API for writing the large byte-arrays into builder without using PrependByte() ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants