-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] (scalar) vector reading speedup via numpy #4390
Conversation
ad2d1fd
to
2ad5449
Compare
I agree numpy support would be nice. I do think it should be optional though, if not every Python install comes with it. You could add a |
@aardappel It could probably be made optional, although it's not as easy as just adding a flag to the generator since I've put |
@kbrose can numpy support in the library code be factored into its own file? |
Yes, that could be done. It would reduce readability a little as related methods wouldn't be grouped together as much. Another possibility is to find if |
Also, it looks like python tests aren't being run in the CI, or maybe I'm just not understanding things correctly. Do you have any insight into that? |
918ccc7
to
2faaafa
Compare
tests/py_test.py
Outdated
@@ -130,6 +130,8 @@ def asserter(stmt): | |||
invsum += int(v) | |||
asserter(invsum == 10) | |||
|
|||
asserter(monster.InventoryAsNumpy().sum() == 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe skip these tests if numpy is not present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - just force pushed change that should assert the correct error is raised if numpy does not exist, and I run the python tests twice (once w/o numpy, and once after numpy is installed).
python/flatbuffers/number_types.py
Outdated
if np is not None: | ||
return np.dtype(number_type.name).newbyteorder('<') | ||
else: | ||
raise RuntimeError(('Numpy could not be imported' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe stick the error in a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand. Do you want error in a function to reduce duplicated code in number_types.py and encode.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that was the idea.. though if they're in seperate files that may not be as easy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly not impossible, I could add it to the compat.py
file and import it in both places. I'll go ahead and do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to have a custom exception NumpyRequiredForThisFeature that is defined in compat.py, but I decided I didn't actually like putting a function there that throws the error. It's only repeated twice, and would make debugging slightly harder (one extra layer of indirection in the stack trace) if the error was raised.
- "java -version" | ||
- "JavaTest.bat" | ||
- rem "---------------- JS -----------------" | ||
- "node --version" | ||
- "..\\%CONFIGURATION%\\flatc -b -I include_test monster_test.fbs unicode_test.json" | ||
- "node JavaScriptTest ./monster_test_generated" | ||
- rem "-------------- Python ---------------" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
appveyor.yml
Outdated
@@ -7,7 +7,9 @@ os: Visual Studio 2015 | |||
environment: | |||
matrix: | |||
- CMAKE_VS_VERSION: "10 2010" | |||
TEST_ALL: "no" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes appveyor stop running as soon as it encounters an error. Useful as I'm developing to make appveyor churn through failed tests faster. I can remove this before merging if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops nevermind it does nothing :P I'll remove it.
4ace416
to
ff3a594
Compare
@rw can you review before this gets merged? |
I'm rebasing this into something more manageable right now |
c1a806a
to
7339847
Compare
bdbd870
to
2484188
Compare
@aardappel how often do you publish new versions to pypi? Our engineering team here is interested in using this functionality but would prefer to do so at on official release. We'd love to know what the timeline to next release is. |
@mikeholler We need an automated way to create a PyPI. I haven't figured out how to set that up yet. If you or someone you know would like to help us with that, we'd appreciate it! |
@rw for the moment, can we push 1.7 to it so it is more up to date? Or does @mikeholler specifically want to use this PR? That would be in a 1.8 whenever it comes. |
@aardappel we'd love to use the code provided by this PR specifically. What about it prevents it from being included in 1.7 instead of 1.8? |
1.7 is already out, it's the current official release. |
@kbrose while the numpy based tests help make sure this functionality works, they now made our tests in general flakey, see the http timeout on this test run: https://ci.appveyor.com/project/gwvo/flatbuffers/build/1.0.982 It's pulling in a LOT of dependencies, including things like openssl which I don't think we need. Can these tests be made less heavy by requiring only numpy? If not, can we make numpy tests optional (with an option to PythonTest.sh or whatever?) |
@rw may also have some ideas. |
@aardappel Sorry to see that happen. It doesn't feel good when the CI fails and it's the fault of the tests and not the package code. Unfortunately, I don't see a good way to test the numpy methods without pulling numpy from the internet, but relying on the internet during tests will make them more undeterminstic by necessity. Considering the python code was not being tested by the CI servers at all prior to this PR, you may be comfortable with not testing the numpy functions during CI tests until a more robust testing is implemented. I'll summarize the paths forward I see:
|
I'm fine with disabling it for now until we can do better. Can you make a PR for that? We can try the |
I would guess conda says that numpy depends on MKL because that usually speeds things up significantly, and MKL may have been looser with their deps than numpy. But that's just conjecture. I can try and create a PR when I get home in the evening, but it is equivalent to just deleting (or commenting) those three lines in appveyor.yml I highlighted above. |
Ok, I can fix that. |
disabled numpy tests for now: 1f0bd12 |
Hello @aardappel I hate to be that person pestering the over-worked OSS maintainer, but do you have an idea of when the next release of flatbuffers will be? I'd like to be able to target an actual release number when managing dependencies but also want to use this feature. |
There's no schedule for that.. currently it seems to happen once every 6 months or so, or whenever a significant chunk of new functionality has accumulated. I'd like to help you out by making releases whenever people ask for them, but thats difficult, as different people need different features. The releases really aren't a lot more stable than regular commits, if you need this feature I'd really recommend to just use FlatBuffers at this commit. |
Greetings! Thank you so much @aardappel for releasing the new version and congrats on 1.8! @rw Any way we could publish the new version of the python code to pypi? I can try to help out with any pain points if you let me know what issues exist with that publication process. |
@kbrose Thanks for your offer, we'll take you up on it. Our pain paint is that shipping a new pypi package is a 100% manual process right now. Do you know of any good ways to partially or totally automate that? |
Looks like travis might be able to do that: |
Yeah, I was looking into a Travis solution. The one thing I'm having a hard time with is figuring out how Travis knows where to find the setup.py file or the sdist tar.gz. All the projects I've seen have setup.py in the directory root, and I don't see a way to configure Travis to look in a subdirectory like this project has.
I'll keep looking into it, but if anyone has guidance on doing this that'd be very appreciated.
…________________________________
From: Andrew Hundt <notifications@github.com>
Sent: Thursday, November 23, 2017 12:07:51 PM
To: google/flatbuffers
Cc: Mike Holler; Mention
Subject: Re: [google/flatbuffers] [Python] (scalar) vector reading speedup via numpy (#4390)
Looks like travis might be able to do that:
https://docs.travis-ci.com/user/deployment/pypi/
http://www.robinandeer.com/blog/2016/09/01/automated-pypi-releases/
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#4390 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABHDULU3Ckmi4pBpcZy7OY_6iJgyePNOks5s5bR3gaJpZM4Oelv7>.
|
@mikeholler perhaps @gwvo (@aardappel) is willing to simply move setup.py to the root location? |
Ideally we would not put We could have a separate tracking repo for this purpose. |
Yeah, I think keeping setup.py out of the root is a good idea (although I do find it odd that the Java pom is in the root). I'm not so sure about having a separate repo. I could see a few disadvantages that could (for example) make testing harder.
I still have hope for a solution that uses Travis but doesn't change the structure of this repo much. There has got to be a way to support projects with setup.py not in the root. If anyone finds something let me know. I fiddled with Travis for about an hour on Wednesday and I should be able to find more time soon.
From: Robert
Sent: Saturday, November 25, 02:57
Subject: Re: [google/flatbuffers] [Python] (scalar) vector reading speedup via numpy (#4390)
To: google/flatbuffers
Cc: Mike Holler, Mention
Ideally we would not put setup.py into the root, because the semantics don't make sense: we only want the contents of the python directory to be part of the python package.
We could have a separate tracking repo for this purpose.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#4390 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABHDUPSRrudnrUlJ3lq4nZf_rQvdSSXtks5s59aEgaJpZM4Oelv7>.
|
I provided more info on the subdirectory issue in #4507. We might consider moving further discussion there for PyPI publication. |
FYI I've seen other big projects that support multiple languages and use PyPi put setup.py in the root directory without issue. One example I can think of off the top of my head is https://github.com/bulletphysics/bullet3 and its pypi package pybullet. I don't have strong feelings, just wanted to give an example that works. |
Is there any equivalent API for writing the large byte-arrays into builder without using PrependByte() ? |
Reading vectors with generated Python code is slow. There have been other efforts to speed this up but they appear to have stagnated.
This PR attempts to make a minimal change that covers a use case I personally hit a lot (copying over large vectors that represent nested flatbuffers). There may be other small changes that could drastically speed up other use cases that are not included here since I am unaware of them :)
What's the change?
This PR adds support for accessing a scalar vector as a numpy array of the corresponding type. This is much faster than copying large vectors element-by-element.
Table
class in Python to have aGetVectorAsNumpy
method which returns a zero-copy view (in numpy terminology) into a scalar vector cast as the correct type.idl_gen_python.cpp
) to generate a method which wrapsGetVectorAsNumpy
in generated code. This method is named<field name>AsNumpy
, which can be compared to the generated name of the method to get the length of a vector,<field name>Length
. Attempting to use this method if numpy is not installed will result in an error, but otherwise numpy is optional.See also