Support for apache arrow. #1232
Yes, we seriously considered Arrow (formerly Feather) as the main backend format for storing data on-disk. However, after careful consideration, we had to conclude that the format could not meet our needs:
In terms of saving datatable Frames into or loading from feather files, this feature was included in our original roadmap (#290), and it still remains there. |
Isn't that a feature actually? You usually won't be able to fit such large data into RAM anyway, and it also makes some operations faster, since it's easier to update 1 of 10 2B arrays instead of a whole 20B one. Most importantly, you should want to make array methods run in constant memory (similar to Spark). This specification makes it obvious, so you don't have to repeat pandas' mistakes.
That depends on where you are getting the data from. If it's your pipeline that saves the data from sensors, then you can save them to Apache Arrow directly. If it's a pandas dataframe / Spark / Parquet, which use Apache Arrow, then you again already have an Apache Arrow dataframe. If it's another source, then you truly have to preprocess them manually, but I am not able to estimate how relevant this scenario is. |
Uncompressed, 2B double-precision elements take < 15GB. So on a high-end machine with 2TB you can fit more than 270B elements. OK, not everyone has 2TB, but it is better to be future-proof than to re-implement this after a year or two. By asking to update 1 of 10 2B arrays you are basically asking for partitioning, which could be implemented anyway. |
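For anyone who wants to check the numbers in this exchange, here is a rough sketch of the arithmetic. It assumes 8-byte doubles and reads the "GB"/"TB" figures as binary units (GiB/TiB), which is my assumption and not something stated in the thread; the function names are made up for illustration.

```python
# Back-of-the-envelope sizes for arrays of 8-byte doubles.
# Assumption: "GB"/"TB" in the discussion are read as binary units.
BYTES_PER_DOUBLE = 8
GIB = 2**30
TIB = 2**40


def array_size_gib(n_elements: int) -> float:
    """Size in GiB of an uncompressed array of doubles."""
    return n_elements * BYTES_PER_DOUBLE / GIB


def max_elements(ram_bytes: int) -> int:
    """How many doubles fit into the given number of bytes."""
    return ram_bytes // BYTES_PER_DOUBLE


print(f"{array_size_gib(2_000_000_000):.1f} GiB")  # 14.9 GiB for 2B doubles
print(max_elements(2 * TIB))                       # 274877906944, i.e. about 275B
```

With decimal units the same 2B-element array comes out at 16 GB, so the "< 15GB" figure only holds under the binary reading.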
@AnthonyJacob Suppose I have an array of 10B doubles. That's 80GB of memory, more than my laptop can handle. But if that array lives on a hard drive in a Jay file, I simply memory-map the entire file, and now it behaves as if it were loaded into RAM. But it wasn't. The data is still on disk, and the RAM is free. If you need to update a single value somewhere in the middle of that array, the operating system will load just 1 page (4KB) of data into memory, update that value, and then save it back to disk. (This is all assuming we opened the file in read-write mode). |
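A minimal sketch of the memory-mapping behaviour described above, using only Python's standard library. This is not how Jay files are laid out; the file here is just a raw run of little-endian doubles, and the helper names (`write_zeros`, `update_value`, `read_value`, `demo`) are hypothetical.

```python
import mmap
import os
import struct
import tempfile

DOUBLE = struct.Struct("<d")  # one little-endian 8-byte float


def write_zeros(path, n_elements):
    """Create a file holding n_elements doubles, all 0.0."""
    with open(path, "wb") as f:
        f.truncate(n_elements * DOUBLE.size)


def update_value(path, index, value):
    """Overwrite a single element through a read-write memory map."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            # Only the page containing this offset needs to be touched.
            DOUBLE.pack_into(mm, index * DOUBLE.size, value)
            mm.flush()


def read_value(path, index):
    """Read a single element through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return DOUBLE.unpack_from(mm, index * DOUBLE.size)[0]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.bin")
        write_zeros(path, 1_000_000)       # 8 MB on disk
        update_value(path, 500_000, 3.14)  # touch one value in the middle
        return read_value(path, 500_000)


print(demo())
```

The file is never read into a Python object as a whole; the operating system decides which pages to bring in, which is the point being made in the comment.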
We support arrays > 2^31 elements. This was changed a pretty long time ago. There might be a document here or there which still says that lengths are limited to int32_t, we just need to fix that.
This isn't true -- memory mapping still works fine, but the null bits and the values are in different memory regions. |
For what it's worth, I have reviewed pydatatable's bespoke columnar in-memory format and don't see reasons why you could not use Arrow as your primary runtime memory format (in-memory and memory-mapped on-disk). There are some minor things like string values with > 2B length (there is an open issue about adding a "large binary" type with 64-bit offsets). It's a bit too bad as we are developing a lot of software that could be utilized natively in this project without need for any data marshalling. This would be a valuable discussion to have as an open source community; if there's something that the Arrow community could build to be of service, it would be good to know. |
@wesm 1. Why did we not go with Apache Arrow from the beginning? 2. Could we use Arrow as the primary in-memory and on-disk format?
3. Could we at least allow conversions from/to Arrow Table? |
hi @st-pasha,
You wrote:
This isn't correct. Here is the blog post introducing the Feather format 2.5 years ago Critically, I wrote:
Feather is very clearly from its first announcement, a minimalist storage format that uses the Arrow format, and a compact subset of it. Feather was a 2 week experiment to socialize the idea of interoperable data frames with the R and Python communities -- a success, I think, but I guess that it's led to some misconceptions about the Arrow project in general.
I would be interested in exploring the NA analysis in more depth with you. My guess (which we can verify through benchmarking) is that the performance difference between bitmasks and sentinel values is not going to drive the bottom-line performance of these projects. Bitmasks have other significant benefits:
One of the main operations where performance may be somewhat degraded is reductions like On
There is nothing to stop you from annotating a type with custom metadata to indicate that you are using sentinel NA values. The downside is that a foreign Arrow consumer won't necessarily know about your special values, but I believe that's an acceptable compromise.
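To make the two NA representations under discussion concrete, here is a small pure-Python sketch. It is illustrative only: neither datatable nor Arrow is implemented this way, NaN is used as the sentinel purely as an example, and the function names are hypothetical. The one detail taken from the Arrow columnar format is the bitmap layout: bits are numbered least-significant first within each byte, and a set bit means the slot is valid.

```python
import math


def build_bitmap(valid_flags):
    """Pack booleans into an Arrow-style validity bitmap (LSB-first, 1 = valid)."""
    bitmap = bytearray((len(valid_flags) + 7) // 8)
    for i, ok in enumerate(valid_flags):
        if ok:
            bitmap[i >> 3] |= 1 << (i & 7)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """True if element i is non-null according to the bitmap."""
    return bool((bitmap[i >> 3] >> (i & 7)) & 1)


def sum_with_bitmap(values, bitmap):
    """Sum only the elements whose validity bit is set."""
    return sum(v for i, v in enumerate(values) if is_valid(bitmap, i))


def sum_with_sentinel(values):
    """Sum skipping elements that hold the NaN sentinel."""
    return sum(v for v in values if not math.isnan(v))


# Sentinel layout: the NA lives inside the value buffer itself.
sentinel_values = [1.0, math.nan, 3.0]

# Bitmask layout: values and null bits live in separate buffers;
# the slot under a null bit can hold anything.
bitmask_values = [1.0, 0.0, 3.0]
validity = build_bitmap([True, False, True])

print(sum_with_sentinel(sentinel_values))
print(sum_with_bitmap(bitmask_values, validity))
```

Both calls produce the same sum; the difference is only in where the "missing" information is stored, which is why a simple marshal step between the two layouts is feasible for flat columns.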
Since you don't support nested data, converting from flat Arrow tables to the tables in this library is pretty simple, so even if you never move to Arrow natively internally, you can still benefit from the Arrow ecosystem with an extra marshal step. |
Another matter to keep in mind: MPL 2.0-licensed code cannot be reused in projects (except in binary form) in the Apache Software Foundation. |
Ah, my bad. Being the author, you, of course, know better. I guess what I meant was that, purely from a user's perspective, there was a package called "feather(-format)". And that package was used to create ".feather" files. The Arrow, as a technology, did not have a standalone presence in python. Today, the feather package was merged into the
Generally, MPL-2-licensed code can be included in source form into other projects, including into the Apache-v2 projects, provided the following condition is met: The included source files retain their MPL-2 license. Your project will still be Apache-v2, but somewhere inside there will be an island of MPL-licensed code. Your project can still be used freely, for any purpose, except that the license of that small MPL piece cannot be altered. This licensing peculiarity really only becomes relevant if someone were to take your project, modify it, then relicense under a closed-source license (i.e. GPL or commercial software), and finally distribute -- then and then only they will be forced to publish whatever changes they made to the MPL piece. Now, it is possible that Apache Software Foundation puts additional restrictions on which licenses can or cannot be used, beyond what is allowed by Apache-v2 license. However, I do not know how to change their mind. |
I don't know what you're saying; factually speaking this is not correct either. It's true that the libraries did not have a huge install base at the time that this repository's history commenced, but to say that an active Arrow-for-Python project did not exist is not accurate. You can have a look at the source tree for yourself as of 2/2017 https://github.com/apache/arrow/tree/f6924ad83bc95741f003830892ad4815ca3b70fd/python. Here are some blog posts written then or earlier:
Reciprocity at the source level is not something the ASF is going to change its mind about. Being able to use, but not modify, a piece of code practically makes that code useless in the context of an Apache-2 project IMHO. Even outside of an ASF project (e.g. in pandas or another project) I would not be willing to take on weak copyleft code because of the risk of IP contamination. But perhaps this is venturing into open source ideology. I would advise you to consider changing your license to Apache-2 to be more compatible with the Python ecosystem, which is BSD/Apache-2-centric. I would have liked to have had the opportunity to discuss these things with you 18 months ago but I only recently became aware of this project's existence and I'm reaching out now to see if we can find a way to help each other going forward. Each developer is entitled to their own decisions, but it is a shame that our work is not more mutually beneficial at the moment. To summarize the points I've made:
|
I agree, it would have been nice to have Apache-2 license for the project. I'm not entirely free in this choice. Still, the current situation is much better than if it was GPL. I do not know whether it will be feasible to switch to Apache in the future, but I'd be looking into that direction. At the very least, I believe I can dual-license certain parts of the code that might be useful for other projects. Speaking of GPL. There is a package called plotnine, which is an awesome visualization library, based on R's |
How would Python programmers be able to reuse an R package? I don't think there's anything wrong with releasing GPL packages in R because the whole ecosystem is heavily copyleft. In the Python data ecosystem, copyleft is unusual and will generally result in people either not using or not contributing to a project. I certainly would not. |
@wesm Are you aware of data.table's change of license from GPL to MPL? The PR was Rdatatable/data.table#2456. I'm interested in your thoughts on my thoughts expressed in some detail there. |
I can read that in more detail and comment in more depth when I have some time. In the R world, GPL vs. MPL doesn't seem to make a practical difference since distributing applications written in R will generally carry a transitive GPL dependency. There are situations where MPL would be better though. Personally, my preference is for all contributors to an open source project that I am involved with to have equal, unconstrained freedoms. This includes the freedom to create and distribute a closed source derivative work. In our modern times it is incredibly difficult to sustain open source projects, and to place restrictions on the reuse of works contributed to a project does not seem fair to me. So if I contribute a patch to a project, I want to be able to reuse my patch as I please. And I think it is fair that everyone else that I ask to contribute to the same project have that same freedom. I don't think people who feel differently are "bad" or "wrong", it's just my preference about what kinds of projects I will choose to involve myself with. I understand there are OSS contributors who do not want to allow others to create closed source derivative works, and maybe they won't contribute to an Apache2/MIT-licensed project. That's OK with me; at that point we are basically having a religious debate and I think each person is entitled to their opinion. I don't think there's a license available that will satisfy all parties. |
@wesm Great - yes please do read the PR in full when you've time. I'm hoping you will appreciate that at least I did try to find middle ground. |
hi @mattdowle, I read through your PR. I appreciate the time and thoughtfulness you put into it. I think it articulates the general copyleft sentiment and the use of MPL allows closed source products (like the ones of your employer) to use the open source project in unmodified form without IP contamination. When I spoke about open source ideology above, here is where we diverge
and
I think this is the irreconcilable difference of copyleft adherents and permissive license adherents. I do not feel that permissive licenses are unfair to contributors; quite the opposite: I think that strong copyleft and weak copyleft licenses are unfair to contributors. Why? Well, because, by act of contributing your IP to a project, that IP has basically stopped being yours anymore (because it generally only has value in the context of where it is contributed). It now belongs to the Project, and you cannot use it as you wish. The copyright holders can all agree to change the license later, but if you change to Apache/MIT, as you point out, there's no going back. In permissively-licensed projects, the IP never stops being yours. In fact, the IP belongs to everyone. Your copyright is preserved, and you agree (in the case of Apache-2) that, for any software patents you hold (in the United States), you are permitting users of that software an irrevocable right to use it as they please. Personally, if someone creates "pandas pro" and makes money off it, it'd be perfectly fine with me. That's the deal I've made with the open source community. By the same token, I can customize the software for my own purposes and ship it in a closed source project. I can understand the desire to prevent any of these things from happening. As one small point: one head-scratching bit about MPL 2.0 is that it seems to be a bit of a No-Man's-Land between GPL-like licenses and permissive licenses. It's too liberal to satisfy Stallmanites, and too restricted (because of the inability to reuse code in permissive projects) to attract permissively-minded contributors. I will be interested to see what kinds of individuals are attracted to contributing to these projects. As another point: what license makes most sense is heavily project dependent. 
I mostly work on "data tool" projects where I have relied on a great deal of code reuse from other projects and have been able to make a lot of rapid progress that way, for the benefit of the open source world. For building "end user applications", where reuse of code might not be as interesting, I could see myself using GPL or contributing to a GPL project. But not for something related to data science or data manipulation. That's just my preference. In any case, I look forward to finding some ways to collaborate when it comes to the Apache Arrow ecosystem -- there's a lot of exciting stuff going on or being planned for the coming years, and I'd like as many projects and ecosystems to benefit from the fruits of our labors as possible. I dream of a less fragmented and more efficient and productive world, and that's what drives a lot of what I do. Thanks for reading |
By the way, one of the stipulations of projects in the Apache Software Foundation is that derivative works are not allowed to use the project trademark to promote their product, except by the limited "Powered by" use. So you could sell "Data Power Tool powered by Apache Arrow" (OK) but not "Apache Arrow Pro" (enforceable trademark violation) |
Thanks @wesm for reading and your thoughts. I believe we are going to meet for the first time at DSC Stanford on Monday/Tuesday so perhaps we can discuss further there. I look forward to meeting you. If pydatatable was Apache-2 copyright H2O, would you contribute/use it? Or need it be Apache-2 copyright Apache Foundation before you would contribute/use?
If I understand correctly, this isn't true. When r-datatable contributors contribute, they still own the IP for their changes. When I changed the license of r-datatable I couldn't do that without asking every single past contributor to the project, which I did. If the IP didn't still belong to the contributors, I wouldn't have had to ask them. They can take their own contributions and do with them as they wish. Contrast this to projects which are owned/copyright by a company, specific person or a foundation. There I find it easier to see how the contributor is giving up their IP to the project owner.
I find this hard to believe! Really? Perhaps you mean that you don't think that will happen, no? Let's say it was called grizzly to avoid the trademark violation and had the same API as pandas so everyone switched to it easily by paying for it.
I agree and this is admirable. I look forward to meeting you on Monday/Tuesday where perhaps we can discuss further. |
Yes, let's definitely talk more in person next week -- looking forward. A couple small points to your comments
H2O copyright would be fine. We acknowledge the copyright holders for third party code that has been reused in part or whole in Apache Arrow: see bottom of https://github.com/apache/arrow/blob/master/LICENSE.txt
From a legal standpoint, you are correct. From a practical standpoint what I'm saying is that the IP may have significantly less value to you because it is usually only useful in the context of the project where the IP was contributed (that's what I meant by "because it generally only has value in the context of where it is contributed"). I imagine that a large fraction of data.table patches are so specific to data.table that they would have little practical value outside of the project, or have to be largely rewritten to be used elsewhere. When you contribute to a permissive project, the value of your IP is not diminished as you can use it freely along with the code that it requires to be useful (the rest of the project and the IP created by others). Regardless of your ideology (copyleft-leaning vs not) I think you must concede that this is the case (restrictions on reuse vs. no restrictions on reuse).
Yep, I am 100% serious (assuming that a trademark violation or otherwise marketing abuse did not occur). Personally, the value I derive from open source is the community and the free exchange of ideas and code. Someone who goes closed source and starts selling something is closing themselves off to the OSS community in a lot of cases. I'm a bit of an anarchist in this regard -- if my IP is used to create a product that is successful or that is disruptive to the status quo in some way, that is a win in my book even if the win for me is not monetary in nature. In the best case scenario (which turns out to not be all that rare), the for-profit closed source project may serve as a source of funding and maintenance support for the project. To play devil's advocate, a certain Major Tech Company has recently garnered a reputation for creating for-profit service offerings based on permissively licensed OSS projects, while giving little or nothing back to the OSS projects. This has definitely tested the faith of some. Personally I think it's a short-term-greedy-long-term-foolish thing -- that strategy can't be run forever if the community-oriented OSS world burns out and begins to fade away. Here's a talk I gave about this recently: https://blog.dominodatalab.com/importance-community-led-open-source/ |
Good discussion folks! From an open source standpoint - H2O.ai (company supporting the amazing team of developers on datatable) is boldly committed to open source and in fact supports or prefers picking the most permissive model - Apache / BSD-type/MIT by default. |
Duplicate with #1819? |
@st-pasha I think this is also not true any more:
Source: https://arrow.apache.org/docs/format/Columnar.html#array-lengths |
This is true. Since Arrow became more accommodating, we are slowly shifting certain details of internal representations to conform to Arrow's format. |
This has been implemented recently: Closing this issue, feel free to reopen if needed. |
Is there any reason why you did not go with apache arrow format from the beginning?
It would at least be nice if you allowed to_arrow_table and from_arrow_table conversions.