-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
About SparseVectors #11324
Comments
I am in favour of (B) and to get it all right for 0.4. I don't like just removing the functionality and then adding it again later, which would break codes yet again. This will be an improvement over what already exists, IMO. |
Agreed that the implementation in SparseVectors.jl is an improvement over what already exists. Don't take my objection over the maturity and name too seriously if there's a consensus for option (B). |
The "inconsistency" between dense and sparse arrays doesn't go away by adding a sparse vector type. Even with a sparse vector, there is still no sparse arrays for dimensions higher than two, so I don't buy the consistency argument and therefore also not the "unnatural" argument. It might be very useful with a sparse vector type, but I think we should focus of the class of operations that could benefit from a sparse vector type and some performance measures. |
Here's your inconsistency argument:
I'm repeating myself, but I don't care that sparse doesn't generalize beyond dimension 2 yet. We're doing dimension 1 wrong, along with the operations that result in dimension 1 from dimension 2. Let's just fix that for now. |
It is these kinds of inconsistencies I don't like, which make the system feel lacking. I wish we had N-d too, but that is likely to happen in JuliaSparse rather than Base. I also feel that a sparse vector will be useful with non-numeric data types as well. |
Although, for non-numeric data types, what is the type for fields not represented? |
@andreasnoack, the storage for a sparse vector and a 1-column CSC matrix is essentially the same, I don't think there are any performance differences. The argument is that the current semantics are wrong. |
I still think that it would be useful to see examples application of sparse vectors. The indexing behavior of special matrix types is hard. It is a standard attack on the special matrix types and one of the reasons they might not fit well as subtypes of |
One common sparse vector application:
|
The SparseVectors package was initially motivated by the need to represent sparse samples in machine learning applications (@johnmyleswhite's example is an important case). For sparse vectors, we often want to compute dot products, compute distances between them, add a sparse vector to another vector (in the gradient descent step), etc. Matrix-vector product is also very useful. These are real applications. Look at recent papers on stochastic optimization, many new algorithms studying the procedures that work on sparse samples, one at a time. |
Looks like, the general opinion is towards option (B). I will make a PR later this week. |
Thanks for the examples. I think all of them could work for vectors represented by |
I preface this with the caveat that I do not do sparse linear algebra, but I have been playing with indexing a lot lately. I've not tried this yet, either. Would it be possible to add a "phony" dimensionality to the CSC type, allowing it to pretend to be one or multi-dimensional? All indexing would be reduced or expanded to its canonical two dimensions. This will be very inefficient for high dimensional arrays, but it should be fairly easy to make it semantically correct and consistent (especially if we get #10525 and #11242). The JuliaSparse organization can provide more specialized types that are more efficient in the non-2-dimensional cases. |
Somewhat OT: Since SparseVectors have decidedly numeric meaning in Julia (i.e. it is zero values that are removed, instead of null values), how would you name a package for storing efficiently sparse vectors of any type, where many values may be null? [think of JSON arrays, RDBMS rows, etc.]... Does something like that already exist, and I just haven't found it yet? Thanks! |
Scott, check out the DataArrays.jl package. On Tue, May 19, 2015 at 8:29 AM, Scott P. Jones notifications@github.com
|
@ScottPJones Nullable types, or DataArrays potentially. @mbauman that sounds finicky and more complicated than just adding a separate type for 1-d. If we're going to distinguish vectors from matrices within the type system for dense arrays, consistency really demands that we do the same for sparse. Faking it adds complexity to the CSC type, whose internals are already quite commonly used and relied upon not changing by packages. I have an application where I build up a sparse CSC matrix representation column-by-column, often reordering columns but rarely changing their contents. For this, an array-of-SparseVectors is an ideal representation. Representing this as a 1-column CSC matrix per "vector" adds overhead for an extra integer field and an extra 2-element integer array for the column pointers for each object, which I'd really rather avoid since my columns have very few nonzeros on average. |
@tkelman, I would love to be able to vcat an array of sparse vectors to get a sparse CSC matrix. I think I'm coming around on this. |
hcat, you mean? |
Yes, hcat |
I'm going to have to disagree with you pretty strongly there. I use sparse matrices very heavily, but I do essentially zero linear algebra with them in Julia. Why? Because Julia's sparse matvec and matmul are still quite a bit slower than doing it in C++ or using MKL's sparse blas extensions, and the sparse factorizations available in Base aren't complete enough for my needs (SuiteSparse has no Bunch-Kaufman indefinite LDL^T). My sparse matrices do have linear (and often custom nonlinear) algebraic meaning, but that gets sent to a compiled solver to do all the actual work. So indexing, reductions, data structure traversal and creation within Julia are really important to me on sparse and structured array types (re: JuliaLang/LinearAlgebra.jl#136). I should send Jiahao an email and see about visiting MIT and working on it for a few weeks leading up to JuliaCon. Eventually Julia will be fast enough to write the solvers themselves - if I can get diagonal-D quasidefinite factorizations from Cholmod to work properly I've been meaning to write a Mehrotra QP solver as a proof-of-concept. Will be interesting to benchmark against Ipopt. |
@jiahao is going to be travelling from now to JuliaCon. I will probably be around MIT a few days before and after JuliaCon - so would love to push on sparse matrix work in any way possible. |
@mbauman The CSC/CSR format is dense in one dimension. It doesn't generalize well to higher dimensions, where you really just want the COO format - representing indices as tuples. I have thought about faking it in CSC, but you just get extra overhead for processing vectors. I would love to make the two representations close enough, so that one can write generic code for iterating through either SparseMatrixCSC or SparseVector. |
@ScottPJones You can store anything in Julia's sparse data structures already:
|
@ViralBShah What about the following? julia> z = sparse([1,2], [1,2], ["hello", 1.5])
2x2 sparse matrix with 2 Any entries:
[1, 1] = "hello"
[2, 2] = 1.5
julia> z[1,2]
ERROR: `zero` has no method matching zero(::Type{Any})
in getindex at sparse/sparsematrix.jl:773 The problem is that Julia's sparse data structures all have 0 not be represented, but for RDBMS rows, you want NULL to not be represented.... |
Julia's sparse data structures although not strictly restricted to numeric values, have only been used with numeric values so far. Hence all the rough edges for non-numeric data. This discussion should really be on the mailing list and not here in this issue, where it is off-topic. |
This is a bugfix. Read the entire thread, this is fixing a semantic inconsistency in the way the current sparse matrix implementation interacts with 1-dimensional indexing. Sparse matrices are far too large, important, and widely used to consider removing before 0.4. We're already months behind schedule, let's not make it any worse.
Considering this has gone poorly every time we've tried it, I have serious reservations about doing this again until we work out how to do it more smoothly. For one thing I've suggested leaving functionality in base but moving it into independent modules instead of exporting it by default, which would leave the functionality still available but behind an import, and cause a much easier-to-deal-with breakage in packages. |
@tknopp @StefanKarpinski, I was on the fence about including sparse vectors in Base, but really this addresses a significant usability and semantics issue with the current implementation, and better yet there's already code to do so and @lindahua is willing to push through a PR. Agreed with @tkelman. Let's make the incremental improvement now and not create a huge new time sink for everyone to deal with. |
Precompilation will be addressed in time, more significant is what consequences saying "sparse matrices are not important enough to worry about in base" would have to the package version of sparse matrices being able to work well at all. If base Julia ignores the existence of sparse matrices and writes all routines in a way that they can only ever be efficient for dense matrices, that will be crippling. You need to rewrite all of LinAlg, every reduction, all of broadcast, all indexing and concatenation over again for every sparse matrix type. And again, squared, for all conversions and linear algebra operations between different types. We need a better framework in base first, for which SparseMatrixCSC can be the first implementation - ideally a small, pluggable one that could live out in a package, but the framework needs to exist first. |
@ViralBShah wrote:
This attitude is what I object to. We had this ongoing steady stream of broken sparse matrix stuff for a long time. We are past the point where it's acceptable to have that in Base – we're trying to get rid of that kind of thing; it's certainly shouldn't be adding more of. |
Stefan, I don't understand why you think it's suddenly fixed. It isn't yet. |
It's better than it used to be and I object to introducing more half-baked functionality into base. |
@StefanKarpinski, I think @lindahua's proposed PR is already or can be brought to a state that's not half baked before merging. |
This isn't half-baked. Things have only gotten better by gradual improvement. Which this is a case of. |
Why should being to large be an argument for not moving it in a package? Importance and widely used are pretty subjective. In my subjective opinion PyPlot is a lot more important and widely used than the sparse module. If I understand Tony correctly there will be major overhaul for the sparse module in the future (e.g. generalization to CSR matrices...) and in my opinion there is no benefit of having this development in base. It will make the development even much harder than if it would be developed in a package. Or does nobody see this point? |
@tknopp, I agree with you and have stated before that CSR shouldn't be developed in base, but that's not related to this PR. Sparse vectors aren't a major overhaul, they're a bugfix. |
@mlubin: This is not a PR yet but a question how to proceed. If you all want this than simply go ahead and do it. The reason I started this discussion is to show the advantages of this being in a package. And I still have not seen an argument why it would be a bad thing. |
We agree with you on everything except timeframe, and maturity of the code removal process. 0.4 shouldn't be removing any more code. 0.4 should also stop adding code that isn't fixing bugs for that matter, and try to get out the door before fall. The bug here is There's also a non-negligible risk of the sparse code bitrotting and losing a lot of visibility if it gets moved out of base. Sparse matrices will need the right infrastructure to exist in base first (which will have to be developed here) before they can work well as a package, and just as importantly they will need enough people to maintain them or they will wither and die. |
This is where I object to. If there is not enough momentum to maintain this then the code is simply not important and should definitely not be in base. Visibility loss is also something which can be said about almost every package and we need other mechanisms to highlight "must have" packages. |
Completely wrong. The perfect time to make major, significant, breaking reorganizations and changes is early in a -dev cycle. We're not early in a -dev cycle, we're overdue for determining scope completion to close the -dev cycle to move to -pre and a release. We've blown past the 6 months for -dev then 3 months in -pre plan already.
Not until a replacement is available, tested, and working. And Base is capable of supporting that package working at all.
We're way off topic, but the counterargument to that is how few commits to base are necessary to make use of this type of code once it's there and in place. Miles, Iain, etc have very few base commits, but their packages are a major driver of adoption. If Julia has a single killer app right now where it's undoubtedly better than the competition, it's OR, which you can't do at all without sparse matrices. Let's not cause major breakage there by deleting code without preparing a replacement first. |
I think this is the core point of Tony's argument and it's completely accurate from what I can see. |
Yes you are right, I agree with that
These discussions always end up with a statement that we are off topic. But where if not here should these discussions happen?
I did not say first remove it and then do the package. Of course the order is the other way around. |
|
OR = operations research, aka optimization, so JuMP, Convex, MathProgBase, the solver bindings, and to some extent Optim. Yes these require external packages, and it probably wouldn't be the end of the world to add an additional dependency package to the mix, BUT, it needs to be done right. It's not yet clear to me, as things stand right now, if taking sparse matrices out of base wholesale can leave both parts working as well together as they did before moving them. edit: I do think this is something we should eventually do, but it's a substantial amount of work and we shouldn't rush it. Either early in 0.5-dev or early in 0.6-dev, if someone wants to push on this then it's worth doing at that time. |
We all want the same thing here, but I don't find the proposal to block a bugfix in favor of a major reorganization as we're trying to finish up 0.4 very convincing. @lindahua has put his valuable time and effort into developing sparse vectors, and we all have better things to spend our time on than dealing with the fallout of removing everything sparse from Base. |
+1 |
Reading this thread in retrospect taking into account Virals change in the milestone hopefully shows that it could be a good move to decouple the sparse array development from Julia... |
@tknopp there's still no particularly clean way of both fixing the indexing inconsistency bug with SparseVectors as a package while also avoiding massive package breakage or name conflicts by trying to move sparse matrices wholesale into packages. If you would like to start helping do some of the work to prepare package migrations for some of the functionality that's in base right now, please go ahead. |
@tknopp I request not bringing off-topic discussions into issues. I also do not see how this conclusion is drawn - but this is a discussion to be had elsewhere. If you do wish to continue this discussion, I suggest the mailing list or an appropriate issue. |
Can this be closed in light of #13440? |
There's been a lot of debate about whether we should incorporate sparse vectors into Julia 0.4.
Here are three options:
SparseMatrixCSC
to emulate a sparse vector. This, as many have pointed out in other issues, breaks the consistency with the dense arrays. It is also unnatural and inefficient.SparseVector
to base before 0.4 is released. The functionalities have been developed and tested in SparseVectors.jl. These functionalities are relatively simple and they don't need a long time to get mature. ([WIP] SparseVector support + tests #8718)SparseVector
to base after 0.4 is released. (Remove methods for creating "sparse vectors" #11323)Personally, I don't like (A), and I am fine with either (B) or (C).
One opinion against (B) is that there can be multiple ways to represent a sparse vector. However, IMHO, the way used in SparseVectors.jl seems to be the most natural way, and it is consistent with both CSC and COO formats. I understand that other formats may be more suitable for certain situations. For such cases, it is not difficult to develop a new sparse vector format (using a different type name of course).
cc: @tkelman @ViralBShah @StefanKarpinski @mlubin
Since @ViralBShah has tagged #8718 as 0.4, and that PR was closed, I tag this as 0.4 in order to make sure that we come to a decision before 0.4 is released.
The text was updated successfully, but these errors were encountered: