RFC: Missing values by Sentinels #9363

Closed

Conversation

mrocklin

Here is a draft of missing-value support via sentinels. This is my first real go at Julia, so I suspect I'm doing many things badly. Still, hopefully the general design is sound. More on the missing bits at the end of this comment. Work was done with @dchudz.

Sentinels represent missing values (NA) with rarely used values from the domain of each datatype. In particular we use the following values for the following types:

  • Bool: The integer 2
  • Float32/64: The exponent signalling NaN and the mantissa 1954 (the birth year of an R creator)
  • Int32/64: The minimum integer, -2^31 + 1 / -2^63 + 1
  • Complex: The float definition of NA for both real and imaginary parts

We extend arithmetic and mathematical operations to respect the new semantics of these special values. Each operation is now accompanied by an isNA check.
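
For concreteness, a minimal hand-written sketch of one such lifted operation, using this PR's Option, isNA, and NA names (the PR leans on metaprogramming to generate many definitions like it):

```
# Sketch: each arithmetic operation carries an isNA check
+(x::Option, y::Option) = (isNA(x) || isNA(y)) ? NA : Option(x.val + y.val)
```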

The choice of sentinel values here agrees with the choices made by the R language and subsequently by the DyND project in Python (cc @mwiebe); see the article on R sentinels.

Adopting this as a cross-language standard for missing values might be of some use.

Example

```
julia> x = Option(1)
1

julia> x + 3.14
4.140000000000001

julia> typeof(x + 3.14)
Option{Float64}

julia> x = Array{Option{Int32}}([1, 2, NA, 4])
4-element Array{Option{Int32},1}:
 1
 2
 NA
 4

julia> x[4] = NA
NA

julia> x
4-element Array{Option{Int32},1}:
 1
 2
 NA
 NA

julia> mean(x)
NA

julia> mean(nonnull(x))
1.5
```

Issues

I have no idea if this is an appropriate solution for missing values in Julia. DataArray (masked arrays) and Nullable (explicitly boxed) are two other fine solutions in various contexts. This is just a third.

If this is an appropriate solution then certainly my implementation could use a lot of work. In particular

  1. Option should be a subtype of Number. When I do this I run into a large number of dispatch ambiguities.
  2. We need to extend all elementary mathematical functions. Is there a list in Base somewhere? DataArray has such a list. It'd be nice if that were moved somewhere closer to core.
  3. I haven't yet invested in learning how to properly test while developing in Julia. As a result there are no tests.
  4. In many cases when we use Number we really mean {Int32, Int64, Float32, Float64, ...}, i.e. all of the Option-able types. Option does not extend to non-primitive numeric types.
  5. The singleton NA could use better interaction with promotion rules and such.

Also, presumably people have discussed this before. I'm coming into this without much context. My apologies.

@ihnorton
Member

(some related context: #8152 #8423)

@johnmyleswhite
Member

Would be good to get @wesm to chime in here.

@JeffBezanson
Member

I think this would be a great package. We probably don't want to add another approach to missing data to Base right now. However, the cross-language interop aspect is interesting.

The code is pretty good; I would say you've picked up the language quite well.

The first issue I see is that converting both 1 and 2 to Bool gives true, so that case will be tricky. I'm also not sure the "define every function" (sin, cos, etc.) approach will work well enough. Instead we could start with e.g.

lift_na(f, x) = isNA(x) ? x : Option(f(x.val))
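
A usage sketch, assuming the PR's Option and NA (output shown for illustration):

```
julia> lift_na(sin, Option(1.0))
0.8414709848078965

julia> lift_na(sin, NA)
NA
```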

@mrocklin
Author

> I'm also not sure the "define every function" (sin, cos, etc.) approach will work well enough.

Can you say more about your concerns here?

When I say "every" I really mean these. These lists plus a tiny bit of meta-programming appear to go a long way.

I guess my goal here is to make missing values syntactically invisible to the user in the common case. For all its faults, R does this well.
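
The metaprogramming in question might look like this (a sketch; the function list here is illustrative, the real one would be pulled from Base or DataArrays):

```
for f in (:sin, :cos, :exp, :log, :sqrt)
    @eval Base.$f(x::Option) = isNA(x) ? NA : Option(Base.$f(x.val))
end
```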

@mrocklin
Author

Happy to move this to a package. I'd still like to get feedback here, though, if we can keep the PR open for a bit.

Tips on how to get past some of the type issues would be welcome. I suspect I need to read up on Julia's abstract types to define some sort of Optionable > {Int, Float, Bool, Complex} type class.

```
promote_rule{T<:Number, S<:Number}(::Type{T}, ::Type{Option{S}}) =
    Option{promote_rule(T, S)}
promote_rule{T<:Number, S<:Number}(::Type{Option{T}}, ::Type{S}) =
    Option{promote_rule(T, S)}
```
Author

Is promote_rule commutative?

Member

Typically you only need to define it in one direction.

Member

promote_rule isn't, but promote_type uses typejoin(promote_rule(S,T), promote_rule(T,S)), so promote_type is guaranteed to be symmetric even though promote_rule isn't.
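
In code, a single direction suffices (a sketch in the parametric method syntax of the era; note that promote_type, not promote_rule, belongs on the right-hand side):

```
# One direction is enough: promote_type tries both argument orders and
# typejoins the results, so the undefined direction contributes the bottom type.
promote_rule{T<:Number, S<:Number}(::Type{Option{T}}, ::Type{S}) =
    Option{promote_type(T, S)}

promote_type(Option{Int32}, Float64)  # Option{Float64}
promote_type(Float64, Option{Int32})  # Option{Float64} as well, via typejoin
```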

@wesm

wesm commented Dec 15, 2014

FWIW, I've actually stopped believing that sentinels are the correct universal solution for representing missing data. It might make sense to have a compatibility layer in Julia for receiving/emitting data with missing values encoded in a particular way (for interactions with R/Python/etc.), but the kicker in my view is that sentinels leave you not 100% compatible with database systems, which use an extra bit or byte to represent nullity and thus are not burdened by a missing bit pattern in each of their data types; e.g. a TINYINT in a database can hold both -128 and NULL without any problems.

Unfortunately, Python-land is still a bit crippled in this regard, with pandas's hand-wavy support for missing data (a legacy issue, of course; we made the best decisions we could under the circumstances and still ship a piece of software that people adopt broadly) and the lack thereof in NumPy.

Happy to speak some more about this sometime; I should write a more well-reasoned and -written blog post ;)

@johnmyleswhite
Member

Thanks, @wesm, for chiming in. Much appreciated!

@Keno
Member

Keno commented Dec 15, 2014

I think it would be good for this to have the same API as the Nullable type in Base. Then the user can choose whichever approach is more appropriate for his or her application.

@nalimilan
Member

I agree with @Keno that making this as similar as possible to Nullable is a good idea. Maybe NA would be better called null/NULL? Cf. #9356.

@mrocklin
Author

I don't have strong thoughts one way or the other on how to represent missing values. However I will go ahead and champion the sentinel choice for the sake of argument.

In almost all cases, finding a spare value in a type's range is free. IEEE floating point has 53 bits of NaN payload to play with, Bool has 256 - 2 spare byte patterns, and signed integers generally have one more negative number than positive (e.g. Int8 goes from -128 to +127). The only troublesome case seems to be, as Wes points out, UInt8. However, this might be a sufficiently important case, and I don't see any easy way around it.

In terms of database compatibility, it's rare for me to see user code interact directly with database storage systems. Cross-language operation with R seems more common/important to me, though my personal experience is a pretty biased sample.

Moving back a bit from how one represents null values in memory one might consider just how users interact with them. I think that naive x + NA -> NA behavior has real value. Explicit packing/unpacking of Nullables challenges my mental model of how naive users behave. While it's technically clean I think it'll feel "complicated" to some users. If the community decides to go with packing an extra Bool along with the value then I might still suggest an alternative user interface to what currently exists in Nullable.

Mostly though, I think that the main reason to go with sentinels isn't performance; it's seamless compatibility with legacy R code.

@StefanKarpinski
Member

There are two major issues with using sentinels – performance and lack of composability. Floats do happen to work reasonably well using NaNs as NA (although then you have weird crap like whether NaN + NA is NaN or NA depending on argument order) – this is the case in R:

```
> NaN + NA
[1] NaN
> NA + NaN
[1] NA
```

For integers, it's worse since the native operations do random stuff to sentinel values, so you have to either put checks around all your accesses or do the work normally and then patch it up at the end (but this is what you do for the case of keeping NA flags separately anyway).

The composability issue is a much more intractable one. How do you support arbitrary data types that can be NA? There's no good way to do it (that I've ever seen).

@mrocklin
Author

Yeah, for arbitrary types things do become unclean. Adding a null byte generalizes more easily. Possible solutions in the sentinel case:

  1. Only support primitive types
  2. Define a function NAof that gives the null value of any type and use that as an extension point; that's what I do here anyway (see the sketch below).
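
A sketch of that NAof extension point, using the sentinel values listed at the top of this PR (method spellings illustrative):

```
NAof(::Type{Int32})   = typemin(Int32) + 1   # -2^31 + 1
NAof(::Type{Int64})   = typemin(Int64) + 1   # -2^63 + 1
# NaN with mantissa payload 1954, following the R convention described above
NAof(::Type{Float64}) = reinterpret(Float64, 0x7ff0000000000000 | UInt64(1954))

isNA(x::Int32)   = x == NAof(Int32)
isNA(x::Int64)   = x == NAof(Int64)
# Floats need a bitwise comparison, since NaN == NaN is false
isNA(x::Float64) = reinterpret(UInt64, x) == reinterpret(UInt64, NAof(Float64))
```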

How does performance differ between sentinels and other representations of nullity? There must be a check at some point. I can imagine masked-array solutions like DataArray getting some performance benefit by doing "checks" in bulk via bitwise AND. My guess would be that sentinels and the current Nullable solution would be about the same.

@StefanKarpinski
Member

The performance advantage for using an external array to store a NA mask is that you can use BitVectors for them and then compute the combined NA mask of u + v as u.nas | v.nas – which computes 64 at a time. I'm not sure what the fastest way to do this same work with sentinel values is. Is it better to do the vector op first, ignoring sentinels and then go through and patch things up? Or is it better to loop over the vectors and compute

```
w[i] = ifelse((u[i] == SENTINEL) | (v[i] == SENTINEL), SENTINEL, u[i] + v[i])
```

? I don't know. I also don't know that the version with packed bitmasks is faster, but I'm betting it is.
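
For reference, the packed-bitmask variant of vector addition might look like this (a sketch with assumed field names, not DataArrays' actual layout):

```
immutable NAVector
    data::Vector{Float64}
    nas::BitVector               # true where the element is NA
end

# Do all the arithmetic unconditionally, then combine the masks:
# the OR runs over the packed chunks, 64 NA flags per instruction.
add(u::NAVector, v::NAVector) = NAVector(u.data + v.data, u.nas | v.nas)
```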

@mrocklin
Author

That matches my intuition as well. I'm not generally comparing scalar sentinels to arrays with separate bitmask arrays. I'm mostly comparing them to things like Nullable.

@StefanKarpinski
Member

We wouldn't be storing things as arrays of Nullables, or at least that's not the plan, I don't think.

@ViralBShah changed the title from "Missing values by Sentinels" to "RFC: Missing values by Sentinels" Dec 18, 2014
@nalimilan
Member

> Moving back a bit from how one represents null values in memory one might consider just how users interact with them. I think that naive x + NA -> NA behavior has real value. Explicit packing/unpacking of Nullables challenges my mental model of how naive users behave. While it's technically clean I think it'll feel "complicated" to some users. If the community decides to go with packing an extra Bool along with the value then I might still suggest an alternative user interface to what currently exists in Nullable.

Yes, I think an alternative API that is more tolerant of missing values would be very useful for many people. When you know for sure that (many) missing values are present in your data, you really don't want to get an error for them, or you'll spend your day handling errors that clutter your code. As Terry Therneau says, "R succeeded because it's useful", and we need to make Julia at least as useful as R with regard to missing values for it to succeed with that user segment.

@StefanKarpinski
Member

How you represent that a value is missing is orthogonal to how easy it is to use. The main issue is that in Julia 1 is an Int literal whereas in R 1 is a literal for R's integer type which is always nullable. Using sentinel values or using Nullable types makes no difference to usability – you still need a lot of implicit conversions to make things work smoothly. The tradeoff is that implicit conversions make it easy to accidentally write type unstable code.

@johnmyleswhite
Member

A couple of points to add to the discussion:

(1) It's very important that any API work with more than just primitive types. If you want Julia to interact with Hive using the same API that we use to interact with SQLite3, we can't allow any divide to exist between primitive types and more complex types like Array columns and Map columns.

(2) Whether user code interacts with databases directly depends on who you think of as a typical user. I would bet you're probably thinking of scientists. But I'm thinking of people who build websites. Those people are already using a bunch of languages that have constructs like Nullable, precisely so that they can seamlessly work with databases without having to figure out whether the database will think that 0xff means 0xff or whether it means NULL.

(3) I'm not convinced that interop with R is very important. Many of the classic R packages are actually wrappers around C or Fortran code that doesn't care about R's sentinel semantics. This is true, for example, of the glmnet package -- which unconditionally prohibits data from containing any missing values. I believe that we need interop with the C code from those packages, but not necessarily with the R wrappers around those C libraries.

(4) Allowing x + NA -> NA is arguably nice, but it's even less generalizable than the idea of adding NAof to every type T. You have to re-define every single function that will ever be written in Julia to work with NA. And you can't do this correctly in general, because NA represents an epistemological construct that doesn't have simple semantics. You might hope that you can pretend that op(..., NA, ...) -> NA for all operations. But you can't, because op(..., NA, ...) -> true for some logical operations and op(..., NA, ...) -> false for other logical operations. And three-valued logic itself produces perverse conclusions for expressions like exp(NAof(Int)) >= 0.

(5) I'm generally not that worried about things feeling "complicated" to users. I think that many definitions of "complicated" ignore the problem that a language that seems "complicated" on day 1 may seem "simple" on day 10, whereas a language that seems "simple" on day 1 may seem "complicated" on day 10. This is my exact experience with both R and Perl: the languages seem more and more complicated as I get to know them better. In contrast, the definition of Nullable is the kind of thing that seems confusing for a few hours and then makes perfect sense for the rest of your life. I have the same feelings about parametric types in Julia. I think they confuse lots of newcomers. Then you come to understand them and they seem as simple as any other language construct.

(6) I think there's not a lot to be gained from saying that "R succeeded because it's useful" without causal evidence to support those kinds of conclusions. R might well have succeeded because of randomness in how programming communities form.
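
Point (4) in concrete terms (intended three-valued semantics only, not working code):

```
# Kleene (three-valued) logic, as in R: NA does not blindly propagate
#   NA & false   ==>  false   (false whatever the missing value was)
#   NA | true    ==>  true    (true whatever the missing value was)
#   NA & true    ==>  NA      (depends on the missing value)
#   exp(NA) >= 0 ==>  NA      (even though the inequality is mathematically certain)
```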

@nalimilan
Member

@StefanKarpinski Yes, I think that's what @mrocklin meant in the first sentence I quoted. But I'm not sure about the type instability: as long as operations with one nullable always give a nullable, everything should be stable.

@johnmyleswhite
Member

The problem with silent propagation of NULL is that it makes simple things easy and hard things impossible. You can write a 100 line script easily. But your 10,000 line system will fail randomly and will be impossible to debug.

@StefanKarpinski
Member

> (3) I'm not convinced that interop with R is very important. Many of the classic R packages are actually wrappers around C or Fortran code that doesn't care about R's sentinel semantics. This is true, for example, of the glmnet package -- which unconditionally prohibits data from containing any missing values. I believe that we need interop with the C code from those packages, but not necessarily with the R wrappers around those C libraries.

I agree with this. However, I do think that we should have types like the ones in this PR that provide binary compatibility with R while presenting the same interface as Nullable. That way we can do zero-copy interaction with R and also interact smoothly with databases, etc. This is not necessarily an either or kind of decision. But we do need to settle on a common interface.

@nalimilan
Member

@johnmyleswhite I think we have to agree to disagree on the question of silent NULL/NA propagation. We clearly do not have the same use cases in mind. While for 90% of the infrastructure our goals are the same, for the remaining 10% it seems we will need different semantics. We already had this discussion elsewhere, but basically, when you know with full certainty that NULL/NA values are in your data, your code won't "fail randomly": it will fail all of the time if you assume there are no missings. So you just need propagation, and an easy way to skip missing values.

@johnmyleswhite
Member

> I agree with this. However, I do think that we should have types like the ones in this PR that provide binary compatibility with R while presenting the same interface as Nullable. That way we can do zero-copy interaction with R and also interact smoothly with databases, etc. This is not necessarily an either or kind of decision. But we do need to settle on a common interface.

I agree that the important part is the API, not the implementation. But they're unfortunately somewhat intertwined. For example, if we want to support the semantics that Evan Miller advocated in his recent blog post, then we need to ensure that every Nullable object contains a value that can be accessed safely, even when that Nullable is marked as isnull.

@StefanKarpinski
Member

I've been meaning to read that post (and reply to the email you sent me about it). I will do so and reply here :-)

@johnmyleswhite
Member

I figured this would be an effective way to goad you into doing so.

@johnmyleswhite
Member

@nalimilan If you write up some code that implements "an easy way to skip missing values", I'm happy to merge it into DataFrames or any other JuliaStats package. But I suspect you'll find it's even harder than adding sane tools for working with rounding modes to Base Julia.

@nalimilan
Member

@johnmyleswhite As I said elsewhere, I suspect this will in the end require a separate type whose behavior would be to have skipna = true by default in dedicated functions. I don't think that's a technical challenge, but maybe I'm missing something. OTOH it would indeed require redefining all functions (like DataArrays.jl does) to support this; other functions would only work after missing values have been removed.

As regards Evan Miller's approach, it could be interesting to try, but I don't see how allocating a vector of weights could be practical or efficient. What could be done is multiplying values by 1 when present and by 0 when missing (i.e. by the negation of the bit mask), without allocating anything, if that's indeed shown to really help vectorization. As you said, that would add the API requirement that values can always be accessed.

But I don't get the interest of his two-layer idea: a way of viewing the same data with two different bitmasks might be useful, but it doesn't need to be supported by default by NullableArrays, eating memory and adding complexity even for people who don't need it. Allowing the underlying data to be shared, with only different bitmasks, would be enough. Same for frequency weights: better to handle them in the context of DataFrames. Finally, as regards storing the kind of missingness (e.g. refused to reply, away, ...), I think it would also be an extension of Nullable: it would implement the same API, considering all missings as NULL, but would also offer a way to get the kind of missingness (stored separately).

@johnmyleswhite
Member

I believe Evan's vision is that the memory cost of frequency weights is low, but that they allow you to write branch-free, SIMD-friendly code.

I agree that the easiest way to handle the two kinds of missingness problem would be to use two layers.

Regarding multiple kinds of missingness, I think that I'm not involved with the survey world enough to see the value in N kinds of missingness, given that I can easily imagine needing N + 1 kinds of missingness -- which I would tend to handle via categorical variables and some well-placed calls to filter.

@simonster
Member

Re: Evan Miller's approach, I'm not sure the frequency weights would have to be anything besides an AbstractArray that wraps the NA bitmask. I haven't benchmarked it, but LLVM can (mostly) vectorize:

```
immutable BitWrapper
    x::BitVector
end

# Negated mask: true (weight 1) where the value is present, false (0) where NA
Base.getindex(x::BitWrapper, i) = !Base.unsafe_bitgetindex(x.x.chunks, i)

function f(x::Vector{Float64}, y::BitWrapper)
    z = zero(eltype(x))
    @simd for i = 1:length(x)
        # Bool * Float64: false is a strong zero, so NA slots contribute 0.0
        @inbounds z += y[i]*x[i]
    end
    z
end
```

Note that because of the way we define *(Bool, Float64), NaN*false == 0.0, which completely sidesteps the NaN issue Evan raised and still doesn't need a branch. (Although we might still want to redefine * here for efficiency because we don't care if -1.0*false === 0.0 rather than -0.0.)

@nalimilan
Member

> Regarding multiple kinds of missingness, I think that I'm not involved with the survey world enough to see the value in N kinds of missingness, given that I can easily imagine needing N + 1 kinds of missingness -- which I would tend to handle via categorical variables and some well-placed calls to filter.

@johnmyleswhite I don't think that's an absolutely required feature for handling survey data, as can be seen from the fact that R does not support it and is still relatively successful in that area. But the idea is that you want to preserve the information about the cause of missingness (which comes as a level of a categorical variable in the data files provided by survey institutes), and yet skip/hide these values by default in frequency tables, regressions, etc. This shows that software in this field considers the default behavior important enough to support.

Then it also offers a way of distinguishing several kinds of missingness for numeric variables. This could certainly be handled via a separate categorical variable, but for consistency software generally supports the same pattern as for categorical variables.

@simonster Yeah, that's what I meant.

@johnmyleswhite
Member

@nalimilan It seems easiest to handle that problem using a MissingArray type that augments an AbstractArray (which might be a NullableArray or might not) with a treat_as_missing field. Then we don't need to worry about categorical variables or numerical variables.

We can certainly treat the bit array as degenerate frequency weights, but, for the more general case of statistical modeling, we need to actually support real frequency weights. This is something that R does somewhat haphazardly, which makes working with massive data sets quite difficult -- you can usually easily fit a weighted model, but it's much harder to get statistics back out that correctly reflect the weighting.

@nalimilan
Member

> @nalimilan It seems easiest to handle that problem using a MissingArray type that augments an AbstractArray (which might be a NullableArray or might not) with a treat_as_missing field. Then we don't need to worry about categorical variables or numerical variables.

Well, I'm not sure what you mean exactly, but that's really not my number one priority. I'd say we'd better get a single kind of missingness working first.

> We can certainly treat the bit array as degenerate frequency weights, but, for the more general case of statistical modeling, we need to actually support real frequency weights. This is something that R does somewhat haphazardly, which makes working with massive data sets quite difficult -- you can usually easily fit a weighted model, but it's much harder to get statistics back out that correctly reflect the weighting.

Agreed, we really need something more consistent than R. I was thinking one of the columns of a DataFrame could be marked as holding the weights. Methods would take a DataFrame and symbols naming columns to operate on, and the weights would be another argument with that column as a default: wmean(df, :x) would be equivalent to wmean(df[:x], weights = weights(df)).

In that scheme, missingness would be completely orthogonal to weights, except that the algorithm could multiply the weights by the BitArray.

@johnmyleswhite
Member

I mean that you can have something like MissingArray(values = ["Agree", "Not sure", "Didn't answer"], treat_as_missing = [false, true, true]). In this semantics, you assume that the values array never contains an invalid bit pattern (like you'd get from malloc or calloc when using DataArray).

I'm not really fond of wmean(df, :x) since that makes DataFrames too magical. I'd much prefer that we just try to ensure that every function has a variant that takes weights as an argument and does sane things with those weights.

@simonster
Member

I did some benchmarks for simple Float64 summation skipping missing values in JuliaStats/DataArrays.jl#133. The benchmarks indicate that memory access is the bottleneck. The NaN sentinel approach has basically zero overhead, whereas with missingness represented as a Vector{Bool} the benchmark takes about 9/8 as long, and with a Vector{Float64} it takes about 2x as long.

The BitVector approach currently used by DataArrays is many times slower than sentinels without some work, but throwing a lot of tweaks at it makes it ~25% slower than the sentinel approach (not much slower than a Vector{Bool}), and even then I don't think we've hit the limit of what the hardware could do since LLVM isn't vectorizing the loads. Unfortunately it takes some work to get that performance with BitVectors, whereas Vector{Bool} is fast "by default."
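
The two leading strategies in those benchmarks look roughly like this (a sketch, not the actual benchmark code from the linked issue):

```
# NaN-sentinel sum: no extra memory traffic; NaN != NaN detects the missings
function sum_sentinel(x::Vector{Float64})
    s = 0.0
    @simd for i = 1:length(x)
        @inbounds v = x[i]
        s += ifelse(v == v, v, 0.0)
    end
    s
end

# Vector{Bool} mask: one extra byte read per element (~9/8 the runtime above)
function sum_masked(x::Vector{Float64}, na::Vector{Bool})
    s = 0.0
    @simd for i = 1:length(x)
        @inbounds s += ifelse(na[i], 0.0, x[i])
    end
    s
end
```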

@mrocklin
Author

Dot products or similar operations might show different results.

@simonster
Member

@mrocklin A dot product is twice as many operations and accesses twice as much memory, so I'd expect it to have roughly similar performance characteristics. At least for Vector{Bool} vs. the sentinel approach, I can confirm that this is true; Vector{Bool} adds about ~15% overhead in both cases.

@mrocklin
Author

Sorry, by dot product I meant matrix-matrix multiplication, or more generally tensor-tensor contraction, or more generally still, something with higher FLOP/byte intensity.

@ViralBShah
Member

@simonster I was wondering about the performance implications of the different methods. Thanks.

@nalimilan
Member

@mrocklin Is matrix multiplication really a useful operation when some elements are missing? The missing elements are going to poison a large part of the result.

@simonster
Member

If missing values are to be propagated, determining which rows/columns are corrupted is O(n^2) whereas the multiplication itself is O(n^3), so asymptotically there should be no difference between methods.

If missing values are to be ignored, then the complexity of matrix multiplication is still that of n^2 pairwise O(n) dot products, but the memory access pattern is different. Actually establishing that any differences are due to storage and not implementation might be difficult: my impression is that a lot of work goes into optimizing gemm, and the pure Julia implementation in generic_matmatmul is about 20x slower on 1000 x 1000 matrices on my system.

@johnmyleswhite
Member

I think Simon's analysis is dead on.

It would be great to get a set of benchmarks that describe which operations we want to make fast. Computing something like mean(x::NullableArray) can exploit popcount instructions that might make a BitArray implementation faster than a Vector{Bool}, whereas the mere operation of indexing into a BitArray is costly relative to Vector{Bool} for some other operations. That said, as I claimed in JuliaStats/DataArrays.jl#133, I think we're best off going with Vector{Bool} for now and trying to erect enough of an abstraction barrier that we could switch things in the future if using a BitArray turns out to be better.

That said, I think the goal of this PR wasn't performance: I think it was having missing data stored in a format that can be passed to R. I think that is something worth doing well (and this PR is a great start), but I'd prefer to view the R sentinel values format as a serialization format that we can spit out, rather than our native data format.

Another question raised by Milan is where you draw the line and stop implementing functions on NullableArray objects. I think matrix multiplication should be implemented, but I think we shouldn't bother with things like QR decompositions. Getting that line right is really hard. Part of the reason that Nullable pushes you to use get everywhere is to resolve this question with a simple, hardline answer: nothing should ever be implemented on Nullable objects. Don't try to define digamma on Nullable, only to realize you forgot polygamma. For NullableArray objects, I'd say that we should keep no more than what DataArray objects support now -- and possibly a good deal less.

@mrocklin
Author

I was just trying to find a proxy operation with higher compute intensity. Judging performance on linear-time algorithms is likely to be mostly a judgment of compactness in memory.

@simonster
Member

My intuition is that, with higher compute intensity, the method of encoding missing values should matter less. At some point it becomes fastest just to branch on missingness.

@simonster
Member

@mrocklin I agree that we need to test more operations, but I think it's mostly about access patterns; e.g., summing matrix rows with BitArray missingness totally blows. Feel free to give input (or code) for representative benchmarks at JuliaStats/DataArrays.jl#133.

@quinnj
Member

quinnj commented Apr 28, 2016

I think we should close this. There was quite a lot of good discussion, but it seems that actual development has gone the direction of Nullable and the newer NullableArrays package. This, of course, doesn't preclude the option of someone creating a Sentinels.jl package to explore the option of using sentinel values for representing missingness; that kind of package would be readily accepted into the METADATA.jl repo.

@quinnj closed this Apr 28, 2016
@mrocklin
Author

Closing seems appropriate to me. Thanks for the chat.

@nalimilan added the "missing data" (Base.missing and related functionality) label Jan 15, 2018