Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[proposition] mask-based implementation for missing values #441

Open
tnielens opened this issue Jun 7, 2022 · 3 comments
Open

[proposition] mask-based implementation for missing values #441

tnielens opened this issue Jun 7, 2022 · 3 comments

Comments

@tnielens
Copy link
Collaborator

tnielens commented Jun 7, 2022

Currently, saddle uses one value of each primitive type to represent NA. For floating point numbers, this is straightforward as they already include such a value. For other types (Boolean, Byte, Int, etc), it isn't straightforward and an arbitrary value must be used. Currently the minimum value is used (Byte.MinValue, Short.MinValue, etc).

I think this approach has important drawbacks:

  • users are unlikely to know about that encoding and could use the min values. That leads to surprising behavior.
  • operations resulting in the MinValue would result in a missing value.
  • the binary operations on collections lose in simplicity. Implementation such as if (tag.isMissing(v1)) v1 else v1 + 2 might prevent loop optimizations of the jvm to kick-in.
  • the .raw(i)-like api exposes unnecessary complexity to the user.

An alternative approach would be to use a mask-based implementation for the integer-based Vec[T]s. That is, the vector stores a companion Array[Boolean] indicating missing value. This approach is used by pandas.

@pityka
Copy link
Owner

pityka commented Jun 7, 2022

Some questions arise:

  • would the binary operations execute the op in the missing positions (to omit the branching) then union the mask?
  • what would be the return type of the method which extracts a value? (Now raw )
  • is the mask sparse or dense?

At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.

@tnielens
Copy link
Collaborator Author

tnielens commented Jun 8, 2022

would the binary operations execute the op in the missing positions (to omit the branching) then union the mask?

That's an approach indeed. So as to make sure to enable vectorization. To be validated with benchmarking. Also, the vector could have a field hasNAs. If the field is false, the whole NA processing could be skipped.

what would be the return type of the method which extracts a value? (Now raw )

There are two variants if I understand correctly. The boxed one returning Scalar. That one would remain untouched. The raw one could also remain untouched as well with the note that if the corresponding position in the vector is a NA, the value returned has no meaning and could be anything.

is the mask sparse or dense?

I'd start with dense for simplicity? also along the hasNAs idea above, the bool array could not be allocated if the constructor ensures there is no NA values. That said, since Vec is a mutable structure, that might not be good idea.

At the moment saddle provides two copies from many operations (eg max, max2) one omitting the missingness check the other respecting missigness. All code in linalg is oblivious to missingness.

These aspects of saddle are surprising imo. I assumed these 2 implementations were semantically equivalent.

@pityka
Copy link
Owner

pityka commented Jun 9, 2022

I see. I think I concur with most of these.

My only remaining concern is that I believe there should be a way to extract a Double from a Vec[Double] without going through Scalar, except if we are sure that the VM eliminates the allocations the Scalar on branch less code path (e.g. vec.apply(0).get). This is to enable tight loops on a Vec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants