Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The constructor DataType::Decimal(usize, usize) is unvalidated #2362

Open
HaoYang670 opened this issue Aug 8, 2022 · 14 comments
Open

The constructor DataType::Decimal(usize, usize) is unvalidated #2362

HaoYang670 opened this issue Aug 8, 2022 · 14 comments
Labels

Comments

@HaoYang670
Copy link
Contributor

HaoYang670 commented Aug 8, 2022

Describe the bug
Also related to the Decimal. Currently, the constructor DataType::Decimal128/256(precision: usize, scale: usize) is unsound, because users can put any value in it. Also this leads to 2 kinds of bug in the code:
1. forget to check the value of precision and scale
2. redundant checking.

Expected behavior
I’d like to eliminate this unsoundness by using stronger type, which means create a new type for precision and scale:

struct DecimalInfo{
    precision: usize,
    scale: usize,
}

impl DecimalInfo{
    fn new(precision: usize, scale: usize) -> Self {
        assert (precision <= max_precision);
        assert (scale <= max_precision);
        assert (scale <= precision);
        Self {precision, scale}
    }
    fn get_precision {...}
    fn get_scale {...}
    fn set_scale {...}
    fn set_precision {...}
}

enum DataType {
    Decimal128(DecimalInfo),
    ...
}

Additional context

@HaoYang670 HaoYang670 added the bug label Aug 8, 2022
@HaoYang670
Copy link
Contributor Author

cc @alamb @viirya @tustvold @liukun4515

@liukun4515
Copy link
Contributor

I want to go through the code from c++ or java version, and comment this issue.

@liukun4515
Copy link
Contributor

I want to go through the code from c++ or java version, and comment this issue.

Because this changes will cause some api changes.

@viirya
Copy link
Member

viirya commented Aug 8, 2022

C++ Decimal type has such check, e.g. Decimal128Type: https://github.com/apache/arrow/blob/6cc37cf2d1ba72c46b64fbc7ac499bd0d7296d20/cpp/src/arrow/type.cc#L870-L874

But I am not sure if we need to have the check. For unsafe paths, we run value validation which includes such check. For safe paths, I think it assumes the inputs are all valid (including decimal type). You can mess the array with either wrong type or input in safe paths.

It appears to me that's a bit verbose when constructing a decimal type with a struct or matching a decimal type. So I personally feel that it has a little advantage but also has some more disadvantage.

I don't feel strong against it, though.

@HaoYang670
Copy link
Contributor Author

HaoYang670 commented Aug 8, 2022

For unsafe paths, we run value validation which includes such check.

Here the problem is that we may do redundant validation.

For safe paths, I think it assumes the inputs are all valid (including decimal type).

That's the concern. safe means soundness. However DataType::Decimal(usize, usize) is unsound because it allows you to do False things without compile (or runtime) error. For example Decimal(10000000000, 10000000000) is invalid but you can do this.

@viirya
Copy link
Member

viirya commented Aug 8, 2022

Here the problem is that we may do redundant validation.

For unsafe paths, I think we always do validation, no? Not sure if we can remove validation on unsafe paths. That is why they are "unsafe". I think the point of removing redundant validation is on the paths we think it is safe to do.

That's the concern. safe means soundness. However DataType::Decimal(usize, usize) is unsound because it allows you to do False things without compile (or runtime) error. For example Decimal(10000000000, 10000000000) is invalid but you can do this.

I think that you can also do similar thing with safe paths, not only for this decimal type constructor. As we don't validate (not just for decimal arrays) values in safe paths, you can put some unreasonable input there and cause unexpected result. As it is safe paths, I think that the responsibility is on users hands.

@HaoYang670
Copy link
Contributor Author

HaoYang670 commented Aug 8, 2022

As it is safe paths, I think that the responsibility is on users hands.

Sorry, I can't 100% agree with you.

For safe (soundness) path, the responsibility in on the software side. Software should make sure all things it allows users to do are valid, otherwise, it is unsound.

For unsafe (completeness) path, the responsibility is on users hands, however. That is why we often see code like this in our crate:

// Soundness: users should make sure ...
unsafe {fn xxx_unchecked()}

@HaoYang670
Copy link
Contributor Author

I guess @tustvold has more concern on the soundness of the code.

@viirya
Copy link
Member

viirya commented Aug 8, 2022

Ah, yea, that's right. We did validation on safe paths, no unsafe paths...I thought "safe" and "unsafe" reversely. Please swap "safe" and "unsafe" in my previous comments.

(Updated)

But I am not sure if we need to have the check. For safe paths, we run value validation which includes such check. For unsafe paths, I think it assumes the inputs are all valid (including decimal type). You can mess the array with either wrong type or input in unsafe paths.

@HaoYang670
Copy link
Contributor Author

Ha, we reach a consensus! 😁
That's why I say DataType::Decimal(usize, usize) is unsound.

@viirya
Copy link
Member

viirya commented Aug 8, 2022

I agree that the current decimal types constructors have the mentioned issue. It can be used to construct invalid decimal types.

My point above is, we have certain checks for it during performing value validation, in the places we want to make sure the data is valid (safe paths, 😄 ). So it will be caught nowadays.

For the unsafe paths, we have many ways to mess the inputs, not just with the type constructor. So I don't think it is an issue there. Users should know what they're doing with such APIs.

The only holes, I guess, are where we don't do validation so invalid decimal type is not caught. But we also don't want to run real decimal value validation if we think it is computation heavy.

Alternatively, we can do only precision check on such places, instead of full value validation.

@viirya
Copy link
Member

viirya commented Aug 9, 2022

I don't feel strong against it, though.

BTW, as mentioned above, I'm okay for the proposed change if most of you think it is good to go.

@HaoYang670
Copy link
Contributor Author

Ok. I will file a PR and then we can have more discussion.

@alamb
Copy link
Contributor

alamb commented Aug 9, 2022

My point above is, we have certain checks for it during performing value validation, in the places we want to make sure the data is valid (safe paths, 😄 ). So it will be caught nowadays.

I agree with @viirya

I think perhaps we were using different definitions of "unsound" . I was erroneously thinking it meant "unsafe" in the classic rust sense that doing so would allow reading/writing uninitialized memory, reading/writing out of bounds of the allocated memory, or data races.

It doesn't mean that user's can't provide bad inputs that result in Err or panic!, though I think many rust libraries try and do thorough error checking and uphold strong invariants

I think it is a good idea to make the arrow library be better about error checking in general as long as it isn't too cumbersome to use. Thank you @HaoYang670 for starting this conversation

@tustvold tustvold changed the title The constructor DataType::Decimal(usize, usize) is unsound. The constructor DataType::Decimal(usize, usize) is unvalidated Jun 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants